SamplingParams class

Reference for the SamplingParams class, default sampling parameters from generation_config.json, and token generation examples.

class SamplingParams

Sampling parameters for text generation.

param self

param n: int = 1

Number of output sequences to return for the given prompt.

param best_of: int | None = None

Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.

param repetition_penalty: float = 1.0

Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
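The effect of the penalty can be illustrated with a minimal sketch. It assumes the rule used by Hugging Face Transformers (a positive logit is divided by the penalty, a negative logit is multiplied by it); this reference does not confirm that Furiosa-LLM applies exactly this formula, and `apply_repetition_penalty` is a hypothetical helper, not part of the SDK.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.0):
    """Return a copy of `logits` with already-seen tokens penalized.

    Assumption: Hugging Face style rule, not confirmed for Furiosa-LLM.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        # With penalty > 1, dividing a positive logit or multiplying a
        # negative one both lower the token's probability.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 appeared in the prompt or in the generated text so far.
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=2.0)
```

With `penalty=1.0` the logits are returned unchanged, which matches the documented default.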

param temperature: float = 1.0

Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
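A minimal sketch of temperature scaling, assuming the standard approach of dividing the logits by the temperature before the softmax. `softmax_with_temperature` is a hypothetical helper written for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by `temperature`."""
    if temperature == 0:
        # Greedy case: all probability mass goes to the arg-max token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # more deterministic
flat = softmax_with_temperature(logits, 2.0)   # more random
greedy = softmax_with_temperature(logits, 0)   # always the top token
```

Lower temperatures concentrate probability on the most likely token, and higher temperatures spread it across the vocabulary.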

param top_p: float = 1.0

Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

param top_k: int = -1

Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.

param min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
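The three candidate filters (top_k, top_p, and min_p) can be sketched together. This is an illustration under stated assumptions: the order in which the filters are applied and the absence of renormalization between them are choices made for this sketch, not confirmed behavior of Furiosa-LLM, and `filter_candidates` is a hypothetical helper.

```python
def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the token ids that survive top-k, top-p and min-p filtering."""
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k != -1:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok in ranked:
            kept.append(tok)
            cumulative += probs[tok]
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept

    if min_p > 0.0:
        # The threshold is relative to the most likely token.
        threshold = min_p * probs[ranked[0]]
        ranked = [tok for tok in ranked if probs[tok] >= threshold]

    return ranked


probs = [0.5, 0.3, 0.15, 0.05]
everything = filter_candidates(probs)           # defaults disable all filters
two_best = filter_candidates(probs, top_k=2)
nucleus = filter_candidates(probs, top_p=0.7)
relative = filter_candidates(probs, min_p=0.2)  # keeps tokens with p >= 0.1
```

The default values (`top_k=-1`, `top_p=1.0`, `min_p=0.0`) leave every token eligible, which matches the "consider all tokens" wording above.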

param stop_token_ids: list[int] | None = None

Token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.

param ignore_eos: bool = False

Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

param use_beam_search: bool = False

Whether to use beam search instead of sampling.

param length_penalty: float = 1.0

Float that penalizes sequences based on their length. Used in beam search.
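One common way to apply a length penalty is sketched below. It assumes the length-normalized score used by Hugging Face Transformers beam search (cumulative log probability divided by the sequence length raised to length_penalty); this reference does not state the formula Furiosa-LLM uses, and `beam_score` is a hypothetical helper.

```python
def beam_score(cumulative_logprob, num_tokens, length_penalty=1.0):
    """Score a beam candidate; higher is better.

    Assumption: Hugging Face style normalization, not confirmed here.
    """
    return cumulative_logprob / (num_tokens ** length_penalty)


# A short candidate and a longer one with a lower total log probability.
short_default = beam_score(-4.0, num_tokens=2)                 # -2.0
long_default = beam_score(-6.0, num_tokens=4)                  # -1.5
short_raw = beam_score(-4.0, num_tokens=2, length_penalty=0.0) # -4.0
long_raw = beam_score(-6.0, num_tokens=4, length_penalty=0.0)  # -6.0
```

Under this formula, `length_penalty=1.0` ranks the longer candidate higher, while `length_penalty=0.0` compares raw log probabilities and ranks the shorter one higher.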

param early_stopping: bool | str = False

Controls the stopping condition for beam search. It accepts the following values: True, where generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and generation stops when it is very unlikely that better candidates will be found; "never", where the beam search procedure stops only when there cannot be better candidates (the canonical beam search algorithm).

param max_tokens: int | None = 16

Maximum number of tokens to generate per output sequence. If the value is None, it is capped to the maximum sequence length.

param min_tokens: int = 0

Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated.

param skip_special_tokens: bool = True

Whether to skip special tokens in the output.

param logprobs: int | None = None

Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.
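The "up to logprobs+1 elements" rule can be illustrated with a small sketch. `token_logprobs` is a hypothetical helper that works on a plain list of logits; it is not the SDK's implementation or its output format.

```python
import math


def token_logprobs(logits, sampled_token_id, logprobs):
    """Return {token_id: logprob} for the top `logprobs` tokens plus the sampled one."""
    peak = max(logits)
    # log-sum-exp, computed stably, normalizes logits into log probabilities.
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    all_logprobs = [x - log_total for x in logits]

    ranked = sorted(range(len(logits)), key=lambda i: all_logprobs[i], reverse=True)
    result = {tok: all_logprobs[tok] for tok in ranked[:logprobs]}
    # The sampled token is always reported, even when it is not in the top set.
    result[sampled_token_id] = all_logprobs[sampled_token_id]
    return result


logits = [3.0, 2.0, 1.0, 0.0]
outside_top = token_logprobs(logits, sampled_token_id=3, logprobs=2)
inside_top = token_logprobs(logits, sampled_token_id=0, logprobs=2)
```

When the sampled token is outside the top set, the result holds `logprobs + 1` entries; when it is already among the most likely tokens, it holds exactly `logprobs` entries.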

param prompt_logprobs: int | None = None

Number of log probabilities to return per prompt token. When set to None (default), no prompt logprobs are returned. When set to a non-negative integer, returns the top-k log probabilities for each prompt token position, plus the actual token's logprob. Set to -1 to return log probabilities for all vocabulary tokens. Warning: Using -1 can cause significant memory and network overhead as it returns logprobs for the entire vocabulary (e.g., ~150K tokens for Qwen models) at each prompt position.

param output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE

param structured_outputs: StructuredOutputsParams | None = None

Attributes

attribute n = n

attribute best_of = best_of if best_of is not None else n

attribute repetition_penalty = repetition_penalty

attribute temperature = temperature

attribute top_p = top_p

attribute top_k = top_k

attribute min_p = min_p

attribute stop_token_ids = stop_token_ids

attribute ignore_eos = ignore_eos

attribute use_beam_search = use_beam_search

attribute length_penalty = length_penalty

attribute early_stopping = early_stopping

attribute max_tokens = max_tokens

attribute min_tokens = min_tokens

attribute skip_special_tokens = skip_special_tokens

attribute logprobs = 1 if logprobs is True else logprobs

attribute prompt_logprobs = prompt_logprobs

attribute output_kind = output_kind

attribute structured_outputs = structured_outputs

attribute sampling_type: SamplingType

Methods

method from_optional(cls, *, n=None, best_of=None, repetition_penalty=1.0, temperature=None, top_p=None, top_k=None, min_p=0.0, stop_token_ids=None, ignore_eos=None, use_beam_search=None, length_penalty=None, early_stopping=None, max_tokens=None, min_tokens=None, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, output_kind=None, structured_outputs=None) -> SamplingParams

param cls

param n: int | None = None

param best_of: int | None = None

param repetition_penalty: float | None = 1.0

param temperature: float | None = None

param top_p: float | None = None

param top_k: int | None = None

param min_p: float = 0.0

param stop_token_ids: list[int] | None = None

param ignore_eos: bool | None = None

param use_beam_search: bool | None = None

param length_penalty: float | None = None

param early_stopping: bool | str | None = None

param max_tokens: int | None = None

param min_tokens: int | None = None

param skip_special_tokens: bool = True

param logprobs: int | None = None

param prompt_logprobs: int | None = None

param output_kind: RequestOutputKind | None = None

param structured_outputs: StructuredOutputsParams | None = None

Returns

furiosa_llm.sampling_params.SamplingParams

method clone(self) -> SamplingParams

param self

Returns

furiosa_llm.sampling_params.SamplingParams

method structured_outputs_enabled(self)

param self

Returns

None

Default Sampling Parameters from `generation_config.json`

If a model artifact contains a generation_config.json file, Furiosa-LLM uses the values in that file as the effective defaults for the fields listed below. The file is copied verbatim from the source Hugging Face model during artifact build — Furiosa-LLM does not customize it. If the file is absent, the plain SamplingParams() defaults apply.

The following generation_config.json keys are honored, and are mapped to SamplingParams fields as shown:

| `generation_config.json` key | `SamplingParams` field |
| --- | --- |
| `repetition_penalty` | `repetition_penalty` |
| `temperature` | `temperature` |
| `top_k` | `top_k` |
| `top_p` | `top_p` |
| `min_p` | `min_p` |
| `max_new_tokens` | `max_tokens` |
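The mapping above can be sketched as a small translation function. This is an illustration only: `sampling_defaults_from_generation_config` is a hypothetical helper, and the example values in `generation_config` are made up rather than taken from any particular model.

```python
# Key in generation_config.json -> SamplingParams field, as listed above.
GENERATION_CONFIG_TO_SAMPLING_PARAMS = {
    "repetition_penalty": "repetition_penalty",
    "temperature": "temperature",
    "top_k": "top_k",
    "top_p": "top_p",
    "min_p": "min_p",
    "max_new_tokens": "max_tokens",
}


def sampling_defaults_from_generation_config(generation_config):
    """Translate the honored keys into SamplingParams keyword arguments."""
    return {
        field: generation_config[key]
        for key, field in GENERATION_CONFIG_TO_SAMPLING_PARAMS.items()
        if key in generation_config
    }


# Made-up file contents; keys outside the mapping (here do_sample) are ignored.
generation_config = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "max_new_tokens": 512,
}
defaults = sampling_defaults_from_generation_config(generation_config)
```

Only `max_new_tokens` changes name; every other honored key maps to the field of the same name.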

Offline API (`LLM.generate` / `LLM.chat` / `LLM.stream_generate`)

The offline API uses a binary decision based on whether the caller passes sampling_params:

- sampling_params=None (the default): the model's generation_config.json defaults are applied. See get_default_sampling_params.
- sampling_params=SamplingParams(...): the user's object is used as-is. No per-field merge with the model's generation_config.json is performed.
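The all-or-nothing behavior can be modeled with a short sketch. `Params` is a hypothetical stand-in for SamplingParams with only three fields, `resolve_offline_params` is a hypothetical helper, and the model defaults are made-up values; none of this is the SDK's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Params:
    """Hypothetical stand-in for SamplingParams with three of its fields."""
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: Optional[int] = 16


def resolve_offline_params(sampling_params, generation_config_defaults):
    """Model the binary choice made by the offline API."""
    if sampling_params is None:
        # No object passed: start from the model's generation_config.json values.
        return Params(**generation_config_defaults)
    # Object passed: returned untouched, with no per-field merge.
    return sampling_params


# Made-up defaults, already translated to SamplingParams field names.
model_defaults = {"temperature": 0.6, "top_p": 0.95}
from_model = resolve_offline_params(None, model_defaults)
from_user = resolve_offline_params(Params(max_tokens=100), model_defaults)
```

In this model, a caller who passes an object that sets only `max_tokens` gets `temperature=1.0` rather than the model's 0.6, so any generation_config.json value that should still apply has to be set explicitly on the object.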

OpenAI-compatible server

The Chat Completions, Completions, and Responses endpoints resolve each sampling-related field in three tiers:

1. The value specified in the request body, if set.
2. The value from the model's generation_config.json, if present.
3. The API default shown in the endpoint's parameter table (see Chat API (/v1/chat/completions), Completions API (/v1/completions), and API Reference).
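The three-tier lookup can be sketched per field as follows. `resolve_field` is a hypothetical helper, the dictionaries hold made-up values already expressed as SamplingParams field names, and treating a missing or null request value as "not set" is an assumption of this sketch.

```python
def resolve_field(name, request_body, generation_config, api_defaults):
    """Resolve one sampling field: request, then model config, then API default."""
    if request_body.get(name) is not None:  # tier 1: set in the request body
        return request_body[name]
    if name in generation_config:           # tier 2: model's generation_config.json
        return generation_config[name]
    return api_defaults[name]               # tier 3: endpoint default


request_body = {"temperature": 0.2}
generation_config = {"temperature": 0.6, "top_p": 0.95}
api_defaults = {"temperature": 1.0, "top_p": 1.0, "top_k": -1}

temperature = resolve_field("temperature", request_body, generation_config, api_defaults)
top_p = resolve_field("top_p", request_body, generation_config, api_defaults)
top_k = resolve_field("top_k", request_body, generation_config, api_defaults)
```

Unlike the offline API, each field is resolved independently, so a request that sets only `temperature` still picks up the model's `top_p`.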

Examples

This section provides examples of how to use the token generation methods available in the SDK.

1. Basic Greedy Search

```python
from furiosa_llm.sampling_params import SamplingParams

params = SamplingParams(min_tokens=10, max_tokens=100)
```

The Basic Greedy Search method generates a sequence of tokens, ensuring that at least min_tokens and up to max_tokens are produced.

Parameters:
- min_tokens: Minimum number of tokens to generate.
- max_tokens: Maximum number of tokens to generate.
Behavior:
- Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.
- The EOS token will not be generated before reaching the specified min_tokens.
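The interaction between min_tokens, max_tokens, and EOS can be illustrated with a toy decoding loop. This is a sketch under assumptions: `greedy_decode` and the toy models are hypothetical, and suppressing EOS by masking its logit is one common technique, not a confirmed detail of Furiosa-LLM.

```python
def greedy_decode(next_logits, eos_token_id, min_tokens, max_tokens):
    """Pick the arg-max token each step, honoring min_tokens and max_tokens."""
    tokens = []
    while len(tokens) < max_tokens:
        logits = list(next_logits(tokens))
        if len(tokens) < min_tokens:
            # EOS cannot be chosen until min_tokens tokens exist.
            logits[eos_token_id] = float("-inf")
        tok = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(tok)
        if tok == eos_token_id:
            break  # may stop well before max_tokens
    return tokens


EOS = 0


def eager_model(tokens):
    """Toy model that always prefers EOS (token 0)."""
    return [5.0, 1.0, 0.5]


def endless_model(tokens):
    """Toy model that never prefers EOS."""
    return [0.0, 1.0, 0.5]


stopped_early = greedy_decode(eager_model, EOS, min_tokens=3, max_tokens=10)
hit_limit = greedy_decode(endless_model, EOS, min_tokens=3, max_tokens=5)
```

The first call emits three non-EOS tokens before EOS is allowed and then stops; the second never sees EOS and stops at `max_tokens`.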

2. Random Sampling with `top_p` / `top_k` Parameters

```python
from furiosa_llm.sampling_params import SamplingParams

params = SamplingParams(min_tokens=10, max_tokens=100, top_p=0.3, top_k=100)
```

This method uses random sampling techniques for token generation, allowing for diverse outputs.

Parameters:
- min_tokens: Minimum number of tokens to generate.
- max_tokens: Maximum number of tokens to generate.
- top_p: Cumulative probability for nucleus sampling.
- top_k: Number of highest probability tokens to consider.
Behavior:
- Each generation may yield different results, even with the same input text and parameters, enhancing variability.
- Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.
- The EOS token will not be generated before reaching the specified min_tokens.
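The run-to-run variability comes from drawing each token at random in proportion to its probability. A minimal sketch, using a made-up three-token distribution and a hypothetical `sample_token` helper:

```python
import random


def sample_token(probs, rng):
    """Draw one token id at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.2]
rng = random.Random(0)  # seeded only so that this sketch is reproducible
draws = [sample_token(probs, rng) for _ in range(200)]
distinct_tokens = set(draws)
```

Repeated draws from the same distribution return different tokens, which is why two generations with identical input and parameters can diverge.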

3. Beam Search with `best_of` Beams

```python
from furiosa_llm.sampling_params import SamplingParams

params = SamplingParams(min_tokens=10, max_tokens=100, use_beam_search=True, best_of=4)
```

Beam Search enhances the generation process by exploring multiple sequences simultaneously.

Parameters:
- min_tokens: Minimum number of tokens to generate.
- max_tokens: Maximum number of tokens to generate.
- use_beam_search: Must be set to True to enable beam search.
- best_of: Number of beams to consider for generating the best output.
Behavior:
- The generation process explores multiple possible sequences to determine the best output.
- Generation may terminate before reaching max_tokens if the number of End Of Sequence (EOS) tokens generated across all beams reaches the best_of count.
- The EOS token will not be generated before reaching the specified min_tokens.
