SamplingParams class

Reference for the SamplingParams class, default sampling parameters from generation_config.json, and token generation examples.

class SamplingParams

Sampling parameters for text generation.

param self
param n: int = 1

Number of output sequences to return for the given prompt.

param best_of: int | None = None

Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.

param repetition_penalty: float = 1.0

Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
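
The description above does not give the exact formula. The following is a minimal sketch, assuming the widely used rule in which the logit of an already-seen token is divided by the penalty when positive and multiplied by it when negative. Whether Furiosa-LLM applies exactly this rule is an assumption, and `apply_repetition_penalty` is a hypothetical helper, not part of the SDK.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of `logits` with already-seen tokens penalized.

    Assumed rule (not confirmed for Furiosa-LLM): divide positive logits by
    `penalty` and multiply negative logits by it.
    """
    out = list(logits)
    for tid in set(seen_token_ids):
        # With penalty > 1 both branches lower the score of a seen token;
        # with penalty < 1 both branches raise it.
        out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
    return out


logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 appeared in the prompt or in the generated text so far.
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=2.0)
```

With `penalty=1.0` the logits are returned unchanged, which matches the documented default.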

param temperature: float = 1.0

Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
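
To illustrate the effect described above, here is a small sketch of temperature scaling applied to a softmax over logits. `softmax_with_temperature` is a hypothetical helper written for this page; treating a temperature of zero as an argmax follows the "greedy sampling" statement above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature == 0:
        # Zero means greedy: put all probability mass on the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 2.0]
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution
```

Here `cold` concentrates more probability on the top token than `hot` does.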

param top_p: float = 1.0

Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

param top_k: int = -1

Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.

param min_p: float = 0.0

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
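
The three filters described above (top_k, top_p, min_p) can be sketched on a toy probability distribution. This is an illustration only: the order in which the filters are applied, and the fact that probabilities are not renormalized between steps, are simplifying assumptions, and `filter_candidates` is a hypothetical helper.

```python
def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return the token ids that survive top_k, top_p and min_p filtering."""
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most likely tokens (-1 keeps all of them).
    if top_k != -1:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tid in ranked:
        kept.append(tid)
        cumulative += probs[tid]
        if cumulative >= top_p:
            break

    # min_p: drop tokens whose probability is below min_p times the top probability.
    threshold = min_p * probs[ranked[0]]
    return [tid for tid in kept if probs[tid] >= threshold]


probs = [0.5, 0.3, 0.15, 0.05]
all_tokens = filter_candidates(probs)            # defaults keep every token
by_top_k = filter_candidates(probs, top_k=2)     # two most likely tokens
by_top_p = filter_candidates(probs, top_p=0.7)   # 0.5 + 0.3 reaches 0.7
by_min_p = filter_candidates(probs, min_p=0.4)   # threshold is 0.4 * 0.5 = 0.2
```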

param stop_token_ids: list[int] | None = None

Token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.

param ignore_eos: bool = False

Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

param use_beam_search: bool = False

Whether to use beam search instead of sampling.

param length_penalty: float = 1.0

Float that penalizes sequences based on their length. Used in beam search.
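
The text above does not state how the penalty enters the beam score. A common formulation, assumed here and not confirmed for Furiosa-LLM, divides the cumulative log probability by the sequence length raised to `length_penalty`. `beam_score` is a hypothetical helper.

```python
def beam_score(cumulative_logprob, num_tokens, length_penalty=1.0):
    """Length-normalized beam score (assumed formulation)."""
    return cumulative_logprob / (num_tokens ** length_penalty)


# A short candidate and a longer one with a lower total log probability.
short_plain = beam_score(-4.0, 4, length_penalty=0.0)  # no normalization
long_plain = beam_score(-6.0, 8, length_penalty=0.0)
short_norm = beam_score(-4.0, 4, length_penalty=1.0)   # per-token average
long_norm = beam_score(-6.0, 8, length_penalty=1.0)
```

Under this assumption, a penalty of 0.0 favors the short candidate, while 1.0 favors the longer one because its average per-token log probability is higher.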

param early_stopping: bool | str = False

Controls the stopping condition for beam search. It accepts the following values: True, where the generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).

param max_tokens: int | None = 16

Maximum number of tokens to generate per output sequence. If the value is None, it is capped to the maximum sequence length.

param min_tokens: int = 0

Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated.

param skip_special_tokens: bool = True

Whether to skip special tokens in the output.

param logprobs: int | None = None

Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.
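
The "up to logprobs+1 elements" behavior described above can be sketched as follows. The dictionary layout and the helper name `logprobs_for_position` are assumptions made for illustration; they do not reflect the actual response structure.

```python
import math


def logprobs_for_position(probs, sampled_id, logprobs):
    """Map token id -> log probability for one output position."""
    if logprobs is None:
        return None  # no log probabilities are returned
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    result = {tid: math.log(probs[tid]) for tid in ranked[:logprobs]}
    # The sampled token is always reported, even when it is outside the top entries.
    result[sampled_id] = math.log(probs[sampled_id])
    return result


probs = [0.6, 0.3, 0.1]
outside_top = logprobs_for_position(probs, sampled_id=2, logprobs=1)  # two entries
inside_top = logprobs_for_position(probs, sampled_id=0, logprobs=1)   # one entry
```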

param prompt_logprobs: int | None = None

Number of log probabilities to return per prompt token. When set to None (default), no prompt logprobs are returned. When set to a non-negative integer, returns the top-k log probabilities for each prompt token position, plus the actual token's logprob. Set to -1 to return log probabilities for all vocabulary tokens. Warning: Using -1 can cause significant memory and network overhead as it returns logprobs for the entire vocabulary (e.g., ~150K tokens for Qwen models) at each prompt position.

param output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE
param structured_outputs: StructuredOutputsParams | None = None

Attributes

attribute n = n
attribute best_of = best_of if best_of is not None else n
attribute repetition_penalty = repetition_penalty
attribute temperature = temperature
attribute top_p = top_p
attribute top_k = top_k
attribute min_p = min_p
attribute stop_token_ids = stop_token_ids
attribute ignore_eos = ignore_eos
attribute use_beam_search = use_beam_search
attribute length_penalty = length_penalty
attribute early_stopping = early_stopping
attribute max_tokens = max_tokens
attribute min_tokens = min_tokens
attribute skip_special_tokens = skip_special_tokens
attribute logprobs = 1 if logprobs is True else logprobs
attribute prompt_logprobs = prompt_logprobs
attribute output_kind = output_kind
attribute structured_outputs = structured_outputs
attribute sampling_type: SamplingType
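
Two of the attributes above are normalized rather than copied: best_of falls back to n, and a logprobs value of True is converted to 1. The sketch below reproduces only those two expressions in a stand-in function; `normalize_fields` is hypothetical and is not the class constructor.

```python
def normalize_fields(n=1, best_of=None, logprobs=None):
    """Apply the two normalizing expressions listed in the attribute table."""
    return {
        "n": n,
        # best_of defaults to n when it is not given.
        "best_of": best_of if best_of is not None else n,
        # A boolean True is treated as a request for one log probability.
        "logprobs": 1 if logprobs is True else logprobs,
    }


defaults = normalize_fields()
three_outputs = normalize_fields(n=3)
explicit = normalize_fields(n=2, best_of=4, logprobs=True)
```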

Methods

method from_optional(cls, *, n=None, best_of=None, repetition_penalty=1.0, temperature=None, top_p=None, top_k=None, min_p=0.0, stop_token_ids=None, ignore_eos=None, use_beam_search=None, length_penalty=None, early_stopping=None, max_tokens=None, min_tokens=None, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, output_kind=None, structured_outputs=None) -> SamplingParams
param cls
param n: int | None = None
param best_of: int | None = None
param repetition_penalty: float | None = 1.0
param temperature: float | None = None
param top_p: float | None = None
param top_k: int | None = None
param min_p: float = 0.0
param stop_token_ids: list[int] | None = None
param ignore_eos: bool | None = None
param use_beam_search: bool | None = None
param length_penalty: float | None = None
param early_stopping: bool | str | None = None
param max_tokens: int | None = None
param min_tokens: int | None = None
param skip_special_tokens: bool = True
param logprobs: int | None = None
param prompt_logprobs: int | None = None
param output_kind: RequestOutputKind | None = None
param structured_outputs: StructuredOutputsParams | None = None

Returns

furiosa_llm.sampling_params.SamplingParams

method clone(self) -> SamplingParams
param self

Returns

furiosa_llm.sampling_params.SamplingParams

method structured_outputs_enabled(self)
param self

Returns

None

Default Sampling Parameters from generation_config.json

If a model artifact contains a generation_config.json file, Furiosa-LLM uses the values in that file as the effective defaults for the fields listed below. The file is copied verbatim from the source Hugging Face model during artifact build — Furiosa-LLM does not customize it. If the file is absent, the plain SamplingParams() defaults apply.

The following generation_config.json keys are honored, and are mapped to SamplingParams fields as shown:

generation_config.json | SamplingParams
repetition_penalty     | repetition_penalty
temperature            | temperature
top_k                  | top_k
top_p                  | top_p
min_p                  | min_p
max_new_tokens         | max_tokens
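
The mapping above can be expressed as a small lookup. This is a sketch of the key translation only, not of how Furiosa-LLM loads the file; `sampling_defaults_from_generation_config` is a hypothetical helper and the JSON content is made up for the example.

```python
import json

# generation_config.json key -> SamplingParams field, as listed in the table above.
KEY_TO_FIELD = {
    "repetition_penalty": "repetition_penalty",
    "temperature": "temperature",
    "top_k": "top_k",
    "top_p": "top_p",
    "min_p": "min_p",
    "max_new_tokens": "max_tokens",
}


def sampling_defaults_from_generation_config(config_text):
    """Pick the honored keys out of a generation_config.json document."""
    config = json.loads(config_text)
    # Keys that are not in the table (for example do_sample) are ignored.
    return {field: config[key] for key, field in KEY_TO_FIELD.items() if key in config}


example = '{"temperature": 0.6, "top_p": 0.9, "max_new_tokens": 512, "do_sample": true}'
defaults = sampling_defaults_from_generation_config(example)
```

Note that max_new_tokens is the only key whose name changes; it becomes max_tokens.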

Offline API (LLM.generate / LLM.chat / LLM.stream_generate)

The offline API uses a binary decision based on whether the caller passes sampling_params:

  • sampling_params=None (the default) — the model's generation_config.json defaults are applied. See get_default_sampling_params.
  • sampling_params=SamplingParams(...) — the user's object is used as-is. No per-field merge with the model's generation_config.json is performed.
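
A practical consequence of the rule above is that passing any SamplingParams object discards the model's defaults for every field, including those the caller did not set. The sketch below models this with plain dictionaries; `resolve_offline_params` and the dictionaries are stand-ins for illustration, and the class defaults are the ones listed in the parameter reference.

```python
# Plain SamplingParams() defaults for three fields, taken from the reference above.
CLASS_DEFAULTS = {"temperature": 1.0, "top_p": 1.0, "max_tokens": 16}


def resolve_offline_params(sampling_params, model_defaults):
    """Binary decision: model defaults when None, otherwise the caller's object as-is."""
    if sampling_params is None:
        return dict(model_defaults)
    return dict(sampling_params)


# Hypothetical values read from a model's generation_config.json.
model_defaults = {"temperature": 0.6, "top_p": 0.9, "max_tokens": 512}

# Stand-in for SamplingParams(max_tokens=100): unset fields keep class defaults.
user_params = {**CLASS_DEFAULTS, "max_tokens": 100}

resolved_default = resolve_offline_params(None, model_defaults)
resolved_custom = resolve_offline_params(user_params, model_defaults)
```

In `resolved_custom` the temperature is 1.0, not the model's 0.6, because no per-field merge takes place.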

OpenAI-compatible server

The Chat Completions, Completions, and Responses endpoints resolve each sampling-related field in three tiers:

  1. The value specified in the request body, if set.
  2. The value from the model's generation_config.json, if present.
  3. The API default shown in the endpoint's parameter table (see Chat API (/v1/chat/completions), Completions API (/v1/completions), and API Reference).
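
The three tiers above amount to a per-field fallback chain. The sketch below assumes that a field counts as "set" when it is present in the request body and not None, and that the generation_config.json values have already been translated to SamplingParams field names; both are assumptions, and `resolve_field` is a hypothetical helper.

```python
def resolve_field(name, request_body, generation_config, api_defaults):
    """Resolve one sampling field using the three-tier order described above."""
    # Tier 1: the value in the request body, if set.
    if request_body.get(name) is not None:
        return request_body[name]
    # Tier 2: the value from the model's generation_config.json, if present.
    if name in generation_config:
        return generation_config[name]
    # Tier 3: the API default for the endpoint.
    return api_defaults[name]


request_body = {"temperature": 0.2}
generation_config = {"temperature": 0.6, "top_p": 0.9}  # hypothetical model values
api_defaults = {"temperature": 1.0, "top_p": 1.0, "top_k": -1}  # illustrative values

temperature = resolve_field("temperature", request_body, generation_config, api_defaults)
top_p = resolve_field("top_p", request_body, generation_config, api_defaults)
top_k = resolve_field("top_k", request_body, generation_config, api_defaults)
```

Unlike the offline API, each field is resolved independently, so a request that sets only temperature still picks up the model's top_p.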

Examples

This section provides examples of how to use the token generation methods available in the SDK.

1. Basic Greedy Search

python
SamplingParams(min_tokens=10, max_tokens=100)

The Basic Greedy Search method generates a sequence of tokens, ensuring that at least min_tokens and up to max_tokens are produced.

  • Parameters:

    • min_tokens: Minimum number of tokens to generate.
    • max_tokens: Maximum number of tokens to generate.
  • Behavior:

    • Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.
    • The EOS token will not be generated before reaching the specified min_tokens.
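
The interaction between min_tokens, max_tokens, and EOS described above can be sketched with a toy decoding loop. The "model" here is a scripted list of ranked candidates per step, the EOS id of 0 is arbitrary, and counting EOS as an output token is a simplification; none of this reflects the actual engine.

```python
EOS = 0  # arbitrary token id used for this illustration


def toy_generate(proposals, min_tokens, max_tokens):
    """`proposals[i]` lists candidate token ids for step i, most likely first."""
    output = []
    for candidates in proposals:
        if len(output) >= max_tokens:
            break  # hard upper bound on generated tokens
        if len(output) < min_tokens:
            # EOS is masked out until min_tokens tokens have been produced.
            candidates = [t for t in candidates if t != EOS]
        token = candidates[0]  # greedy choice among the remaining candidates
        output.append(token)
        if token == EOS:
            break  # generation may end before max_tokens
    return output


proposals = [[EOS, 5], [7, EOS], [EOS, 9], [4, EOS]]
with_minimum = toy_generate(proposals, min_tokens=2, max_tokens=10)
without_minimum = toy_generate(proposals, min_tokens=0, max_tokens=10)
capped = toy_generate([[5], [7], [9]], min_tokens=0, max_tokens=2)
```

With min_tokens=2 the EOS proposed at the first step is skipped, and generation ends at the third step when EOS becomes allowed.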

2. Random Sampling with top_p / top_k Parameters

python
SamplingParams(min_tokens=10, max_tokens=100, top_p=0.3, top_k=100)

This method uses random sampling techniques for token generation, allowing for diverse outputs.

  • Parameters:

    • min_tokens: Minimum number of tokens to generate.
    • max_tokens: Maximum number of tokens to generate.
    • top_p: Cumulative probability for nucleus sampling.
    • top_k: Number of highest probability tokens to consider.
  • Behavior:

    • Each generation may yield different results, even with the same input text and parameters, enhancing variability.
    • Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.
    • The EOS token will not be generated before reaching the specified min_tokens.
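
The run-to-run variability noted above comes from drawing tokens at random from the filtered distribution. The sketch below shows one such draw using Python's standard `random` module; `sample_token` is a hypothetical helper, and the seeded generator is used only to make the toy example repeatable.

```python
import random


def sample_token(probs, kept_ids, rng):
    """Draw one token id from `kept_ids`, weighted by its probability."""
    weights = [probs[t] for t in kept_ids]
    # random.choices normalizes the weights, so no manual renormalization is needed.
    return rng.choices(kept_ids, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
kept_ids = [0, 1]      # e.g. the tokens that survived top_p / top_k filtering
rng = random.Random()  # unseeded: draws can differ from run to run
draws = [sample_token(probs, kept_ids, rng) for _ in range(20)]

# Two generators with the same seed produce the same draw.
first = sample_token(probs, kept_ids, random.Random(0))
second = sample_token(probs, kept_ids, random.Random(0))
```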

3. Beam Search with best_of Beams

python
SamplingParams(min_tokens=10, max_tokens=100, use_beam_search=True, best_of=4)

Beam Search enhances the generation process by exploring multiple sequences simultaneously.

  • Parameters:

    • min_tokens: Minimum number of tokens to generate.
    • max_tokens: Maximum number of tokens to generate.
    • use_beam_search: Must be set to True to enable beam search.
    • best_of: Number of beams to consider for generating the best output.
  • Behavior:

    • The generation process explores multiple possible sequences to determine the best output.
    • Generation may terminate before reaching max_tokens if the number of End Of Sequence (EOS) tokens generated across all beams reaches the best_of count.
    • The EOS token will not be generated before reaching the specified min_tokens.
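
The stopping rule above (stop once the number of finished beams reaches best_of) can be sketched with a toy beam search over a hand-written next-token table. This is a simplified illustration: it ignores length_penalty, min_tokens, and the early_stopping modes, and `toy_beam_search` and `next_probs` are hypothetical.

```python
import math

EOS = 0  # arbitrary token id used for this illustration


def toy_beam_search(next_probs, beam_width, max_tokens):
    """Return finished (tokens, logprob) pairs, best first."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_tokens):
        # Expand every live beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for tid, p in next_probs(tokens).items():
                candidates.append((tokens + (tid,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)

        # Keep the best `beam_width` candidates; those ending in EOS are finished.
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))

        # Stop once enough beams have produced EOS, or nothing is left to expand.
        if len(finished) >= beam_width or not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)


def next_probs(prefix):
    """Toy model: EOS only becomes possible after the first token."""
    return {EOS: 0.4, 1: 0.6} if prefix else {1: 0.7, 2: 0.3}


results = toy_beam_search(next_probs, beam_width=2, max_tokens=4)
best_tokens = results[0][0]
```

In this run the search stops after three steps, as soon as two beams have ended with EOS, without using the full max_tokens budget.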
