SamplingParams class
Reference for the SamplingParams class, default sampling parameters from generation_config.json, and token generation examples.
class SamplingParams
Sampling parameters for text generation.
param self
param n: int = 1
Number of output sequences to return for the given prompt.
param best_of: int | None = None
Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n.
param repetition_penalty: float = 1.0
Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
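The reference does not state the exact formula behind repetition_penalty. The sketch below assumes the formulation commonly used by Hugging Face Transformers (divide positive logits, multiply negative logits); the function name apply_repetition_penalty is hypothetical and is not part of Furiosa-LLM.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Return a copy of logits with the penalty applied to already-seen tokens.

    Assumption: positive logits are divided by the penalty and negative
    logits are multiplied by it, so penalty > 1 always lowers the score.
    """
    out = list(logits)
    for token_id in set(seen_token_ids):
        if out[token_id] > 0:
            out[token_id] = out[token_id] / penalty
        else:
            out[token_id] = out[token_id] * penalty
    return out


logits = [2.0, -1.0, 0.5]
# Tokens 0 and 1 already appear in the prompt or the generated text.
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1], penalty=2.0)
print(penalized)  # [1.0, -2.0, 0.5]
```

With penalty=1.0 the logits are unchanged, which matches the documented default.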
param temperature: float = 1.0
Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
param top_p: float = 1.0
Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
param top_k: int = -1
Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens.
param min_p: float = 0.0
Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
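To make the interaction of temperature, top_k, top_p, and min_p concrete, here is a minimal pure-Python sketch. The helper names (softmax, filter_candidates) and the order in which the filters are applied are assumptions for illustration; the actual Furiosa-LLM implementation may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature must be > 0 here.

    A temperature of zero is documented as greedy sampling, which would be
    a plain argmax rather than a division by zero.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return token ids (most likely first) that survive all three filters."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:  # -1 means "consider all tokens"
        order = order[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # min_p: threshold relative to the most likely token's probability.
    threshold = min_p * probs[order[0]]
    return [i for i in kept if probs[i] >= threshold]


probs = [0.5, 0.3, 0.15, 0.05]
print(filter_candidates(probs))             # [0, 1, 2, 3]
print(filter_candidates(probs, top_k=2))    # [0, 1]
print(filter_candidates(probs, top_p=0.7))  # [0, 1]
print(filter_candidates(probs, min_p=0.4))  # [0, 1]

# Lower temperature concentrates probability mass on the top token.
sharp = softmax([1.0, 2.0, 3.0], temperature=0.1)
flat = softmax([1.0, 2.0, 3.0], temperature=1.0)
print(sharp[2] > flat[2])  # True
```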
param stop_token_ids: list[int] | None = None
Token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.
param ignore_eos: bool = False
Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
param use_beam_search: bool = False
Whether to use beam search instead of sampling.
param length_penalty: float = 1.0
Float that penalizes sequences based on their length. Used in beam search.
param early_stopping: bool | str = False
Controls the stopping condition for beam search. It accepts the following values: True, where the generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
param max_tokens: int | None = 16
Maximum number of tokens to generate per output sequence. If the value is None, it is capped to the maximum sequence length.
param min_tokens: int = 0
Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated.
param skip_special_tokens: bool = True
Whether to skip special tokens in the output.
param logprobs: int | None = None
Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response.
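The "up to logprobs+1" rule can be illustrated with a small sketch: the top logprobs tokens are returned, and the sampled token is added if it is not already among them. The helper select_logprobs is hypothetical and only models the counting rule, not the actual response format.

```python
import math


def select_logprobs(probs, sampled_id, logprobs):
    """Return {token_id: logprob} for the top `logprobs` tokens plus the sampled one."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:logprobs]
    ids = set(top) | {sampled_id}  # the sampled token is always included
    return {i: math.log(probs[i]) for i in ids}


probs = [0.5, 0.3, 0.15, 0.05]
in_top = select_logprobs(probs, sampled_id=0, logprobs=2)       # sampled token is in the top 2
outside_top = select_logprobs(probs, sampled_id=3, logprobs=2)  # sampled token is not
print(len(in_top), len(outside_top))  # 2 3
```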
param prompt_logprobs: int | None = None
Number of log probabilities to return per prompt token. When set to None (default), no prompt logprobs are returned. When set to a non-negative integer, returns the top-k log probabilities for each prompt token position, plus the actual token's logprob. Set to -1 to return log probabilities for all vocabulary tokens.
Warning: Using -1 can cause significant memory and network overhead as it returns logprobs for the entire vocabulary (e.g., ~150K tokens for Qwen models) at each prompt position.
param output_kind: RequestOutputKind = RequestOutputKind.CUMULATIVE
param structured_outputs: StructuredOutputsParams | None = None
Attributes
attribute n = n
attribute best_of = best_of if best_of is not None else n
attribute repetition_penalty = repetition_penalty
attribute temperature = temperature
attribute top_p = top_p
attribute top_k = top_k
attribute min_p = min_p
attribute stop_token_ids = stop_token_ids
attribute ignore_eos = ignore_eos
attribute use_beam_search = use_beam_search
attribute length_penalty = length_penalty
attribute early_stopping = early_stopping
attribute max_tokens = max_tokens
attribute min_tokens = min_tokens
attribute skip_special_tokens = skip_special_tokens
attribute logprobs = 1 if logprobs is True else logprobs
attribute prompt_logprobs = prompt_logprobs
attribute output_kind = output_kind
attribute structured_outputs = structured_outputs
attribute sampling_type: SamplingType
Methods
method from_optional(cls, *, n=None, best_of=None, repetition_penalty=1.0, temperature=None, top_p=None, top_k=None, min_p=0.0, stop_token_ids=None, ignore_eos=None, use_beam_search=None, length_penalty=None, early_stopping=None, max_tokens=None, min_tokens=None, skip_special_tokens=True, logprobs=None, prompt_logprobs=None, output_kind=None, structured_outputs=None) -> SamplingParams
param cls
param n: int | None = None
param best_of: int | None = None
param repetition_penalty: float | None = 1.0
param temperature: float | None = None
param top_p: float | None = None
param top_k: int | None = None
param min_p: float = 0.0
param stop_token_ids: list[int] | None = None
param ignore_eos: bool | None = None
param use_beam_search: bool | None = None
param length_penalty: float | None = None
param early_stopping: bool | str | None = None
param max_tokens: int | None = None
param min_tokens: int | None = None
param skip_special_tokens: bool = True
param logprobs: int | None = None
param prompt_logprobs: int | None = None
param output_kind: RequestOutputKind | None = None
param structured_outputs: StructuredOutputsParams | None = None
Returns
furiosa_llm.sampling_params.SamplingParams
method clone(self) -> SamplingParams
param self
Returns
furiosa_llm.sampling_params.SamplingParams
method structured_outputs_enabled(self)
param self
Returns
None
Default Sampling Parameters from generation_config.json
If a model artifact contains a generation_config.json file, Furiosa-LLM
uses the values in that file as the effective defaults for the fields listed
below. The file is copied verbatim from the source Hugging Face model during
artifact build — Furiosa-LLM does not customize it. If the file is absent,
the plain SamplingParams() defaults apply.
The following generation_config.json keys are honored, and are mapped to
SamplingParams fields as shown:
| generation_config.json | SamplingParams |
|---|---|
| repetition_penalty | repetition_penalty |
| temperature | temperature |
| top_k | top_k |
| top_p | top_p |
| min_p | min_p |
| max_new_tokens | max_tokens |
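The key mapping in the table can be sketched as a plain dictionary lookup. This is an illustration only: the names GENERATION_CONFIG_TO_SAMPLING_PARAMS and sampling_kwargs_from_generation_config are hypothetical, and the sketch assumes keys outside the table are simply ignored.

```python
# Mapping taken from the table: generation_config.json key -> SamplingParams field.
GENERATION_CONFIG_TO_SAMPLING_PARAMS = {
    "repetition_penalty": "repetition_penalty",
    "temperature": "temperature",
    "top_k": "top_k",
    "top_p": "top_p",
    "min_p": "min_p",
    "max_new_tokens": "max_tokens",
}


def sampling_kwargs_from_generation_config(generation_config):
    """Pick the honored keys and rename them to SamplingParams field names."""
    return {
        field: generation_config[key]
        for key, field in GENERATION_CONFIG_TO_SAMPLING_PARAMS.items()
        if key in generation_config
    }


# "do_sample" is not in the table, so this sketch drops it.
config = {"temperature": 0.7, "max_new_tokens": 256, "do_sample": True}
kwargs = sampling_kwargs_from_generation_config(config)
print(kwargs)  # {'temperature': 0.7, 'max_tokens': 256}
```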
Offline API (LLM.generate / LLM.chat / LLM.stream_generate)
The offline API uses a binary decision based on whether the caller
passes sampling_params:
- sampling_params=None (the default): the model's generation_config.json defaults are applied. See get_default_sampling_params.
- sampling_params=SamplingParams(...): the user's object is used as-is. No per-field merge with the model's generation_config.json is performed.
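The all-or-nothing behavior of the offline API can be modeled as follows. Plain dictionaries stand in for SamplingParams objects, and effective_sampling_params is a hypothetical helper, not a Furiosa-LLM function.

```python
def effective_sampling_params(sampling_params, model_defaults):
    """Binary decision: use the model defaults only when nothing was passed."""
    if sampling_params is None:
        return model_defaults
    # The caller's object is used as-is; no per-field merge happens.
    return sampling_params


model_defaults = {"temperature": 0.6, "max_tokens": 512}  # from generation_config.json
user_params = {"max_tokens": 32}                          # caller sets a single field

print(effective_sampling_params(None, model_defaults))         # {'temperature': 0.6, 'max_tokens': 512}
print(effective_sampling_params(user_params, model_defaults))  # {'max_tokens': 32}
```

In the second call the model's temperature of 0.6 is not carried over: fields the caller leaves unset fall back to the plain SamplingParams() defaults rather than to generation_config.json.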
OpenAI-compatible server
The Chat Completions, Completions, and Responses endpoints resolve each sampling-related field in three tiers:
- The value specified in the request body, if set.
- The value from the model's generation_config.json, if present.
- The API default shown in the endpoint's parameter table (see Chat API (/v1/chat/completions), Completions API (/v1/completions), and API Reference).
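The three-tier lookup amounts to a first-match fallback chain. The sketch below assumes that an unset value is represented as None; resolve_field is a hypothetical name used only for illustration.

```python
def resolve_field(request_value, generation_config_value, api_default):
    """Return the first value that is set, checking the tiers in order."""
    for candidate in (request_value, generation_config_value):
        if candidate is not None:
            return candidate
    return api_default


# Request body wins over everything else.
print(resolve_field(0.2, 0.7, 1.0))    # 0.2
# No request value: fall back to generation_config.json.
print(resolve_field(None, 0.7, 1.0))   # 0.7
# Neither is set: use the API default.
print(resolve_field(None, None, 1.0))  # 1.0
```

Unlike the offline API, this resolution is applied per field, so a request that sets only one field still picks up the model's generation_config.json values for the others.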
Examples
This section provides examples of how to use the token generation methods available in the SDK.
1. Basic Greedy Search
SamplingParams(min_tokens=10, max_tokens=100)
The Basic Greedy Search method generates a sequence of tokens, ensuring that at least min_tokens and up to max_tokens are produced.
- Parameters:
  - min_tokens: Minimum number of tokens to generate.
  - max_tokens: Maximum number of tokens to generate.
- Behavior:
  - Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.
  - The EOS token will not be generated before reaching the specified min_tokens.
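The min_tokens and max_tokens behavior described above can be sketched with a toy greedy loop. Everything here (the EOS id of 0, the generate function, the toy_model scorer) is a stand-in for illustration and does not use the Furiosa-LLM API.

```python
EOS = 0  # hypothetical EOS token id for this toy example


def generate(next_token_scores, min_tokens, max_tokens):
    """Greedy toy loop; next_token_scores(step) returns {token_id: score}."""
    output = []
    while len(output) < max_tokens:
        scores = dict(next_token_scores(len(output)))
        if len(output) < min_tokens:
            scores.pop(EOS, None)  # EOS cannot be chosen before min_tokens
        token = max(scores, key=scores.get)  # greedy: take the best-scoring token
        output.append(token)
        if token == EOS:
            break  # may stop before reaching max_tokens
    return output


def toy_model(step):
    # EOS always has the highest score, so it is picked as soon as it is allowed.
    return {EOS: 1.0, 7: 0.5}


print(generate(toy_model, min_tokens=3, max_tokens=10))  # [7, 7, 7, 0]
print(generate(toy_model, min_tokens=0, max_tokens=10))  # [0]
```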
2. Random Sampling with top_p / top_k Parameters
SamplingParams(min_tokens=10, max_tokens=100, top_p=0.3, top_k=100)
This method uses random sampling techniques for token generation, allowing for diverse outputs.
- Parameters:
  - min_tokens: Minimum number of tokens to generate.
  - max_tokens: Maximum number of tokens to generate.
  - top_p: Cumulative probability for nucleus sampling.
  - top_k: Number of highest probability tokens to consider.
- Behavior:
  - Each generation may yield different results, even with the same input text and parameters, enhancing variability.
  - Generation may terminate before reaching max_tokens if an End Of Sequence (EOS) token is generated.
  - The EOS token will not be generated before reaching the specified min_tokens.
3. Beam Search with best_of Beams
SamplingParams(min_tokens=10, max_tokens=100, use_beam_search=True, best_of=4)
Beam Search enhances the generation process by exploring multiple sequences simultaneously.
- Parameters:
  - min_tokens: Minimum number of tokens to generate.
  - max_tokens: Maximum number of tokens to generate.
  - use_beam_search: Must be set to True to enable beam search.
  - best_of: Number of beams to consider for generating the best output.
- Behavior:
  - The generation process explores multiple possible sequences to determine the best output.
  - Generation may terminate before reaching max_tokens if the number of End Of Sequence (EOS) tokens generated across all beams reaches the best_of count.
  - The EOS token will not be generated before reaching the specified min_tokens.
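When beam search ranks finished candidates, the length_penalty parameter influences which beam is considered best. This reference does not give the scoring formula, so the sketch below assumes the convention used by Hugging Face Transformers, where a candidate's summed log probability is divided by its length raised to length_penalty. The function beam_score is hypothetical.

```python
def beam_score(token_logprobs, length_penalty=1.0):
    """Assumed scoring rule: sum of logprobs / (length ** length_penalty)."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


short_beam = [-0.5, -0.5]             # 2 tokens, total logprob -1.0
long_beam = [-0.4, -0.4, -0.4, -0.4]  # 4 tokens, total logprob about -1.6

# With length_penalty=1.0 the score is the average logprob, so the longer beam wins.
print(beam_score(long_beam, 1.0) > beam_score(short_beam, 1.0))  # True
# With length_penalty=0.0 the raw sum is used, so the shorter beam wins.
print(beam_score(short_beam, 0.0) > beam_score(long_beam, 0.0))  # True
```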