LLM class
API reference for the furiosa_llm.LLM class and its key methods for text generation, embedding, and scoring.
classLLMAn LLM for generating texts from given prompts and sampling parameters.
paramselfparammodel_id_or_pathstr | os.PathLikeparamfxbstr | os.PathLike | None= Noneparamrevisionstr | None= Noneparamdevicesstr | Sequence[Device] | None= Noneparamdata_parallel_sizeint | None= Noneparampipeline_parallel_sizeint | None= Noneparamnum_blocks_per_pp_stageSequence[int] | None= Noneparammax_io_memory_mbint= 2048paramscheduler_configSchedulerConfig | None= Noneparamstructured_outputs_backendLiteral['auto', 'guidance', 'xgrammar']= 'auto'paramtokenizerstr | PreTrainedTokenizer | PreTrainedTokenizerFast | None= Noneparamtokenizer_modeTokenizerModeType= 'auto'paramseedint | None= Noneparamcache_diros.PathLike= CACHE_DIRparamskip_enginebool= Falseparamenable_jit_compilationbool= Falseparamjit_thresholdint= DEFAULT_JIT_THRESHOLDparamjit_max_workersint= DEFAULT_JIT_MAX_WORKERSparamjit_unit_sizeint= DEFAULT_JIT_UNIT_SIZEparamserved_model_namestr | None= Noneparamkwargs= {}Attributes
attributemax_model_lenintattributeengineNativeEngineLikeMethods
methodload_artifact(cls, model_id_or_path, **kwargs) -> LLMDeprecated: Use LLM() constructor directly.
This method is kept for backward compatibility and will be removed in a future release.
paramclsparammodel_id_or_pathstr | os.PathLikeparamkwargs= {}Returns
furiosa_llm.api.LLMmethodget_default_sampling_params(self) -> SamplingParamsReturn SamplingParams reflecting model's generation_config defaults.
If the model has no generation_config or it matches HF defaults, returns a default SamplingParams().
paramselfReturns
furiosa_llm.sampling_params.SamplingParamsmethodgenerate(self, prompts, sampling_params=None, prompt_token_ids=None, tokenizer_kwargs=None) -> RequestOutput | list[RequestOutput]Generate texts from given prompts and sampling parameters.
paramselfparampromptsstr | list[str]The prompts to generate texts.
paramsampling_paramsSamplingParams | None= NoneThe sampling parameters for generating texts. If None, model's generation config defaults are used.
paramprompt_token_idsBatchEncoding | None= NonePre-tokenized prompt input as a BatchEncoding object.
If not provided, the prompt will be tokenized internally using the tokenizer.
paramtokenizer_kwargsdict[str, Any] | None= NoneAdditional keyword arguments passed to the tokenizer's
encode method, such as \{"use_special_tokens": True\}.
Returns
RequestOutput | list[RequestOutput]A list of RequestOutput objects containing the generated
methodchat(self, messages, sampling_params=None, chat_template=None, chat_template_content_format='string', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None) -> list[RequestOutput]Generate responses for a chat conversation.
The chat conversation is converted into a text prompt using the
tokenizer and calls the generate method to generate the
responses.
paramselfparammessageslist[ChatCompletionMessageParam] | list[list[ChatCompletionMessageParam]]A list of conversations or a single conversation.
- Each conversation is represented as a list of messages.
- Each message is a dictionary with 'role' and 'content' keys.
paramsampling_paramsSamplingParams | None= NoneThe sampling parameters for text generation.
paramchat_templatestr | None= NoneThe template to use for structuring the chat. If not provided, the model's default chat template will be used.
paramchat_template_content_formatChatTemplateContentFormatOption= 'string'The format to render message content. Currently only "string" is supported.
paramadd_generation_promptbool= TrueIf True, adds a generation template to each message.
paramcontinue_final_messagebool= FalseIf True, continues the final message in
the conversation instead of starting a new one. Cannot be
True if add_generation_prompt is also True.
paramtoolslist[dict[str, Any]] | None= NoneOptional list of tools to use in the chat.
paramchat_template_kwargsdict[str, Any] | None= NoneAdditional keyword arguments to pass to the chat template rendering function.
Returns
listA list of RequestOutput objects containing the generated
methodstream_generate(self, prompt, sampling_params=None, prompt_token_ids=None, tokenizer_kwargs=None, is_demo=False) -> AsyncGenerator[str, None]Generate texts from given prompt and sampling parameters.
paramselfparampromptstrThe prompt to generate texts. Note that unlike generate,
this API supports only a single prompt.
paramsampling_paramsSamplingParams | None= NoneThe sampling parameters for generating texts.
paramprompt_token_idsBatchEncoding | None= NonePre-tokenized prompt input as a BatchEncoding object.
If not provided, the prompt will be tokenized internally using the tokenizer.
paramtokenizer_kwargsdict[str, Any] | None= NoneAdditional keyword arguments passed to the tokenizer's
encode method, such as \{"use_special_tokens": True\}.
paramis_demobool= FalseReturns
collections.abc.AsyncGeneratorA stream of generated output tokens.
methodencode(self, prompts, pooling_params=None, *, pooling_task=None) -> list[PoolingRequestOutput]Apply pooling to the hidden states corresponding to the input prompts.
paramselfparampromptsPromptType | Sequence[PromptType]The prompts to the LLM. You may pass a sequence of prompts for batch inference.
parampooling_paramsPoolingParams | Sequence[PoolingParams] | None= NoneThe pooling parameters for pooling.
parampooling_taskPoolingTask | None= NoneOverride the pooling task to use.
Returns
listA list of PoolingRequestOutput objects containing the
methodembed(self, prompts, pooling_params=None) -> list[EmbeddingRequestOutput]Generate an embedding vector for each prompt. Only applicable to embedding models.
paramselfparampromptsPromptType | Sequence[PromptType]The prompts to the LLM. You may pass a sequence of prompts for batch embedding.
parampooling_paramsPoolingParams | Sequence[PoolingParams] | None= NoneThe pooling parameters for pooling.
Returns
listA list of EmbeddingRequestOutput objects containing the
methodscore(self, data_1, data_2, /, *, truncate_prompt_tokens=None, pooling_params=None, chat_template=None) -> list[ScoringRequestOutput]Generate similarity scores for all pairs \<text,text_pair>.
The inputs can be 1 -> 1, 1 -> N or N -> N.
In the 1 - N case the data_1 input will be replicated N
times to pair with the data_2 inputs.
Returns:
A list of ScoringRequestOutput objects containing the
generated scores in the same order as the input prompts.
paramselfparamdata_1PromptType | Sequence[PromptType]Can be a single prompt or a list of prompts.
When a list, it must have the same length as the data_2 list.
paramdata_2PromptType | Sequence[PromptType]The data to pair with the query to form the input to the LLM.
paramtruncate_prompt_tokensint | None= NoneThe number of tokens to truncate the prompt to.
parampooling_paramsPoolingParams | None= NoneThe pooling parameters for pooling. If None, we use the default pooling parameters.
paramchat_templatestr | None= NoneThe chat template to use for the scoring. If None, we use the model's default chat template.
Returns
list[furiosa_llm.outputs.ScoringRequestOutput]methodshutdown(self)Shutdown the LLM engine gracefully. Idempotent.
paramselfReturns
NoneKey Methods
generate()
Generate text completions for the given prompts using sampling parameters. This is the primary method for text generation tasks.
When sampling_params is omitted or None, the defaults returned by
get_default_sampling_params are used. The same behavior applies to
chat and stream_generate.
See Default Sampling Parameters from generation_config.json for details.
See the chat example for usage.
get_default_sampling_params()
Returns the SamplingParams used when generate(),
chat(), or stream_generate() is called without an explicit
sampling_params argument.
If the loaded artifact contains a generation_config.json file, its
values populate the returned object; otherwise a plain SamplingParams()
is returned. See Default Sampling Parameters from generation_config.json for the full list of
honored fields and the resolution rules on the server side.
embed()
Generate embedding vectors for the given prompts. This method is only applicable to embedding models.
Parameters:
prompts(PromptType | Sequence[PromptType]): The prompts to encode. Can be a single prompt or a sequence for batch processing.pooling_params(PoolingParams | Sequence[PoolingParams] | None): The pooling parameters. If None, default parameters are used.
Returns:
List[EmbeddingRequestOutput]: A list of embedding outputs containing the embedding vectors in the same order as the input prompts.
Example:
from furiosa_llm import LLM, PoolingParams
with LLM(artifact_path="path/to/embedding/model") as llm:
# Single embedding
outputs = llm.embed("Hello, world!")
embedding = outputs[0].outputs.embedding
# Batch embedding with normalization disabled
params = PoolingParams(normalize=False)
outputs = llm.embed(["First text", "Second text"], pooling_params=params)See the embedding example for more details.
score()
Generate similarity scores for text pairs. This method is only supported for binary classification models,
including Qwen3-Reranker models or models converted using as_binary_seq_cls_model.
Parameters:
data_1(PromptType | Sequence[PromptType]): The first input text(s). Can be a single prompt or a list.data_2(PromptType | Sequence[PromptType]): The second input text(s) to pair with the first.truncate_prompt_tokens(int | None): Maximum number of tokens to truncate the prompt to. If None, no truncation is applied.pooling_params(PoolingParams | None): The pooling parameters. If None, default parameters are used.chat_template(str | None): Custom chat template for scoring. If None, the model's default template is used.
Input Patterns:
- 1-to-1: Single text paired with single text
- 1-to-N: Single text paired with multiple texts (data_1 is replicated N times)
- N-to-N: Multiple texts paired element-wise (both lists must have the same length)
Returns:
List[ScoringRequestOutput]: A list of scoring outputs containing similarity scores in the same order as the input pairs.
Example:
from furiosa_llm import LLM, PoolingParams
with LLM(artifact_path="path/to/reranker/model") as llm:
# 1-to-N scoring: one query against multiple documents
query = "What is machine learning?"
documents = [
"Machine learning is a subset of AI",
"Python is a programming language",
"Deep learning uses neural networks"
]
outputs = llm.score(query, documents)
for i, output in enumerate(outputs):
print(f"Document {i}: score = {output.outputs.score}")See the score example for more details.