LLM class

API reference for the furiosa_llm.LLM class and its key methods for text generation, embedding, and scoring.

classLLM

An LLM for generating texts from given prompts and sampling parameters.

paramself

parammodel_id_or_pathstr | os.PathLike

paramfxbstr | os.PathLike | None

= None

paramrevisionstr | None

= None

paramdevicesstr | Sequence[Device] | None

= None

paramdata_parallel_sizeint | None

= None

parampipeline_parallel_sizeint | None

= None

paramnum_blocks_per_pp_stageSequence[int] | None

= None

parammax_io_memory_mbint

= 2048

paramscheduler_configSchedulerConfig | None

= None

paramstructured_outputs_backendLiteral['auto', 'guidance', 'xgrammar']

= 'auto'

paramtokenizerstr | PreTrainedTokenizer | PreTrainedTokenizerFast | None

= None

paramtokenizer_modeTokenizerModeType

= 'auto'

paramseedint | None

= None

paramcache_diros.PathLike

= CACHE_DIR

paramskip_enginebool

= False

paramenable_jit_compilationbool

= False

paramjit_thresholdint

= DEFAULT_JIT_THRESHOLD

paramjit_max_workersint

= DEFAULT_JIT_MAX_WORKERS

paramjit_unit_sizeint

= DEFAULT_JIT_UNIT_SIZE

paramserved_model_namestr | None

= None

paramkwargs

= {}

Attributes

attributemax_model_lenint

attributeengineNativeEngineLike

Methods

methodload_artifact(cls, model_id_or_path, **kwargs) -> LLM

Deprecated: Use LLM() constructor directly.

This method is kept for backward compatibility and will be removed in a future release.

paramcls

parammodel_id_or_pathstr | os.PathLike

paramkwargs

= {}

Returns

furiosa_llm.api.LLM

methodget_default_sampling_params(self) -> SamplingParams

Return SamplingParams reflecting model's generation_config defaults.

If the model has no generation_config or it matches HF defaults, returns a default SamplingParams().

paramself

Returns

furiosa_llm.sampling_params.SamplingParams

methodgenerate

(self, prompts, sampling_params=None, prompt_token_ids=None, tokenizer_kwargs=None) -> RequestOutput | list[RequestOutput]

Generate texts from given prompts and sampling parameters.

paramself

parampromptsstr | list[str]

The prompts to generate texts.

paramsampling_paramsSamplingParams | None

= None

The sampling parameters for generating texts. If None, model's generation config defaults are used.

paramprompt_token_idsBatchEncoding | None

= None

Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.

paramtokenizer_kwargsdict[str, Any] | None

= None

Additional keyword arguments passed to the tokenizer's encode method, such as \{"use_special_tokens": True\}.

Returns

RequestOutput | list[RequestOutput]

A list of RequestOutput objects containing the generated

methodchat

(self, messages, sampling_params=None, chat_template=None, chat_template_content_format='string', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None) -> list[RequestOutput]

Generate responses for a chat conversation.

The chat conversation is converted into a text prompt using the tokenizer and calls the generate method to generate the responses.

paramself

parammessageslist[ChatCompletionMessageParam] | list[list[ChatCompletionMessageParam]]

A list of conversations or a single conversation.

Each conversation is represented as a list of messages.
Each message is a dictionary with 'role' and 'content' keys.

paramsampling_paramsSamplingParams | None

= None

The sampling parameters for text generation.

paramchat_templatestr | None

= None

The template to use for structuring the chat. If not provided, the model's default chat template will be used.

paramchat_template_content_formatChatTemplateContentFormatOption

= 'string'

The format to render message content. Currently only "string" is supported.

paramadd_generation_promptbool

= True

If True, adds a generation template to each message.

paramcontinue_final_messagebool

= False

If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.

paramtoolslist[dict[str, Any]] | None

= None

Optional list of tools to use in the chat.

paramchat_template_kwargsdict[str, Any] | None

= None

Additional keyword arguments to pass to the chat template rendering function.

Returns

list

A list of RequestOutput objects containing the generated

methodstream_generate

(self, prompt, sampling_params=None, prompt_token_ids=None, tokenizer_kwargs=None, is_demo=False) -> AsyncGenerator[str, None]

Generate texts from given prompt and sampling parameters.

paramself

parampromptstr

The prompt to generate texts. Note that unlike generate, this API supports only a single prompt.

paramsampling_paramsSamplingParams | None

= None

The sampling parameters for generating texts.

paramprompt_token_idsBatchEncoding | None

= None

Pre-tokenized prompt input as a BatchEncoding object. If not provided, the prompt will be tokenized internally using the tokenizer.

paramtokenizer_kwargsdict[str, Any] | None

= None

Additional keyword arguments passed to the tokenizer's encode method, such as \{"use_special_tokens": True\}.

paramis_demobool

= False

Returns

collections.abc.AsyncGenerator

A stream of generated output tokens.

methodencode(self, prompts, pooling_params=None, *, pooling_task=None) -> list[PoolingRequestOutput]

Apply pooling to the hidden states corresponding to the input prompts.

paramself

parampromptsPromptType | Sequence[PromptType]

The prompts to the LLM. You may pass a sequence of prompts for batch inference.

parampooling_paramsPoolingParams | Sequence[PoolingParams] | None

= None

The pooling parameters for pooling.

parampooling_taskPoolingTask | None

= None

Override the pooling task to use.

Returns

list

A list of PoolingRequestOutput objects containing the

methodembed(self, prompts, pooling_params=None) -> list[EmbeddingRequestOutput]

Generate an embedding vector for each prompt. Only applicable to embedding models.

paramself

parampromptsPromptType | Sequence[PromptType]

The prompts to the LLM. You may pass a sequence of prompts for batch embedding.

parampooling_paramsPoolingParams | Sequence[PoolingParams] | None

= None

The pooling parameters for pooling.

Returns

list

A list of EmbeddingRequestOutput objects containing the

methodscore

(self, data_1, data_2, /, *, truncate_prompt_tokens=None, pooling_params=None, chat_template=None) -> list[ScoringRequestOutput]

Generate similarity scores for all pairs \<text,text_pair>.

The inputs can be 1 -> 1, 1 -> N or N -> N. In the 1 - N case the data_1 input will be replicated N times to pair with the data_2 inputs.

Returns: A list of ScoringRequestOutput objects containing the generated scores in the same order as the input prompts.

paramself

paramdata_1PromptType | Sequence[PromptType]

Can be a single prompt or a list of prompts. When a list, it must have the same length as the data_2 list.

paramdata_2PromptType | Sequence[PromptType]

The data to pair with the query to form the input to the LLM.

paramtruncate_prompt_tokensint | None

= None

The number of tokens to truncate the prompt to.

parampooling_paramsPoolingParams | None

= None

The pooling parameters for pooling. If None, we use the default pooling parameters.

paramchat_templatestr | None

= None

The chat template to use for the scoring. If None, we use the model's default chat template.

Returns

list[furiosa_llm.outputs.ScoringRequestOutput]

methodshutdown(self)

Shutdown the LLM engine gracefully. Idempotent.

paramself

Returns

None

Key Methods

generate()

Generate text completions for the given prompts using sampling parameters. This is the primary method for text generation tasks.

When sampling_params is omitted or None, the defaults returned by get_default_sampling_params are used. The same behavior applies to chat and stream_generate. See Default Sampling Parameters from generation_config.json for details.

See the chat example for usage.

get_default_sampling_params()

Returns the SamplingParams used when generate(), chat(), or stream_generate() is called without an explicit sampling_params argument.

If the loaded artifact contains a generation_config.json file, its values populate the returned object; otherwise a plain SamplingParams() is returned. See Default Sampling Parameters from generation_config.json for the full list of honored fields and the resolution rules on the server side.

embed()

Generate embedding vectors for the given prompts. This method is only applicable to embedding models.

Parameters:

prompts (PromptType | Sequence[PromptType]): The prompts to encode. Can be a single prompt or a sequence for batch processing.
pooling_params (PoolingParams | Sequence[PoolingParams] | None): The pooling parameters. If None, default parameters are used.

Returns:

List[EmbeddingRequestOutput]: A list of embedding outputs containing the embedding vectors in the same order as the input prompts.

Example:

python

from furiosa_llm import LLM, PoolingParams

with LLM(artifact_path="path/to/embedding/model") as llm:
    # Single embedding
    outputs = llm.embed("Hello, world!")
    embedding = outputs[0].outputs.embedding

    # Batch embedding with normalization disabled
    params = PoolingParams(normalize=False)
    outputs = llm.embed(["First text", "Second text"], pooling_params=params)

See the embedding example for more details.

score()

Generate similarity scores for text pairs. This method is only supported for binary classification models, including Qwen3-Reranker models or models converted using as_binary_seq_cls_model.

Parameters:

data_1 (PromptType | Sequence[PromptType]): The first input text(s). Can be a single prompt or a list.
data_2 (PromptType | Sequence[PromptType]): The second input text(s) to pair with the first.
truncate_prompt_tokens (int | None): Maximum number of tokens to truncate the prompt to. If None, no truncation is applied.
pooling_params (PoolingParams | None): The pooling parameters. If None, default parameters are used.
chat_template (str | None): Custom chat template for scoring. If None, the model's default template is used.

Input Patterns:

1-to-1: Single text paired with single text
1-to-N: Single text paired with multiple texts (data_1 is replicated N times)
N-to-N: Multiple texts paired element-wise (both lists must have the same length)

Returns:

List[ScoringRequestOutput]: A list of scoring outputs containing similarity scores in the same order as the input pairs.

Example:

python

from furiosa_llm import LLM, PoolingParams

with LLM(artifact_path="path/to/reranker/model") as llm:
    # 1-to-N scoring: one query against multiple documents
    query = "What is machine learning?"
    documents = [
        "Machine learning is a subset of AI",
        "Python is a programming language",
        "Deep learning uses neural networks"
    ]

    outputs = llm.score(query, documents)
    for i, output in enumerate(outputs):
        print(f"Document {i}: score = {output.outputs.score}")

See the score example for more details.

LLM class

Attributes

Methods

Key Methods

generate()

get_default_sampling_params()

embed()

score()

On this page