AsyncLLMEngine class
Asynchronous interface for text generation with Furiosa-LLM, configurable through command-line arguments.
Overview
AsyncLLMEngine exposes text generation as an asynchronous stream of request outputs. Its engine options can be populated from command-line arguments through AsyncEngineArgs.
Example Usage
    import argparse
    import asyncio

    from furiosa_llm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams


    async def main():
        # Build the engine configuration from command-line arguments.
        parser = argparse.ArgumentParser()
        parser = AsyncEngineArgs.add_cli_args(parser)
        args = parser.parse_args()
        engine_args = AsyncEngineArgs.from_cli_args(args)
        engine = AsyncLLMEngine.from_engine_args(engine_args)

        example_input = {
            "prompt": "What is LLM?",
            "temperature": 0.0,
            "request_id": "request-123",
        }

        # generate() returns an async generator of RequestOutput objects.
        results_generator = engine.generate(
            example_input["prompt"],
            SamplingParams(temperature=example_input["temperature"]),
            example_input["request_id"],
        )

        # Drain the stream and keep only the last output.
        final_output = None
        async for request_output in results_generator:
            final_output = request_output
        print(final_output)


    if __name__ == "__main__":
        asyncio.run(main())

The script accepts the arguments defined in AsyncEngineArgs, as shown in the following example:

    python async_llm_engine.py --model /path/to/model --devices npu:0

For a comprehensive list of available arguments for AsyncEngineArgs, please refer to the section below.
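The --devices value in the command above uses the "npu:X" or "npu:X:Y" notation described in the argument list. The following is a minimal sketch of how such a string could be split into its parts. DeviceSpec and parse_devices are hypothetical helpers written for illustration only; they are not part of furiosa_llm, which may validate the value differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DeviceSpec:
    """Hypothetical container for one parsed device entry (not a furiosa_llm type)."""

    npu_index: int
    # Inclusive (first, last) core range, or None when the whole NPU is requested.
    cores: Optional[Tuple[int, int]] = None


def parse_devices(value: str) -> List[DeviceSpec]:
    """Parse a comma-separated device string such as "npu:0,npu:1:0-3"."""
    specs = []
    for item in value.split(","):
        parts = item.strip().split(":")
        if len(parts) not in (2, 3) or parts[0] != "npu":
            raise ValueError(f"invalid device: {item!r}")
        npu_index = int(parts[1])
        cores = None
        if len(parts) == 3:
            # "0" means a single core; "0-3" means the fused range 0..3.
            first, sep, last = parts[2].partition("-")
            cores = (int(first), int(last) if sep else int(first))
        specs.append(DeviceSpec(npu_index, cores))
    return specs


print(parse_devices("npu:0"))            # whole NPU 0
print(parse_devices("npu:0:0"))          # core 0 of NPU 0
print(parse_devices("npu:0:0-3,npu:1"))  # fused cores 0-3 of NPU 0, plus whole NPU 1
```

Malformed entries such as "gpu:0" or "npu:0:0-" raise ValueError in this sketch.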
Arguments supported by AsyncLLMEngine
The arguments are identical to those specified in Arguments supported by LLMEngine.
    usage: engine_cli_async.py [-h] --model MODEL [--revision REVISION]
                               [--tokenizer TOKENIZER] [--tokenizer-mode TOKENIZER_MODE]
                               [--seed SEED] [--devices DEVICES]
                               [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                               [--data-parallel-size DATA_PARALLEL_SIZE]
                               [--cache-dir CACHE_DIR]
                               [--npu-queue-limit NPU_QUEUE_LIMIT]
                               [--max-processing-samples MAX_PROCESSING_SAMPLES]
                               [--spare-blocks-ratio SPARE_BLOCKS_RATIO]
                               [--enable-jit-compilation ENABLE_JIT_COMPILATION]
                               [--jit-threshold JIT_THRESHOLD]
                               [--jit-max-workers JIT_MAX_WORKERS]
                               [--jit-unit-size JIT_UNIT_SIZE]

    options:
      -h, --help            show this help message and exit
      --model MODEL         The Hugging Face model id, or path to Furiosa model artifact.
                            Currently only one model is supported per server.
      --revision REVISION   The specific model revision on Hugging Face Hub if the model
                            is given as a Hugging Face model id. It can be a branch name,
                            a tag name, or a commit id. Its default value is main.
                            However, if a given model belongs to the furiosa-ai
                            organization, the model will use the release model tag by
                            default.
      --tokenizer TOKENIZER
                            The name or path of a HuggingFace Transformers tokenizer.
      --tokenizer-mode TOKENIZER_MODE
                            The tokenizer mode. "auto" will use the fast tokenizer if
                            available, and "slow" will always use the slow tokenizer.
      --seed SEED           The seed to initialize the random number generator for
                            sampling.
      --devices DEVICES     The devices to run the model. It can be a single device or a
                            comma-separated list of devices. Each device can be either
                            "npu:X" or "npu:X:Y", where X is a device index and Y is a
                            NPU core range notation (e.g. "npu:0" for whole npu 0,
                            "npu:0:0" for core 0 of NPU 0, and "npu:0:0-3" for fused core
                            0-3 of npu 0). If not given, all available unoccupied devices
                            will be used.
      --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                            The size of the pipeline parallelism group. If not given, it
                            will use the value from artifact.
      --data-parallel-size DATA_PARALLEL_SIZE
                            The size of the data parallelism group. If not given, it will
                            be inferred from total available PEs and other parallelism
                            degrees.
      --cache-dir CACHE_DIR
                            The cache directory for temporarily generated files for this
                            LLM instance. When its value is `None`, caching is
                            disabled. The default is "$HOME/.cache/furiosa/llm".
      --npu-queue-limit NPU_QUEUE_LIMIT
                            The NPU queue limit of the scheduler config.
      --max-processing-samples MAX_PROCESSING_SAMPLES
                            The maximum processing samples. Used as an hint for the
                            scheduler.
      --spare-blocks-ratio SPARE_BLOCKS_RATIO
                            The spare blocks ratio. Used as an hint for the scheduler.
      --enable-jit-compilation ENABLE_JIT_COMPILATION
                            [EXPERIMENTAL] Enable JIT compilation.
      --jit-threshold JIT_THRESHOLD
                            [EXPERIMENTAL] Number of requests before triggering JIT
                            compilation.
      --jit-max-workers JIT_MAX_WORKERS
                            [EXPERIMENTAL] Maximum concurrent background JIT
                            compilations.
      --jit-unit-size JIT_UNIT_SIZE
                            [EXPERIMENTAL] Number of stages to compile together. Must be
                            >= 2.

API Reference
class AsyncLLMEngine
AsyncLLMEngine receives requests and generates texts asynchronously.
Implements the API interface compatible with vLLM's AsyncLLMEngine, but this class is based on furiosa-runtime and FuriosaAI NPU.
param self
param native_engine: NativeEngineLike
param tokenizer: AnyTokenizer
param task_type: GenerationTask | PoolingTask
param max_model_len: int
param llm: LLM | None = None
param model_path: str | None = None
param trust_remote_code: bool = False

Attributes
attribute native_engine = native_engine
attribute tokenizer = tokenizer
attribute task_type = task_type
attribute max_model_len = max_model_len
attribute model_path = model_path or (getattr(llm, 'model_id_or_path', None) if llm is not None else None)
attribute trust_remote_code = trust_remote_code
attribute request_ids = set()

Methods

method from_llm(cls, llm) -> AsyncLLMEngine
param cls
param llm: LLM
Returns
furiosa_llm.llm_engine.AsyncLLMEngine

method from_engine_args(cls, args) -> AsyncLLMEngine
Creates an AsyncLLMEngine from AsyncEngineArgs.
param cls
param args: AsyncEngineArgs
Returns
furiosa_llm.llm_engine.AsyncLLMEngine

method generate(self, prompt, sampling_params, request_id) -> AsyncGenerator[RequestOutput, None]
Generates text completions for a given prompt.
param self
param prompt: PromptType
    The prompt to the LLM. See PromptType for more details about the format of each input.
param sampling_params: SamplingParams
    The sampling parameters of the request.
param request_id: str
    The unique id of the request.
Returns
collections.abc.AsyncGenerator[furiosa_llm.api.RequestOutput, None]

method encode(self, prompt, pooling_params, request_id, lora_request=None, trace_headers=None, priority=None, truncate_prompt_tokens=None, tokenization_kwargs=None) -> AsyncGenerator[PoolingRequestOutput, None]
Apply pooling to the hidden states corresponding to the input prompts.
lora_request, trace_headers, truncate_prompt_tokens, and priority are not supported. They are just placeholders for compatibility with the vLLM API.
param self
param prompt: PromptType
param pooling_params: PoolingParams
param request_id: str
param lora_request = None
param trace_headers = None
param priority = None
param truncate_prompt_tokens: int | None = None
param tokenization_kwargs: dict[str, Any] | None = None
Returns
collections.abc.AsyncGenerator[furiosa_llm.api.PoolingRequestOutput, None]

method abort(self, request_id) -> None
Aborts a request with the given ID.
param self
param request_id: str
Returns
None
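
The calling pattern for generate and abort can be tried without NPU hardware by using a stand-in object. The sketch below is illustrative only: FakeAsyncEngine is a hypothetical class that copies just the call shape of generate(prompt, sampling_params, request_id) and abort(request_id), yields plain strings instead of RequestOutput objects, and ignores sampling_params. Making abort a coroutine is also an assumption, since this page does not state whether the real method must be awaited.

```python
import asyncio
from typing import AsyncGenerator, List, Optional


class FakeAsyncEngine:
    """Hypothetical stand-in; it only mimics the generate()/abort() call shape."""

    def __init__(self) -> None:
        self.aborted = set()

    async def generate(
        self, prompt: str, sampling_params: object, request_id: str
    ) -> AsyncGenerator[str, None]:
        text = ""
        for word in ["echo:"] + prompt.split():
            if request_id in self.aborted:
                return  # stop streaming once the request has been aborted
            await asyncio.sleep(0)  # hand control back to the event loop
            text = (text + " " + word).strip()
            yield text  # each item is the text accumulated so far

    async def abort(self, request_id: str) -> None:
        # Assumption: abort is awaitable, as in this stand-in only.
        self.aborted.add(request_id)


async def collect_final(engine, prompt: str, request_id: str) -> Optional[str]:
    """Drain one stream and keep only its last item."""
    final_output = None
    async for output in engine.generate(prompt, None, request_id):
        final_output = output
    return final_output


async def abort_after_first(engine, prompt: str, request_id: str) -> List[str]:
    """Cancel a request as soon as its first item arrives."""
    outputs = []
    async for output in engine.generate(prompt, None, request_id):
        outputs.append(output)
        await engine.abort(request_id)
    return outputs


async def run_demo() -> dict:
    engine = FakeAsyncEngine()
    prompts = {"request-1": "What is LLM?", "request-2": "What is NPU?"}
    # Each request uses its own unique request_id; gather runs them concurrently.
    finals = await asyncio.gather(
        *(collect_final(engine, prompt, rid) for rid, prompt in prompts.items())
    )
    return dict(zip(prompts, finals))


results = asyncio.run(run_demo())
print(results)
```

With a real AsyncLLMEngine, the same collect_final loop would receive RequestOutput objects, and a SamplingParams instance would be passed in place of None.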