AsyncLLMEngine class

Asynchronous interface for text generation with Furiosa-LLM, configurable through command-line arguments.

Overview

AsyncLLMEngine provides an asynchronous interface for text generation. Its settings can be supplied as command-line arguments through AsyncEngineArgs.

Example Usage

import argparse
import asyncio

from furiosa_llm import AsyncLLMEngine, AsyncEngineArgs, SamplingParams


async def main():
    # Build the engine from the command-line arguments defined by AsyncEngineArgs.
    parser = argparse.ArgumentParser()
    parser = AsyncEngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(engine_args)

    example_input = {
        "prompt": "What is LLM?",
        "temperature": 0.0,
        "request_id": "request-123",
    }

    # generate() returns an async generator of RequestOutput objects.
    results_generator = engine.generate(
        example_input["prompt"],
        SamplingParams(temperature=example_input["temperature"]),
        example_input["request_id"],
    )

    # Iterate to completion and keep the last RequestOutput that was yielded.
    final_output = None
    async for request_output in results_generator:
        final_output = request_output

    print(final_output)


if __name__ == "__main__":
    asyncio.run(main())

The script can be executed with various arguments defined in AsyncEngineArgs, as shown in the following example:

python async_llm_engine.py --model /path/to/model --devices npu:0

For a comprehensive list of available arguments for AsyncEngineArgs, please refer to the section below.
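
The add_cli_args / from_cli_args pair used in the example follows a common argparse pattern: one call registers flags on a parser, the other turns the parsed namespace into a typed arguments object. The sketch below illustrates that pattern only. DemoEngineArgs is a hypothetical stand-in that models two of the documented flags; it is not the AsyncEngineArgs implementation and does not import furiosa_llm.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DemoEngineArgs:
    """Hypothetical stand-in for AsyncEngineArgs; only two flags are modeled."""

    model: str
    devices: Optional[str] = None

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        # Register one flag per field and hand the parser back to the caller.
        parser.add_argument("--model", required=True)
        parser.add_argument("--devices", default=None)
        return parser

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace) -> "DemoEngineArgs":
        # Copy only the namespace attributes that correspond to dataclass fields.
        return cls(**{f.name: getattr(args, f.name) for f in fields(cls)})


parser = DemoEngineArgs.add_cli_args(argparse.ArgumentParser())
demo_args = DemoEngineArgs.from_cli_args(
    parser.parse_args(["--model", "/path/to/model", "--devices", "npu:0"])
)
print(demo_args)
```

Keeping flag registration and object construction in the same class means a script only needs to know the class name, which is how the example above stays short.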

Arguments supported by AsyncLLMEngine

The arguments are identical to those specified in Arguments supported by LLMEngine.

usage: engine_cli_async.py [-h] --model MODEL [--revision REVISION]
                           [--tokenizer TOKENIZER] [--tokenizer-mode TOKENIZER_MODE]
                           [--seed SEED] [--devices DEVICES]
                           [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                           [--data-parallel-size DATA_PARALLEL_SIZE]
                           [--cache-dir CACHE_DIR]
                           [--npu-queue-limit NPU_QUEUE_LIMIT]
                           [--max-processing-samples MAX_PROCESSING_SAMPLES]
                           [--spare-blocks-ratio SPARE_BLOCKS_RATIO]
                           [--enable-jit-compilation ENABLE_JIT_COMPILATION]
                           [--jit-threshold JIT_THRESHOLD]
                           [--jit-max-workers JIT_MAX_WORKERS]
                           [--jit-unit-size JIT_UNIT_SIZE]

options:
  -h, --help            show this help message and exit
  --model MODEL         The Hugging Face model id, or path to Furiosa model artifact.
                        Currently only one model is supported per server.
  --revision REVISION   The specific model revision on Hugging Face Hub if the model
                        is given as a Hugging Face model id. It can be a branch name,
                        a tag name, or a commit id. Its default value is main.
                        However, if a given model belongs to the furiosa-ai
                        organization, the model will use the release model tag by
                        default.
  --tokenizer TOKENIZER
                        The name or path of a HuggingFace Transformers tokenizer.
  --tokenizer-mode TOKENIZER_MODE
                        The tokenizer mode. "auto" will use the fast tokenizer if
                        available, and "slow" will always use the slow tokenizer.
  --seed SEED           The seed to initialize the random number generator for
                        sampling.
  --devices DEVICES     The devices to run the model. It can be a single device or a
                        comma-separated list of devices. Each device can be either
                        "npu:X" or "npu:X:Y", where X is a device index and Y is a
                        NPU core range notation (e.g. "npu:0" for whole npu 0,
                        "npu:0:0" for core 0 of NPU 0, and "npu:0:0-3" for fused core
                        0-3 of npu 0). If not given, all available unoccupied devices
                        will be used.
  --pipeline-parallel-size PIPELINE_PARALLEL_SIZE
                        The size of the pipeline parallelism group. If not given, it
                        will use the value from artifact.
  --data-parallel-size DATA_PARALLEL_SIZE
                        The size of the data parallelism group. If not given, it will
                        be inferred from total available PEs and other parallelism
                        degrees.
  --cache-dir CACHE_DIR
                        The cache directory for temporarily generated files for this
                        LLM instance. When its value is `None`, caching is
                        disabled. The default is "$HOME/.cache/furiosa/llm".
  --npu-queue-limit NPU_QUEUE_LIMIT
                        The NPU queue limit of the scheduler config.
  --max-processing-samples MAX_PROCESSING_SAMPLES
                        The maximum processing samples. Used as an hint for the
                        scheduler.
  --spare-blocks-ratio SPARE_BLOCKS_RATIO
                        The spare blocks ratio. Used as an hint for the scheduler.
  --enable-jit-compilation ENABLE_JIT_COMPILATION
                        [EXPERIMENTAL] Enable JIT compilation.
  --jit-threshold JIT_THRESHOLD
                        [EXPERIMENTAL] Number of requests before triggering JIT
                        compilation.
  --jit-max-workers JIT_MAX_WORKERS
                        [EXPERIMENTAL] Maximum concurrent background JIT
                        compilations.
  --jit-unit-size JIT_UNIT_SIZE
                        [EXPERIMENTAL] Number of stages to compile together. Must be
                        >= 2.
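
The --devices help text describes a small grammar: a comma-separated list of "npu:X" or "npu:X:Y" entries, where Y is a single core or a core range. As a rough illustration of that grammar, the sketch below parses such a string. DeviceSpec and parse_devices are hypothetical names introduced here; Furiosa-LLM performs its own parsing and validation, which may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class DeviceSpec:
    """Hypothetical container for one parsed device entry."""

    npu_index: int
    core_range: Optional[Tuple[int, int]]  # None means the whole NPU


def parse_devices(value: str) -> List[DeviceSpec]:
    """Parse a comma-separated --devices string into DeviceSpec entries."""
    specs = []
    for entry in value.split(","):
        parts = entry.strip().split(":")
        if parts[0] != "npu" or len(parts) not in (2, 3):
            raise ValueError(f"invalid device entry: {entry!r}")
        core_range = None
        if len(parts) == 3:
            # "0" is a single core; "0-3" is cores 0 through 3 fused together.
            first, _, last = parts[2].partition("-")
            core_range = (int(first), int(last or first))
        specs.append(DeviceSpec(int(parts[1]), core_range))
    return specs


print(parse_devices("npu:0,npu:1:0,npu:2:0-3"))
```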

API Reference

AsyncLLMEngine receives requests and generates text asynchronously. It implements an API compatible with vLLM's AsyncLLMEngine, but it is built on furiosa-runtime and runs on FuriosaAI NPUs.

param self

param native_engine: NativeEngineLike

param tokenizer: AnyTokenizer

param task_type: GenerationTask | PoolingTask

param max_model_len: int

param llm: LLM | None = None

param model_path: str | None = None

param trust_remote_code: bool = False

Attributes

attribute native_engine = native_engine

attribute tokenizer = tokenizer

attribute task_type = task_type

attribute max_model_len = max_model_len

attribute model_path = model_path or (getattr(llm, 'model_id_or_path', None) if llm is not None else None)

attribute trust_remote_code = trust_remote_code

attribute request_ids = set()
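
The model_path default above is a two-step fallback: an explicit model_path wins, otherwise the value is read from the llm object's model_id_or_path attribute if an llm was given. The sketch below reproduces that expression in a standalone function so the three cases can be checked. resolve_model_path is a hypothetical helper name, and SimpleNamespace stands in for a real LLM instance.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_model_path(model_path: Optional[str], llm: Optional[Any]) -> Optional[str]:
    # Same expression as the documented attribute default.
    return model_path or (
        getattr(llm, "model_id_or_path", None) if llm is not None else None
    )


fake_llm = SimpleNamespace(model_id_or_path="org/model")  # hypothetical stand-in for LLM

print(resolve_model_path("/explicit/path", fake_llm))  # explicit path takes priority
print(resolve_model_path(None, fake_llm))              # falls back to the llm attribute
print(resolve_model_path(None, None))                  # nothing to fall back to
```

Because the expression uses `or`, an empty string for model_path is treated the same as None and also triggers the fallback.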

Methods

method from_llm(cls, llm) -> AsyncLLMEngine

param cls

param llm: LLM

Returns

furiosa_llm.llm_engine.AsyncLLMEngine

method from_engine_args(cls, args) -> AsyncLLMEngine

Creates an AsyncLLMEngine from AsyncEngineArgs.

param cls

param args: AsyncEngineArgs

Returns

furiosa_llm.llm_engine.AsyncLLMEngine

method generate(self, prompt, sampling_params, request_id) -> AsyncGenerator[RequestOutput, None]

Generates text completions for a given prompt.

param self

param prompt: PromptType

The prompt to the LLM. See PromptType for more details about the format of each input.

param sampling_params: SamplingParams

The sampling parameters of the request.

param request_id: str

The unique id of the request.

Returns

collections.abc.AsyncGenerator[furiosa_llm.api.RequestOutput, None]
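
Because generate returns an async generator and each request carries its own request_id, several requests can be consumed concurrently from one event loop. The sketch below shows one way to do that with asyncio.gather. FakeEngine is a hypothetical stand-in that only mirrors the documented generate parameters and yields plain strings that grow with each step; that cumulative behavior is an assumption of the stand-in, and a real AsyncLLMEngine yields RequestOutput objects instead.

```python
import asyncio
from typing import AsyncGenerator, Dict, List, Optional


class FakeEngine:
    """Hypothetical stand-in exposing only generate(prompt, sampling_params, request_id)."""

    async def generate(
        self, prompt: str, sampling_params: object, request_id: str
    ) -> AsyncGenerator[str, None]:
        text = ""
        for word in ["echo:", *prompt.split()]:
            await asyncio.sleep(0)  # hand control back to the event loop between steps
            text = f"{text} {word}".strip()
            yield text  # stand-in behavior: each item is the text produced so far


async def last_output(engine: FakeEngine, prompt: str, request_id: str) -> Optional[str]:
    final = None
    async for output in engine.generate(prompt, None, request_id):
        final = output  # keep only the most recent item
    return final


async def run_all(prompts: List[str]) -> Dict[str, Optional[str]]:
    engine = FakeEngine()
    ids = [f"request-{i}" for i in range(len(prompts))]  # one unique id per request
    results = await asyncio.gather(
        *(last_output(engine, p, rid) for p, rid in zip(prompts, ids))
    )
    return dict(zip(ids, results))


outputs = asyncio.run(run_all(["What is LLM?", "What is an NPU?"]))
print(outputs)
```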

method encode(self, prompt, pooling_params, request_id, lora_request=None, trace_headers=None, priority=None, truncate_prompt_tokens=None, tokenization_kwargs=None) -> AsyncGenerator[PoolingRequestOutput, None]

Apply pooling to the hidden states corresponding to the input prompts.

lora_request, trace_headers, truncate_prompt_tokens, and priority are not supported. They are just placeholders for compatibility with the vLLM API.

param self

param prompt: PromptType

param pooling_params: PoolingParams

param request_id: str

param lora_request = None

param trace_headers = None

param priority = None

param truncate_prompt_tokens: int | None = None

param tokenization_kwargs: dict[str, Any] | None = None

Returns

collections.abc.AsyncGenerator[furiosa_llm.api.PoolingRequestOutput, None]

method abort(self, request_id) -> None

Aborts a request with the given ID.

param self

param request_id: str

Returns

None
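
A typical use of abort is to stop a request that has exceeded a time budget. The sketch below wraps consumption of the generator in asyncio.wait_for and calls abort with the same request_id on timeout. FakeEngine is a hypothetical stand-in, not the real class. This page does not state whether abort has to be awaited, so the sketch awaits the result only if it is awaitable; generate_with_timeout is likewise a name introduced here for illustration.

```python
import asyncio
import inspect
from typing import AsyncGenerator, Optional, Set


class FakeEngine:
    """Hypothetical stand-in exposing generate() and abort() with the documented parameters."""

    def __init__(self) -> None:
        self.aborted: Set[str] = set()

    async def generate(
        self, prompt: str, sampling_params: object, request_id: str
    ) -> AsyncGenerator[str, None]:
        for step in range(1000):
            await asyncio.sleep(0.01)  # simulate slow generation
            yield f"{prompt} [{step}]"

    async def abort(self, request_id: str) -> None:
        self.aborted.add(request_id)


async def generate_with_timeout(
    engine: FakeEngine, prompt: str, request_id: str, timeout: float
) -> Optional[str]:
    async def consume() -> Optional[str]:
        final = None
        async for output in engine.generate(prompt, None, request_id):
            final = output
        return final

    try:
        return await asyncio.wait_for(consume(), timeout)
    except asyncio.TimeoutError:
        # Tell the engine to stop working on the request we gave up on.
        result = engine.abort(request_id)
        if inspect.isawaitable(result):
            await result
        return None


engine = FakeEngine()
result = asyncio.run(generate_with_timeout(engine, "hello", "request-123", 0.05))
print(result, engine.aborted)
```

Cancelling the consumer task alone only stops the caller from reading; calling abort is what lets the engine release the request on its side.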
