Furiosa-LLM

A high-performance inference engine for LLM and multi-modal LLM models, offering state-of-the-art serving efficiency and optimizations.

Furiosa-LLM is a high-performance inference engine for LLM and multi-modal LLM models. Furiosa-LLM offers state-of-the-art serving efficiency and optimizations. Key features of Furiosa-LLM include:

  • vLLM-compatible API (LLM, LLMEngine, AsyncLLMEngine API)
  • Efficient KV cache management with PagedAttention
  • Continuous batching of incoming requests
  • Quantization: FP8 (Planned: INT4, INT8, GPTQ, AWQ)
  • Support for data, tensor, and pipeline parallelism across multiple NPUs
  • OpenAI-compatible API server
  • Various decoding algorithms: greedy search, top-k/top-p, and speculative decoding (planned for 2026.3)
  • Tool calling and reasoning parser support
  • Structured output generation (choice, regex, json schema, grammar)
  • Chunked Prefill
  • Integration with Hugging Face models and hub support
  • Hugging Face PEFT support (planned)

Documentation

On this page