Furiosa-LLM
A high-performance inference engine for LLM and multi-modal LLM models, offering state-of-the-art serving efficiency and optimizations.
Furiosa-LLM is a high-performance inference engine for LLM and multi-modal LLM models. Furiosa-LLM offers state-of-the-art serving efficiency and optimizations. Key features of Furiosa-LLM include:
- vLLM-compatible API (LLM, LLMEngine, AsyncLLMEngine API)
- Efficient KV cache management with PagedAttention
- Continuous batching of incoming requests
- Quantization: FP8 (Planned: INT4, INT8, GPTQ, AWQ)
- Support for data, tensor, and pipeline parallelism across multiple NPUs
- OpenAI-compatible API server
- Various decoding algorithms: greedy search, top-k/top-p, and speculative decoding (planned for 2026.3)
- Tool calling and reasoning parser support
- Structured output generation (choice, regex, json schema, grammar)
- Chunked Prefill
- Integration with Hugging Face models and hub support
- Hugging Face PEFT support (planned)
Documentation
- Getting Started with Furiosa-LLM: A quick start guide to Furiosa-LLM
- Furiosa-LLM OpenAI-Compatible Server: Details about the OpenAI-compatible server and its features
- Responses API: Guide to the OpenResponses-compatible Responses API
- Tool Calling: Guide to tool calling with parsers and choice options
- Structured Output: Guide to structured output generation
- Vision-Language Models: Guide to serving Vision-Language models with image inputs
- Prefix Caching: Guide to prefix caching for improved performance
- Hybrid KV Cache: Understanding hybrid KV cache management
- Data Parallel Routing: Understanding scoring-based data-parallel routing
- Model Preparation: How to prepare LLM models to be served by Furiosa-LLM
- Model Parallelism: A guide to model parallelism in Furiosa-LLM
- API Reference: Python API reference for Furiosa-LLM
- Examples: Examples of using Furiosa-LLM
- Kubernetes Deployment: A guide to deploying Furiosa-LLM on Kubernetes