Furiosa-LLM

A high-performance inference engine for LLM and multi-modal LLM models, offering state-of-the-art serving efficiency and optimizations.

Furiosa-LLM is a high-performance inference engine for LLM and multi-modal LLM models. Furiosa-LLM offers state-of-the-art serving efficiency and optimizations. Key features of Furiosa-LLM include:

vLLM-compatible API (LLM, LLMEngine, AsyncLLMEngine API)
Efficient KV cache management with PagedAttention
Continuous batching of incoming requests
Quantization: FP8 (Planned: INT4, INT8, GPTQ, AWQ)
Support for data, tensor, and pipeline parallelism across multiple NPUs
OpenAI-compatible API server
Various decoding algorithms: greedy search, top-k/top-p, and speculative decoding (planned for 2026.3)
Tool calling and reasoning parser support
Structured output generation (choice, regex, json schema, grammar)
Chunked Prefill
Integration with Hugging Face models and hub support
Hugging Face PEFT support (planned)

Documentation

Getting Started with Furiosa-LLM: A quick start guide to Furiosa-LLM
Furiosa-LLM OpenAI-Compatible Server: Details about the OpenAI-compatible server and its features
Responses API: Guide to the OpenResponses-compatible Responses API
Tool Calling: Guide to tool calling with parsers and choice options
Structured Output: Guide to structured output generation
Vision-Language Models: Guide to serving Vision-Language models with image inputs
Prefix Caching: Guide to prefix caching for improved performance
Hybrid KV Cache: Understanding hybrid KV cache management
Data Parallel Routing: Understanding scoring-based data-parallel routing
Model Preparation: How to prepare LLM models to be served by Furiosa-LLM
Model Parallelism: A guide to model parallelism in Furiosa-LLM
API Reference: Python API reference for Furiosa-LLM
Examples: Examples of using Furiosa-LLM
Kubernetes Deployment: A guide to deploying Furiosa-LLM on Kubernetes

Furiosa-LLM

Documentation

On this page