Vision-Language Models

Serve Vision-Language (VL) models that accept text and images through the OpenAI-compatible Chat Completions API with Furiosa-LLM.

Furiosa-LLM serves Vision-Language (VL) models that accept text and images through the OpenAI-compatible Chat Completions API. The first supported architecture is Qwen3-VL (e.g., Qwen/Qwen3-VL-32B-Instruct); see Supported Models for the authoritative list.

Image inputs may be passed as a remote https:// URL, a base64 data: URL, or a local file:// path. Preprocessing (resizing, patching, prompt-token expansion) is handled by the Hugging Face AutoProcessor that ships with the model.

Quick Start

Launch a server against a compiled VL artifact:

bash

furiosa-llm serve Qwen/Qwen3-VL-32B-Instruct \
    --fxb /path/to/qwen3-vl-32b.fxb

Send a chat-completion request with an image and a text prompt:

bash

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-VL-32B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://developer.furiosa.ai/v2026.2.0/en/_images/Furiosa_RNGD_image.png"
                    }
                },
                {"type": "text", "text": "Describe this image in one short sentence."}
            ]
        }
    ],
    "max_tokens": 128
    }' \
    | python -m json.tool

(see the Python Example section below for a Python client example).

Image Input Formats

The image_url.url field in a chat-completion message accepts three URL schemes:

Remote URL (http:// / https://) — the server fetches the image with httpx.

json

{
  "type": "image_url",
  "image_url": {"url": "https://example.com/photo.png"}
}

Base64 data URL — useful for inline payloads:

json

{
  "type": "image_url",
  "image_url": {"url": "data:image/png;base64,iVBORw0KGgoAAAANSUh..."}
}

Local file (file://) — disabled by default; requires the --allowed-local-media-path server flag (see below).
json
```
{
  "type": "image_url",
  "image_url": {"url": "file:///srv/media/photo.png"}
}
```

A request may contain multiple image_url parts interleaved with text parts. The model's chat template controls how image placeholders are rendered into the prompt:

OpenAI-format chat templates (the common case) iterate the content array themselves, so image placeholders land at their original positions automatically — no extra server flag needed.
String-format chat templates flatten content into a single text string. By default, placeholders for any images are prepended to the prompt; pass --interleave-mm-strings to keep them at their original positions instead.

Server Options

The following furiosa-llm serve flags configure multimodal behaviour. Requests that violate any of them are rejected with HTTP 400 before reaching the NPU.

Flag	Default	Description
`--image-limit-per-prompt N`	unlimited	Maximum number of images allowed per request. Exceeding the limit returns HTTP 400.
`--video-limit-per-prompt N`	unlimited	Maximum number of videos allowed per request. Exceeding the limit returns HTTP 400.
`--allowed-local-media-path PATH`	disabled	Allow `file://` URLs whose resolved path is under `PATH`. Required to serve local images; other `file://` URLs are rejected.
`--allowed-media-domains D [D …]`	all allowed	Whitelist of remote domains for SSRF protection. When set, only images from listed domains are fetched; others return HTTP 400.
`--interleave-mm-strings`	off	See Image Input Formats above. No-op for the common OpenAI-format chat templates.
`--mm-processor-cache-gb GB`	`4.0`	Size of the UUID-keyed multimodal cache in GiB. Set to `0` to disable caching entirely. See Caching Multimodal Inputs with UUID below.

Example with multiple allowed domains (nargs="+"):

bash

furiosa-llm serve Qwen/Qwen3-VL-32B-Instruct \
    --fxb /path/to/qwen3-vl-32b.fxb \
    --allowed-media-domains developer.furiosa.ai cdn.example.com

WARNING

Local file access (file://) is disabled unless --allowed-local-media-path is set. Treat it as a security boundary — only point it at directories whose contents you are willing to expose to chat-completion clients.

TIP

For production deployments fetching from the public internet, restrict --allowed-media-domains to the smallest set you need to limit SSRF and bandwidth-amplification risk.

Caching Multimodal Inputs with UUID

You can optionally tag each image or video on a request with a stable uuid. The server caches the processed result under that UUID and reuses it on follow-up requests carrying the same UUID, skipping pre-processing.

Note that items without a uuid are not implicitly hashed by content; only explicitly tagged items participate in the cache.

Add uuid next to image_url on the content part:

json

{
  "type": "image_url",
  "image_url": {"url": "https://example.com/receipt.png"},
  "uuid": "receipt-2026-05-01-abc"
}

A follow-up may then omit the image bytes ("image_url": null) and reference the upload by uuid alone:

json

{
  "type": "image_url",
  "image_url": null,
  "uuid": "receipt-2026-05-01-abc"
}

If no entry for that UUID exists (unknown UUID or cache eviction), the server re-runs pre-processing when image_url is also supplied; otherwise it returns HTTP 400 unknown multimodal uuid, and the client should retry with the image attached. The same shape is available offline via multi_modal_uuids (see Offline API below).

Cache capacity is controlled by --mm-processor-cache-gb (default 4.0, 0 disables). The cap is on the total size of cached entries; the least-recently-used entries are evicted when the cache is full.

Python Example

Using the OpenAI Python client against a Furiosa-LLM server:

python

"""
OpenAI-Compatible Server - Vision-Language Chat Completion Example

Sends a chat-completion request containing an image and a text prompt
to a running Furiosa-LLM server.

The image URL below is hosted on the Furiosa developer documentation
site and is intended to be stable. For other image-input forms (base64
``data:`` URLs and local ``file://`` paths), see the Vision-Language
Models documentation page.
"""

import os

from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(base_url=base_url, api_key=api_key)

IMAGE_URL = "https://developer.furiosa.ai/v2026.2.0/en/_images/Furiosa_RNGD_image.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                {"type": "text", "text": "Describe this image in one short sentence."},
            ],
        }
    ],
    max_completion_tokens=32,
    temperature=0.0,
)

print(response.choices[0].message.content)

Offline API

The same image inputs are accepted by the offline LLM / LLMEngine APIs via the multi_modal_data field on the prompt dict. Pass PIL images keyed by modality:

python

from PIL import Image
from furiosa_llm import LLM, SamplingParams

image = Image.open("photo.png")
prompts = [
    {
        "prompt": "<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
        "multi_modal_data": {"image": [image]},
    }
]

with LLM("Qwen/Qwen3-VL-32B-Instruct", fxb="/path/to/qwen3-vl-32b.fxb") as llm:
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)

The UUID cache described in Cached Inputs (UUID) above is also available through the offline API. Add multi_modal_uuids alongside multi_modal_data, in the same order; pass None for items the client does not wish to cache:

python

prompts = [
    {
        "prompt": "<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
        "multi_modal_data": {"image": [image]},
        "multi_modal_uuids": {"image": ["receipt-2026-05-01-abc"]},
    }
]

A follow-up prompt may then omit the image and reference it by UUID alone — pass None in the multi_modal_data slot:

python

followup = [
    {
        "prompt": "<|vision_start|><|image_pad|><|vision_end|>Translate to Korean.",
        "multi_modal_data": {"image": [None]},
        "multi_modal_uuids": {"image": ["receipt-2026-05-01-abc"]},
    }
]

See API Reference for the full API surface.

Vision-Language Models

Quick Start

Image Input Formats

Server Options

Caching Multimodal Inputs with UUID

Python Example

Offline API

See Also

On this page