Responses API

Furiosa-LLM's OpenResponses-compatible API for text generation, supporting streaming, multi-turn conversations, tool calling, and structured output.

Furiosa-LLM implements the OpenResponses specification, a multi-provider, interoperable LLM interface. The Responses API is available for text generation models with a chat template. It supports text input, streaming, multi-turn conversations, tool calling with tool_choice control (see Tool Calling Support), and structured output via JSON Schema.

NOTE

The following features from the OpenResponses specification are not yet supported:

Multimodal inputs — input_image, input_audio, and input_file content types are accepted but silently ignored.
Built-in tools — web_search, file_search, code_interpreter, computer_use, and mcp tools are not supported. Only custom function tools are available.
Background processing — background=true is accepted but has no effect.
Auto-truncation — truncation="auto" is accepted but only "disabled" is implemented.

Endpoints

POST /v1/responses — Create a response (streaming or non-streaming).
GET /v1/responses/{response_id} — Retrieve a previously stored response.
POST /v1/responses/{response_id}/cancel — Cancel an in-progress response.
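
As a rough illustration of how the three endpoints map to HTTP calls, the sketch below builds (but does not send) one request per endpoint using only the Python standard library. The base URL, the `build_request` helper, and the `resp_123` ID are assumptions made for illustration; retrieval and cancellation also require the response store described in the next section.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed local server address


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for a Responses API endpoint."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


response_id = "resp_123"  # hypothetical ID, for illustration only

create = build_request("POST", "/responses", {"model": "any", "input": "Hello"})
retrieve = build_request("GET", f"/responses/{response_id}")
cancel = build_request("POST", f"/responses/{response_id}/cancel")

for req in (create, retrieve, cancel):
    print(req.get_method(), req.full_url)

# To send a request against a running server:
# with urllib.request.urlopen(create) as resp:
#     print(json.load(resp))
```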

Response Store

The response store keeps responses in memory so they can be retrieved later or referenced by previous_response_id for multi-turn conversations. The store is disabled by default and must be explicitly enabled with the --enable-responses-api-store server option:

bash

furiosa-llm serve [ARTIFACT_PATH] --enable-responses-api-store

The following server options control the store behavior:

Option	Default	Description
`--enable-responses-api-store`	false	Enable the in-memory response store.
`--responses-api-store-max-entries`	10000	Maximum number of responses to keep. Oldest entries are evicted when the limit is reached.
`--responses-api-store-ttl`	3600	Time-to-live for stored responses in seconds.
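
To make the interaction between the entry limit and the time-to-live concrete, here is a toy model of a bounded in-memory store. It is not Furiosa-LLM's implementation: the `ResponseStoreSketch` class, the injectable clock, and the reading of "oldest" as "earliest inserted" are all assumptions chosen for illustration.

```python
import time
from collections import OrderedDict


class ResponseStoreSketch:
    """Illustrative model only; not the server's actual implementation."""

    def __init__(self, max_entries=10000, ttl=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._entries = OrderedDict()  # response_id -> (stored_at, response)

    def put(self, response_id, response):
        self._entries[response_id] = (self.clock(), response)
        self._entries.move_to_end(response_id)
        # Assumption: "oldest" means the earliest-inserted entry.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, response_id):
        item = self._entries.get(response_id)
        if item is None:
            return None
        stored_at, response = item
        if self.clock() - stored_at > self.ttl:
            # Entry has outlived its time-to-live; drop it.
            del self._entries[response_id]
            return None
        return response


now = [0.0]  # fake clock so the example does not need to sleep
store = ResponseStoreSketch(max_entries=2, ttl=3600, clock=lambda: now[0])
store.put("resp_a", "first")
store.put("resp_b", "second")
store.put("resp_c", "third")   # exceeds max_entries, so resp_a is evicted
print(store.get("resp_a"))     # None
print(store.get("resp_c"))     # third
now[0] = 3601.0                # advance the fake clock past the TTL
print(store.get("resp_c"))     # None
```

In this model, a response can become unavailable either because newer responses pushed it out or because it expired, so clients that rely on previous_response_id may want to handle a missing response.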

When the store is enabled and store=true is set in the request (the default), the server stores the response and its full chat message history, enabling:

Response retrieval via GET /v1/responses/{response_id}.
Response cancellation via POST /v1/responses/{response_id}/cancel.
Multi-turn conversations via previous_response_id.

To skip storage for a specific request, set store=false.

When the store is disabled, GET /v1/responses/{response_id}, POST /v1/responses/{response_id}/cancel, and previous_response_id are not available.

NOTE

Stored responses and their conversation histories are held in memory and are lost when the server restarts. Monitor memory consumption on the server if you expect a large number of stored responses.

Multi-Turn Conversations

The Responses API supports two methods for multi-turn conversations:

Using previous_response_id (recommended)

The server automatically prepends the stored conversation history from the referenced response. This is the simplest approach, but it requires the response store to be enabled on the server and store=true on the referenced response.

python

import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

# Turn 1: store the response for later continuation
res1 = client.responses.create(
    model=model,
    input="My name is Alice.",
    store=True,
)
print(f"Turn 1: {res1.output[0].content[0].text}")
print(f"Response ID: {res1.id}")

# Turn 2: continue the conversation using previous_response_id
res2 = client.responses.create(
    model=model,
    input="What is my name?",
    previous_response_id=res1.id,
    store=True,
)
print(f"Turn 2: {res2.output[0].content[0].text}")

Using manual context

Alternatively, you can manually build the conversation context by appending previous output items to the input array:

python

# Reuses `client` and `model` from the previous example.
# Turn 1
res1 = client.responses.create(model=model, input="My name is Alice.")

# Turn 2: manually include previous context
context = [{"role": "user", "content": "My name is Alice."}]
context += res1.output
context.append({"role": "user", "content": "What is my name?"})
res2 = client.responses.create(model=model, input=context)
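
For conversations longer than two turns, the same pattern can be wrapped in a small helper. The sketch below is one possible shape, not part of Furiosa-LLM or the OpenAI SDK: `extend_context` is a hypothetical name, and plain dicts stand in for the SDK objects that `res1.output` would contain so that the example runs without a server.

```python
def extend_context(context, output_items, next_user_message):
    """Return a new input array: prior context + model output + next user turn."""
    return [
        *context,
        *output_items,
        {"role": "user", "content": next_user_message},
    ]


# Stand-in for res1.output; a real response returns SDK objects rather
# than plain dicts.
fake_output = [{"role": "assistant", "content": "Nice to meet you, Alice."}]

context = [{"role": "user", "content": "My name is Alice."}]
context = extend_context(context, fake_output, "What is my name?")

for item in context:
    print(item["role"], "->", item["content"])

# With a running server, the result is passed as the input array:
# res2 = client.responses.create(model=model, input=context)
```

Unlike previous_response_id, this approach does not depend on the response store, at the cost of resending the full history with every request.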

Examples

Basic usage:

python

import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

# Non-streaming
response = client.responses.create(
    model=model,
    input="What is the capital of France?",
)
print(response.output[0].content[0].text)

Streaming:

python

import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

with client.responses.stream(
    model=model,
    input="What is the capital of France?",
) as stream:
    for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
print()
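
If the full text is needed after streaming finishes, the deltas can be collected rather than printed. The following sketch assumes the same `response.output_text.delta` event type used above; `collect_text` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real stream events so that the example runs offline.

```python
from types import SimpleNamespace


def collect_text(events):
    """Concatenate the text deltas from an iterable of streaming events."""
    parts = []
    for event in events:
        # Ignore lifecycle events; only text deltas carry output text.
        if event.type == "response.output_text.delta":
            parts.append(event.delta)
    return "".join(parts)


# Stand-in events; with a live server, pass the `stream` object instead.
fake_events = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="The capital "),
    SimpleNamespace(type="response.output_text.delta", delta="is Paris."),
    SimpleNamespace(type="response.completed"),
]
print(collect_text(fake_events))
```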

API Reference

Parameters without descriptions inherit their behavior and functionality from the corresponding parameters in the OpenResponses specification.

NOTE

For sampling-related fields (temperature, top_p, top_k, max_output_tokens), each field resolves in this order:

The value specified in the request body, if set.
The value from the model's generation_config.json, if present (max_output_tokens is populated from max_new_tokens).
The API default shown in the parameter table below.

See Default Sampling Parameters from generation_config.json for the supported-field mapping and offline-API behavior.
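
The resolution order above can be sketched as a small function. This is an illustration of the documented precedence, not the server's code: `resolve_sampling` and the two lookup tables are hypothetical names, and treating a `None` request value as "not set" is an assumption.

```python
# API defaults, taken from the parameter table below.
API_DEFAULTS = {
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": -1,
    "max_output_tokens": None,
}

# generation_config.json uses max_new_tokens for max_output_tokens.
CONFIG_KEYS = {
    "temperature": "temperature",
    "top_p": "top_p",
    "top_k": "top_k",
    "max_output_tokens": "max_new_tokens",
}


def resolve_sampling(request, generation_config):
    """Resolve each sampling field: request, then config, then API default."""
    resolved = {}
    for field, default in API_DEFAULTS.items():
        config_key = CONFIG_KEYS[field]
        if request.get(field) is not None:  # assumption: None means "not set"
            resolved[field] = request[field]
        elif config_key in generation_config:
            resolved[field] = generation_config[config_key]
        else:
            resolved[field] = default
    return resolved


request = {"temperature": 0.2}
generation_config = {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 512}
print(resolve_sampling(request, generation_config))
# {'temperature': 0.2, 'top_p': 0.9, 'top_k': -1, 'max_output_tokens': 512}
```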

Name	Type	Default	Description
model	string		Required by the client, but the value is ignored on the server.
input	string or array		Text string, or an array of input items (messages, function call outputs). Multimodal content types (`input_image`, `input_audio`, `input_file`) are not yet supported; they are accepted but ignored.
instructions	string	null	System-level instructions prepended to the conversation.
stream	boolean	false
store	boolean	true	When true and `--enable-responses-api-store` is set, the response is stored in memory for later use via `previous_response_id`, `GET /v1/responses/{response_id}`, and `POST /v1/responses/{response_id}/cancel`. See Response Store.
temperature	float	1.0
top_p	float	1.0
top_k	integer	-1	Furiosa-LLM extension; not part of the OpenResponses specification.
max_output_tokens	integer	null	If null, the server uses the largest output length possible given the input length.
presence_penalty	float	0.0	Accepted for compatibility but not yet functional.
frequency_penalty	float	0.0	Accepted for compatibility but not yet functional.
tools	array	[]	Function tool definitions. Only custom function tools are supported.
tool_choice	string or object	"auto"	Controls how tools are invoked. Supported values: `"none"`: tool calling is disabled and model output is returned as text. `"auto"`: the server detects tool calls in model output using the configured tool parser (requires `--enable-auto-tool-choice` and `--tool-call-parser`). `"required"`: model output is parsed as a JSON array of tool calls (`[{"name": ..., "parameters": {...}}]`). `{"type": "function", "name": "<fn>"}`: all model output is treated as arguments for the named function.
text	object	null	Structured output configuration. Supports `text.format` with `type: "json_schema"` to constrain model output to a JSON Schema. Uses the same structured-output engine as the Chat Completions API.
previous_response_id	string	null	ID of a previously stored response to continue the conversation from.
truncation	string	"disabled"	Only `"disabled"` is currently supported.
reasoning	object	null	Accepted but not yet functional.
metadata	object	null
user	string	null
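
To show how the `tools`, `tool_choice`, and `text` parameters might look in a request body, the sketch below assembles two example payloads as plain dictionaries. Only the `tool_choice` object form and `text.format` with `type: "json_schema"` come from the table above; the layout of the tool definition and the `name` and `schema` keys follow common OpenAI-style conventions and are assumptions here, as are the `get_weather` tool and both helper functions.

```python
import json


def function_tool(name, description, parameters):
    """Build a custom function tool definition (layout is an assumption)."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": parameters,
    }


def json_schema_format(name, schema):
    """Build a `text` value that constrains output to a JSON Schema."""
    return {"format": {"type": "json_schema", "name": name, "schema": schema}}


weather_tool = function_tool(
    "get_weather",  # hypothetical tool name
    "Look up the current weather for a city.",
    {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

# Force the model output to be treated as arguments for get_weather.
tool_request = {
    "model": "any",  # the server ignores this value
    "input": "What is the weather in Paris?",
    "tools": [weather_tool],
    "tool_choice": {"type": "function", "name": "get_weather"},
}

# Constrain the model output to a JSON object with a single "city" field.
structured_request = {
    "model": "any",
    "input": "Name one city in France.",
    "text": json_schema_format(
        "city_answer",
        {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    ),
}

print(json.dumps(tool_request, indent=2))
print(json.dumps(structured_request, indent=2))
```

Either dictionary can be sent as the JSON body of POST /v1/responses, or unpacked into `client.responses.create(**tool_request)` when using the OpenAI Python client.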
