Responses API

Furiosa-LLM's OpenResponses-compatible API for text generation, supporting streaming, multi-turn conversations, tool calling, and structured output.

Furiosa-LLM implements the OpenResponses specification, a multi-provider, interoperable LLM interface. The Responses API is available for text generation models with a chat template. It supports text input, streaming, multi-turn conversations, tool calling with tool_choice control (see Tool Calling Support), and structured output via JSON Schema.

NOTE

The following features from the OpenResponses specification are not yet supported:

  • Multimodal inputs: input_image, input_audio, and input_file content types are accepted but silently ignored.
  • Built-in tools: web_search, file_search, code_interpreter, computer_use, and mcp tools are not supported. Only custom function tools are available.
  • Background processing: background=true is accepted but has no effect.
  • Auto-truncation: truncation="auto" is accepted, but only "disabled" is implemented.
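
Because several of these options are accepted without an error, a request can appear to succeed while part of it is ignored. The sketch below is a hypothetical client-side check (find_unsupported_features is not part of Furiosa-LLM or any SDK) that scans a request body, assumed to be a plain dict, for the cases listed above.

```python
# Hypothetical helper for illustration; the names below are not part of any library.
IGNORED_CONTENT_TYPES = {"input_image", "input_audio", "input_file"}
UNSUPPORTED_TOOL_TYPES = {
    "web_search", "file_search", "code_interpreter", "computer_use", "mcp",
}


def find_unsupported_features(payload):
    """Return warnings for request options the note above lists as unsupported."""
    warnings = []

    # Multimodal content parts inside array-style input items.
    items = payload.get("input")
    if isinstance(items, list):
        for item in items:
            content = item.get("content") if isinstance(item, dict) else None
            if not isinstance(content, list):
                continue
            for part in content:
                if isinstance(part, dict) and part.get("type") in IGNORED_CONTENT_TYPES:
                    warnings.append(f"content type {part['type']!r} is silently ignored")

    # Built-in tools; only custom function tools are available.
    for tool in payload.get("tools") or []:
        if isinstance(tool, dict) and tool.get("type") in UNSUPPORTED_TOOL_TYPES:
            warnings.append(f"built-in tool {tool['type']!r} is not supported")

    if payload.get("background") is True:
        warnings.append("background=true has no effect")
    if payload.get("truncation") == "auto":
        warnings.append('truncation="auto" is not implemented')
    return warnings
```

Running the check before sending a request makes the silently ignored options visible in client logs.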

Endpoints

  • POST /v1/responses — Create a response (streaming or non-streaming).
  • GET /v1/responses/{response_id} — Retrieve a previously stored response.
  • POST /v1/responses/{response_id}/cancel — Cancel an in-progress response.
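
For clients that do not use an SDK, the three endpoints can be addressed with the Python standard library. The following is a minimal sketch that only builds the requests and does not send them. The base URL mirrors the default used in the examples on this page, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the server listens on the same default address as the examples on this page.
BASE_URL = "http://localhost:8000/v1"


def create_request(body):
    """Build POST /v1/responses with a JSON body."""
    return urllib.request.Request(
        f"{BASE_URL}/responses",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def retrieve_request(response_id):
    """Build GET /v1/responses/{response_id}."""
    return urllib.request.Request(f"{BASE_URL}/responses/{response_id}", method="GET")


def cancel_request(response_id):
    """Build POST /v1/responses/{response_id}/cancel with an empty body."""
    return urllib.request.Request(
        f"{BASE_URL}/responses/{response_id}/cancel", data=b"", method="POST"
    )
```

Each returned object can be passed to urllib.request.urlopen against a running server. The retrieve and cancel endpoints are only available when the response store is enabled.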

Response Store

The response store keeps responses in memory so they can be retrieved later or referenced by previous_response_id for multi-turn conversations. The store is disabled by default and must be explicitly enabled with the --enable-responses-api-store server option:

bash
furiosa-llm serve [ARTIFACT_PATH] --enable-responses-api-store

The following server options control the store behavior:

| Option | Default | Description |
| --- | --- | --- |
| --enable-responses-api-store | false | Enable the in-memory response store. |
| --responses-api-store-max-entries | 10000 | Maximum number of responses to keep. Oldest entries are evicted when the limit is reached. |
| --responses-api-store-ttl | 3600 | Time-to-live for stored responses, in seconds. |
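
To make the interaction between the entry limit and the TTL concrete, here is a toy model of a store with both bounds. It is not the server's implementation: the class name, the lazy expiry check on read, and the insertion-order eviction are all assumptions chosen for illustration.

```python
import time
from collections import OrderedDict


class ResponseStoreSketch:
    """Toy in-memory store with a size cap and a TTL (illustrative only)."""

    def __init__(self, max_entries=10000, ttl=3600.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self._clock = clock
        self._entries = OrderedDict()  # response_id -> (stored_at, response)

    def put(self, response_id, response):
        # Re-inserting an ID moves it to the newest position.
        self._entries.pop(response_id, None)
        self._entries[response_id] = (self._clock(), response)
        # Evict the oldest entries once the limit is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def get(self, response_id):
        entry = self._entries.get(response_id)
        if entry is None:
            return None
        stored_at, response = entry
        # Assumption: expiry is checked lazily when the entry is read.
        if self._clock() - stored_at > self.ttl:
            del self._entries[response_id]
            return None
        return response
```

In this model, a response ID can become unavailable either because newer responses pushed it out or because its TTL elapsed, so a client that relies on previous_response_id should be prepared for a lookup to fail.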

When the store is enabled and store=true is set in the request (the default), the server stores the response and its full chat message history, enabling:

  • Response retrieval via GET /v1/responses/{response_id}.
  • Response cancellation via POST /v1/responses/{response_id}/cancel.
  • Multi-turn conversations via previous_response_id.

To skip storage for a specific request, set store=false.

When the store is disabled, GET /v1/responses/{response_id}, POST /v1/responses/{response_id}/cancel, and previous_response_id are not available.

NOTE

Stored responses and their conversation histories are held in memory and are lost when the server restarts. Monitor memory consumption on the server if you expect a large number of stored responses.

Multi-Turn Conversations

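The Responses API supports two methods for multi-turn conversations: referencing a stored response with previous_response_id, or building the context manually.

Using previous_response_id

When a request sets previous_response_id, the server automatically prepends the stored conversation history from the referenced response. This is the simplest approach; it requires the response store to be enabled and store=true on the referenced response.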

python
import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

# Turn 1: store the response for later continuation
res1 = client.responses.create(
    model=model,
    input="My name is Alice.",
    store=True,
)
print(f"Turn 1: {res1.output[0].content[0].text}")
print(f"Response ID: {res1.id}")

# Turn 2: continue the conversation using previous_response_id
res2 = client.responses.create(
    model=model,
    input="What is my name?",
    previous_response_id=res1.id,
    store=True,
)
print(f"Turn 2: {res2.output[0].content[0].text}")

Using manual context

Alternatively, you can manually build the conversation context by appending previous output items to the input array:

python
# Reuses `client` and `model` from the previous example.

# Turn 1
res1 = client.responses.create(model=model, input="My name is Alice.")

# Turn 2: manually include previous context
context = [{"role": "user", "content": "My name is Alice."}]
context += res1.output
context.append({"role": "user", "content": "What is my name?"})
res2 = client.responses.create(model=model, input=context)
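
With manual context, the input grows by the previous output items plus one user message on every turn. The sketch below isolates that bookkeeping in a hypothetical helper, next_turn_input, and uses a hand-written stand-in for res1.output so it runs without a server. The shape of the stand-in output item is an assumption modeled on a Responses-style message item.

```python
def next_turn_input(history, previous_output, user_text):
    """Return a new input list: history, then previous output items, then the new user message."""
    return [*history, *previous_output, {"role": "user", "content": user_text}]


history = [{"role": "user", "content": "My name is Alice."}]

# Stand-in for res1.output (assumed shape); a real call returns SDK objects instead.
fake_output = [
    {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "output_text", "text": "Nice to meet you, Alice."}],
    }
]

turn2_input = next_turn_input(history, fake_output, "What is my name?")
```

The helper returns a new list instead of mutating history, so earlier turns can be reused or branched from.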

Examples

Basic usage:

python
import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

# Non-streaming
response = client.responses.create(
    model=model,
    input="What is the capital of France?",
)
print(response.output[0].content[0].text)

Streaming:

python
import os
from openai import OpenAI

base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(api_key=api_key, base_url=base_url)
model = client.models.list().data[0].id

with client.responses.stream(
    model=model,
    input="What is the capital of France?",
) as stream:
    for event in stream:
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
print()
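
If the full text is needed after streaming finishes, the deltas can be accumulated instead of printed. The following sketch uses simulated events in place of a live stream so that it runs offline. The helper name is hypothetical, and the non-delta event types shown are assumptions about what a stream may contain.

```python
from types import SimpleNamespace


def collect_output_text(events):
    """Concatenate the text deltas from a sequence of streaming events."""
    parts = []
    for event in events:
        # Only text delta events carry output text; other event types are skipped.
        if event.type == "response.output_text.delta":
            parts.append(event.delta)
    return "".join(parts)


# Simulated events standing in for a live stream.
simulated = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Par"),
    SimpleNamespace(type="response.output_text.delta", delta="is"),
    SimpleNamespace(type="response.completed"),
]
text = collect_output_text(simulated)
```

The same function can be called with the stream object from the example above, since it only relies on the type and delta attributes of each event.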

API Reference

Parameters without descriptions inherit their behavior and functionality from the corresponding parameters in the OpenResponses specification.

NOTE

For sampling-related fields (temperature, top_p, top_k, max_output_tokens), each field resolves in this order:

  1. The value specified in the request body, if set.
  2. The value from the model's generation_config.json, if present (max_output_tokens is populated from max_new_tokens).
  3. The API default shown in the parameter table below.

See Default Sampling Parameters from generation_config.json for the supported-field mapping and offline-API behavior.
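
The resolution order can be sketched as a small function. This is an illustration of the three steps, not the server's code. It assumes that generation_config.json uses the keys temperature, top_p, and top_k (only the max_new_tokens mapping is stated above) and that an explicit null in the request counts as unset.

```python
# API defaults taken from the parameter table on this page.
API_DEFAULTS = {
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": -1,
    "max_output_tokens": None,
}

# Request field -> generation_config.json key.
# Assumption: identical names except for max_output_tokens.
GENERATION_CONFIG_KEYS = {
    "temperature": "temperature",
    "top_p": "top_p",
    "top_k": "top_k",
    "max_output_tokens": "max_new_tokens",
}


def resolve_sampling(request, generation_config=None):
    """Resolve each sampling field: request value, then generation config, then API default."""
    generation_config = generation_config or {}
    resolved = {}
    for field, default in API_DEFAULTS.items():
        config_key = GENERATION_CONFIG_KEYS[field]
        if request.get(field) is not None:
            resolved[field] = request[field]
        elif generation_config.get(config_key) is not None:
            resolved[field] = generation_config[config_key]
        else:
            resolved[field] = default
    return resolved
```

For example, a request that sets only temperature against a generation config that defines temperature, top_k, and max_new_tokens keeps the request's temperature, takes top_k and max_output_tokens from the config, and falls back to the API default for top_p.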

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | | Required by the client, but the value is ignored by the server. |
| input | string or array | | Text string, or an array of input items (messages, function call outputs). Multimodal content types (input_image, input_audio, input_file) are not yet supported: they are accepted but silently ignored. |
| instructions | string | null | System-level instructions prepended to the conversation. |
| stream | boolean | false | |
| store | boolean | true | When true and --enable-responses-api-store is set, the response is stored in memory so it can be referenced by previous_response_id or retrieved via GET /v1/responses/{response_id}. See Response Store. |
| temperature | float | 1.0 | |
| top_p | float | 1.0 | |
| top_k | integer | -1 | Furiosa-LLM extension; not part of the OpenResponses specification. |
| max_output_tokens | integer | null | If null, the server uses the maximum possible length given the input. |
| presence_penalty | float | 0.0 | Accepted for compatibility but not yet functional. |
| frequency_penalty | float | 0.0 | Accepted for compatibility but not yet functional. |
| tools | array | [] | Function tool definitions. Only custom function tools are supported. |
| tool_choice | string or object | "auto" | Controls how tools are invoked. The supported values are listed after this table. |
| text | object | null | Structured output configuration. Supports text.format with type: "json_schema" to constrain model output to a JSON Schema. Uses the same structured-output engine as the Chat Completions API. |
| previous_response_id | string | null | ID of a previously stored response to continue the conversation from. |
| truncation | string | "disabled" | Only "disabled" is currently supported. |
| reasoning | object | null | Accepted but not yet functional. |
| metadata | object | null | |
| user | string | null | |

Supported values for tool_choice:

  • "none": disable tool calling; model output is returned as text.
  • "auto": the server detects tool calls in model output using the configured tool parser (requires --enable-auto-tool-choice and --tool-call-parser).
  • "required": model output is parsed as a JSON array of tool definitions ([{"name": ..., "parameters": {...}}]).
  • {"type": "function", "name": "<fn>"}: all model output is treated as arguments for the named function.
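
To show how tools, tool_choice, and text fit into a request, the sketch below builds two request bodies as plain dicts. The get_weather tool, the city_extraction schema, and the placeholder model name are hypothetical. The field layout follows the Responses-style request format as an assumption and should be checked against the OpenResponses specification.

```python
import json

# Hypothetical function tool used only for illustration.
weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Force a call to the named function; all model output is treated as its arguments.
tool_call_request = {
    "model": "placeholder",  # the value is ignored by the server, per the table above
    "input": "What is the weather in Paris?",
    "tools": [weather_tool],
    "tool_choice": {"type": "function", "name": "get_weather"},
}

# Constrain the output to a JSON Schema via text.format.
structured_request = {
    "model": "placeholder",
    "input": "Extract the city from: 'I live in Paris.'",
    "text": {
        "format": {
            "type": "json_schema",
            "name": "city_extraction",  # hypothetical schema name
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    },
}

print(json.dumps(tool_call_request, indent=2))
```

With the OpenAI Python client used elsewhere on this page, either dict can be passed as keyword arguments, for example client.responses.create(**tool_call_request).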
