Embedding
This example demonstrates how to generate embeddings using the LLM.embed() method.
Embeddings are dense vector representations of text that capture semantic meaning,
useful for similarity search, clustering, and retrieval applications.
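As a rough illustration of why such vectors suit similarity search, the sketch below compares vectors by cosine similarity. The three-dimensional vectors are made-up stand-ins rather than real model output, and the `cosine_similarity` helper is hypothetical, not part of furiosa_llm.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (assumed values, not model output).
paris = [0.9, 0.1, 0.0]
berlin = [0.8, 0.2, 0.1]
banana = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score close to 1.
print(f"paris vs berlin: {cosine_similarity(paris, berlin):.3f}")
print(f"paris vs banana: {cosine_similarity(paris, banana):.3f}")
```

With real embeddings the same comparison applies unchanged; only the vector length differs.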
The following example demonstrates single-prompt embedding, batch embedding, and prompt truncation with PoolingParams:

from furiosa_llm import LLM, PoolingParams

# Load an embedding model
with LLM("furiosa-ai/Qwen3-Embedding-8B") as llm:
    # ============================================================
    # Example 1: Single prompt embedding
    # ============================================================
    prompt = "What is the capital of France?"
    output = llm.embed(prompt)
    embedding = output[0].outputs.embedding
    print(f"Prompt: {prompt!r}")
    print(f"Embedding dimension: {len(embedding)}")
    print(f"Embedding (first 10 values): {embedding[:10]}")
    print("-" * 80)

    # ============================================================
    # Example 2: Batch embedding (multiple prompts)
    # ============================================================
    prompts = [
        "What is the capital of France?",
        "What is the capital of Germany?",
        "What is the capital of Italy?",
    ]
    outputs = llm.embed(prompts)
    for prompt, output in zip(prompts, outputs):
        embedding = output.outputs.embedding
        print(f"Prompt: {prompt!r}")
        print(f"Embedding dimension: {len(embedding)}")
        print(f"Embedding (first 5 values): {embedding[:5]}")
        print("-" * 80)

    # ============================================================
    # Example 3: Using PoolingParams for truncation
    # ============================================================
    # Truncate long prompts to fit within token limits
    pooling_params = PoolingParams(truncate_prompt_tokens=128)
    long_prompts = [
        "This is a very long text that might exceed the model's context window. " * 50,
        "Another lengthy document that needs to be truncated for processing. " * 50,
    ]
    outputs = llm.embed(long_prompts, pooling_params=pooling_params)
    for i, output in enumerate(outputs):
        embedding = output.outputs.embedding
        print(f"Long prompt {i}: embedding dimension = {len(embedding)}")

Server API Example
You can also generate embeddings through the OpenAI-compatible server:

import os

from openai import OpenAI

# Start server with: furiosa-llm serve path/to/embedding/model
base_url = os.getenv("OPENAI_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("OPENAI_API_KEY", "EMPTY")
client = OpenAI(base_url=base_url, api_key=api_key)

response = client.embeddings.create(
    model="embedding-model",
    input=["Text 1", "Text 2", "Text 3"],
)
for data in response.data:
    embedding = data.embedding
    print(f"Index {data.index}: {len(embedding)} dimensions")

See Embeddings API Reference for complete server API documentation.
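
For the retrieval use case, the vectors obtained from either path can be ranked against a query vector. The following is a minimal sketch, assuming the embeddings have already been collected into plain Python lists. The helpers `l2_normalize` and `rank_by_similarity` are hypothetical names introduced here for illustration, and the vectors are toy values rather than model output.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def rank_by_similarity(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (index, score) pairs, most similar document first."""
    query = l2_normalize(query_vec)
    scored = []
    for index, doc_vec in enumerate(doc_vecs):
        doc = l2_normalize(doc_vec)
        score = sum(q * d for q, d in zip(query, doc))
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumed values); in practice these would come from
# output.outputs.embedding or data.embedding as in the examples above.
doc_vecs = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
]
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs, top_k=2)
for index, score in ranking:
    print(f"Document {index}: score {score:.3f}")
```

Normalizing once and reusing the stored unit vectors avoids repeating the work per query; the sketch normalizes inline only to stay short.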