
Overview

Helicone’s caching feature stores and reuses responses from previous API requests, dramatically reducing costs and latency. When a cached response is returned, you avoid making a new request to the AI provider entirely.
Caching is particularly effective for:
  • Repeated queries (e.g., FAQ responses)
  • Static prompts with consistent outputs
  • Development and testing environments
  • High-traffic applications with common requests

Key Benefits

Cost Reduction

Save up to 90% on API costs by serving cached responses instead of making new requests

Lower Latency

Cached responses are returned instantly, reducing response times from seconds to milliseconds

Bucket Caching

Store multiple variations of responses for the same prompt to maintain output diversity

Flexible TTL

Control cache expiration with custom time-to-live settings from hours to months

Quick Start

Enable caching by adding the Helicone-Cache-Enabled header to your requests:
import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
  },
});

// Use withResponse() to access the raw HTTP response alongside the parsed completion
const { data: completion, response } = await client.chat.completions
  .create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  })
  .withResponse();

// Check whether the response was served from the cache
console.log(response.headers.get("Helicone-Cache")); // "HIT" or "MISS"
console.log(completion.choices[0].message.content);

Cache Headers

Basic Caching

Header | Value | Description
Helicone-Cache-Enabled | true | Enable both reading from and writing to cache
Helicone-Cache-Save | true | Only save responses to cache (don't read)
Helicone-Cache-Read | true | Only read from cache (don't save new responses)
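The save-only and read-only modes can also be applied per request rather than client-wide. The sketch below splits traffic into a cache-warming call and a read-only call using the OpenAI SDK's per-request options argument; the two-path setup itself is illustrative, not a required pattern:
import { OpenAI } from "openai";

// Client without cache defaults; cache behavior is chosen per request below
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

// Warm-up path: only write responses to the cache
await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  },
  { headers: { "Helicone-Cache-Save": "true" } }
);

// Serving path: only read from the cache, never store new responses
const cached = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  },
  { headers: { "Helicone-Cache-Read": "true" } }
);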

Advanced Options

Header | Value | Description
Cache-Control | max-age=3600 | Set cache TTL in seconds (default: 7 days, max: 365 days)
Helicone-Cache-Bucket-Max-Size | 5 | Number of response variations to cache (1-20)
Helicone-Cache-Seed | user-123 | Separate cache keys by custom identifier

Bucket Caching

Bucket caching stores multiple response variations for the same request, useful for non-deterministic outputs:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Bucket-Max-Size": "10",
  },
});

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Tell me a joke" }],
  temperature: 0.8, // Higher temperature = more variation
});
With bucket caching enabled:
  1. The first 10 requests generate new responses and fill the cache bucket
  2. Subsequent requests randomly select from the 10 cached responses
  3. Users experience variety without the cost of new API calls

Cache Key Generation

Helicone generates cache keys based on:
  • Request URL and endpoint
  • Request body (excluding ignored keys)
  • Authorization headers
  • Custom cache headers
  • Cache seed (if provided)
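As an illustration only, the sketch below hashes those same inputs into a single key. Helicone's actual key derivation is internal and may differ; the helper name and input shape are hypothetical:
import { createHash } from "crypto";

// Conceptual only: combine the inputs listed above into one digest.
// This is NOT Helicone's real algorithm, just a way to picture it.
function illustrativeCacheKey(req: {
  url: string;
  body: Record<string, unknown>;
  authHeader: string;
  seed?: string;
  ignoreKeys?: string[];
}): string {
  // Drop ignored fields from the body before hashing
  const filteredBody = Object.fromEntries(
    Object.entries(req.body).filter(([key]) => !(req.ignoreKeys ?? []).includes(key))
  );

  return createHash("sha256")
    .update(req.url)
    .update(JSON.stringify(filteredBody))
    .update(req.authHeader)
    .update(req.seed ?? "")
    .digest("hex");
}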

Ignoring Request Fields

Exclude fields from cache key generation to cache similar requests together:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Ignore-Keys": JSON.stringify(["temperature", "max_tokens"]),
  },
});

Cache Seeds

Use cache seeds to segment caches by user, tenant, or environment:
const getCachedResponse = async (userId: string) => {
  const client = new OpenAI({
    baseURL: "https://ai-gateway.helicone.ai",
    apiKey: process.env.HELICONE_API_KEY,
    defaultHeaders: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Seed": userId,
    },
  });

  return await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Get my personalized summary" }],
  });
};

TTL Configuration

Control how long responses stay cached:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400", // 24 hours
  },
});
Common TTL values:
  • max-age=3600 - 1 hour
  • max-age=86400 - 1 day
  • max-age=604800 - 1 week (default)
  • max-age=2592000 - 30 days

Cache Response Headers

Helicone adds headers to indicate cache status:
Header | Value | Description
Helicone-Cache | HIT or MISS | Whether the response was served from cache
Helicone-Cache-Bucket-Idx | 3 | Which bucket index was used (for bucket caching)
Helicone-Cache-Latency | 1523 | Original response latency in milliseconds
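These headers live on the raw HTTP response rather than the parsed completion. Reusing the client from the Quick Start, a minimal sketch for reading them with the SDK's withResponse() helper (the logging logic is illustrative):
const { data, response } = await client.chat.completions
  .create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  })
  .withResponse();

const cacheStatus = response.headers.get("Helicone-Cache"); // "HIT" or "MISS"
if (cacheStatus === "HIT") {
  // Only present on cache hits; bucket index applies when bucket caching is enabled
  console.log("Bucket index:", response.headers.get("Helicone-Cache-Bucket-Idx"));
  console.log("Original latency (ms):", response.headers.get("Helicone-Cache-Latency"));
}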

Best Practices

Set shorter TTLs for dynamic content and longer TTLs for static responses:
  • FAQ responses: 7-30 days
  • News summaries: 1-6 hours
  • Product descriptions: 7+ days
  • Embeddings: 30+ days
For creative content or varied responses, use bucket caching with higher temperatures to maintain diversity while reducing costs.
Track the Helicone-Cache header in your logs to measure cache effectiveness and optimize your caching strategy.
Use cache seeds to separate caches for different users, tenants, or environments when responses should be personalized.
Use shorter TTLs and smaller bucket sizes during development to iterate quickly on cache configurations.
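As one way to do this, the sketch below switches cache defaults on NODE_ENV; the specific TTL and bucket values are assumptions, not Helicone recommendations:
import { OpenAI } from "openai";

const isDev = process.env.NODE_ENV !== "production";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    // Short TTL and a single-entry bucket while iterating; longer-lived cache in production
    "Cache-Control": isDev ? "max-age=300" : "max-age=604800",
    "Helicone-Cache-Bucket-Max-Size": isDev ? "1" : "5",
  },
});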

Limitations

  • Maximum cache TTL: 365 days
  • Maximum bucket size: 20 responses
  • Cache timeout: 2 seconds (falls back to fresh request if cache is slow)
  • Streaming responses are cached after completion
