
Overview

Helicone’s caching feature stores and reuses responses from previous API requests, dramatically reducing costs and latency. When a cached response is returned, you avoid making a new request to the AI provider entirely.
Caching is particularly effective for:
  • Repeated queries (e.g., FAQ responses)
  • Static prompts with consistent outputs
  • Development and testing environments
  • High-traffic applications with common requests

Key Benefits

Cost Reduction

Save up to 90% on API costs by serving cached responses instead of making new requests

Lower Latency

Cached responses are returned instantly, reducing response times from seconds to milliseconds

Bucket Caching

Store multiple variations of responses for the same prompt to maintain output diversity

Flexible TTL

Control cache expiration with custom time-to-live settings from hours to months

Quick Start

Enable caching by adding the Helicone-Cache-Enabled header to your requests:
import { OpenAI } from "openai";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
  },
});

// Use withResponse() to access the raw HTTP response alongside the parsed completion
const { data: completion, response } = await client.chat.completions
  .create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  })
  .withResponse();

// Check whether the response was served from the cache
console.log(response.headers.get("Helicone-Cache")); // "HIT" or "MISS"
console.log(completion.choices[0].message.content);

Cache Headers

Basic Caching

Header | Value | Description
Helicone-Cache-Enabled | true | Enable both reading from and writing to cache
Helicone-Cache-Save | true | Only save responses to cache (don't read)
Helicone-Cache-Read | true | Only read from cache (don't save new responses)
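The save-only and read-only modes can also be applied per request rather than client-wide. The sketch below splits traffic into a cache-warming call and a read-only call using the OpenAI SDK's per-request options argument; the two-path setup itself is illustrative, not a required pattern:
import { OpenAI } from "openai";

// Client without cache defaults; cache behavior is chosen per request below
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
});

// Warm-up path: only write responses to the cache
await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  },
  { headers: { "Helicone-Cache-Save": "true" } }
);

// Serving path: only read from the cache, never store new responses
const cached = await client.chat.completions.create(
  {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  },
  { headers: { "Helicone-Cache-Read": "true" } }
);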

Advanced Options

Header | Value | Description
Cache-Control | max-age=3600 | Set cache TTL in seconds (default: 7 days, max: 365 days)
Helicone-Cache-Bucket-Max-Size | 5 | Number of response variations to cache (1-20)
Helicone-Cache-Seed | user-123 | Separate cache keys by custom identifier

Bucket Caching

Bucket caching stores multiple response variations for the same request, useful for non-deterministic outputs:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Bucket-Max-Size": "10",
  },
});

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Tell me a joke" }],
  temperature: 0.8, // Higher temperature = more variation
});
With bucket caching enabled:
  1. The first 10 requests generate new responses and fill the cache bucket
  2. Subsequent requests randomly select from the 10 cached responses
  3. Users experience variety without the cost of new API calls

Cache Key Generation

Helicone generates cache keys based on:
  • Request URL and endpoint
  • Request body (excluding ignored keys)
  • Authorization headers
  • Custom cache headers
  • Cache seed (if provided)
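As an illustration only, the sketch below hashes those same inputs into a single key. Helicone's actual key derivation is internal and may differ; the helper name and input shape are hypothetical:
import { createHash } from "crypto";

// Conceptual only: combine the inputs listed above into one digest.
// This is NOT Helicone's real algorithm, just a way to picture it.
function illustrativeCacheKey(req: {
  url: string;
  body: Record<string, unknown>;
  authHeader: string;
  seed?: string;
  ignoreKeys?: string[];
}): string {
  // Drop ignored fields from the body before hashing
  const filteredBody = Object.fromEntries(
    Object.entries(req.body).filter(([key]) => !(req.ignoreKeys ?? []).includes(key))
  );

  return createHash("sha256")
    .update(req.url)
    .update(JSON.stringify(filteredBody))
    .update(req.authHeader)
    .update(req.seed ?? "")
    .digest("hex");
}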

Ignoring Request Fields

Exclude fields from cache key generation to cache similar requests together:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Helicone-Cache-Ignore-Keys": JSON.stringify(["temperature", "max_tokens"]),
  },
});

Cache Seeds

Use cache seeds to segment caches by user, tenant, or environment:
const getCachedResponse = async (userId: string) => {
  const client = new OpenAI({
    baseURL: "https://ai-gateway.helicone.ai",
    apiKey: process.env.HELICONE_API_KEY,
    defaultHeaders: {
      "Helicone-Cache-Enabled": "true",
      "Helicone-Cache-Seed": userId,
    },
  });

  return await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Get my personalized summary" }],
  });
};

TTL Configuration

Control how long responses stay cached:
const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=86400", // 24 hours
  },
});
Common TTL values:
  • max-age=3600 - 1 hour
  • max-age=86400 - 1 day
  • max-age=604800 - 1 week (default)
  • max-age=2592000 - 30 days

Cache Response Headers

Helicone adds headers to indicate cache status:
Header | Value | Description
Helicone-Cache | HIT or MISS | Whether the response was served from cache
Helicone-Cache-Bucket-Idx | 3 | Which bucket index was used (for bucket caching)
Helicone-Cache-Latency | 1523 | Original response latency in milliseconds
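These headers live on the raw HTTP response rather than the parsed completion. Reusing the client from the Quick Start, a minimal sketch for reading them with the SDK's withResponse() helper (the logging logic is illustrative):
const { data, response } = await client.chat.completions
  .create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "What is machine learning?" }],
  })
  .withResponse();

const cacheStatus = response.headers.get("Helicone-Cache"); // "HIT" or "MISS"
if (cacheStatus === "HIT") {
  // Only present on cache hits; bucket index applies when bucket caching is enabled
  console.log("Bucket index:", response.headers.get("Helicone-Cache-Bucket-Idx"));
  console.log("Original latency (ms):", response.headers.get("Helicone-Cache-Latency"));
}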

Best Practices

Set shorter TTLs for dynamic content and longer TTLs for static responses:
  • FAQ responses: 7-30 days
  • News summaries: 1-6 hours
  • Product descriptions: 7+ days
  • Embeddings: 30+ days
For creative content or varied responses, use bucket caching with higher temperatures to maintain diversity while reducing costs.
Track the Helicone-Cache header in your logs to measure cache effectiveness and optimize your caching strategy.
Use cache seeds to separate caches for different users, tenants, or environments when responses should be personalized.
Use shorter TTLs and smaller bucket sizes during development to iterate quickly on cache configurations.
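As one way to do this, the sketch below switches cache defaults on NODE_ENV; the specific TTL and bucket values are assumptions, not Helicone recommendations:
import { OpenAI } from "openai";

const isDev = process.env.NODE_ENV !== "production";

const client = new OpenAI({
  baseURL: "https://ai-gateway.helicone.ai",
  apiKey: process.env.HELICONE_API_KEY,
  defaultHeaders: {
    "Helicone-Cache-Enabled": "true",
    // Short TTL and a single-entry bucket while iterating; longer-lived cache in production
    "Cache-Control": isDev ? "max-age=300" : "max-age=604800",
    "Helicone-Cache-Bucket-Max-Size": isDev ? "1" : "5",
  },
});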

Limitations

  • Maximum cache TTL: 365 days
  • Maximum bucket size: 20 responses
  • Cache timeout: 2 seconds (falls back to fresh request if cache is slow)
  • Streaming responses are cached after completion
