Overview
Helicone’s caching feature stores and reuses responses from previous API requests, dramatically reducing costs and latency. When a cached response is returned, you avoid making a new request to the AI provider entirely.
Caching is particularly effective for:
- Repeated queries (e.g., FAQ responses)
- Static prompts with consistent outputs
- Development and testing environments
- High-traffic applications with common requests
Key Benefits
Cost Reduction
Save up to 90% on API costs by serving cached responses instead of making new requests
Lower Latency
Cached responses are returned instantly, reducing response times from seconds to milliseconds
Bucket Caching
Store multiple variations of responses for the same prompt to maintain output diversity
Flexible TTL
Control cache expiration with custom time-to-live settings from hours to months
Quick Start
Enable caching by adding the `Helicone-Cache-Enabled` header to your requests:
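A minimal TypeScript sketch of the quick start: the gateway URL follows Helicone's documented OpenAI proxy setup, while the model name and environment variable names are placeholders you should adapt.

```typescript
// Sketch: enabling Helicone caching on a chat completion request.
// OPENAI_API_KEY / HELICONE_API_KEY and the model name are placeholders.
const headers: Record<string, string> = {
  "Content-Type": "application/json",
  "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
  "Helicone-Auth": `Bearer ${process.env.HELICONE_API_KEY}`,
  "Helicone-Cache-Enabled": "true", // read from and write to the cache
};

async function cachedCompletion(prompt: string): Promise<unknown> {
  const res = await fetch("https://oai.helicone.ai/v1/chat/completions", {
    method: "POST",
    headers,
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```

Identical requests after the first will be served from Helicone's cache instead of reaching the provider.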
Cache Headers
Basic Caching
| Header | Value | Description |
|---|---|---|
| `Helicone-Cache-Enabled` | `true` | Enable both reading from and writing to cache |
| `Helicone-Cache-Save` | `true` | Only save responses to cache (don’t read) |
| `Helicone-Cache-Read` | `true` | Only read from cache (don’t save new responses) |
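One way to use the save-only and read-only headers above, as a sketch: a background warming job writes responses to the cache, while latency-sensitive production traffic only reads from it.

```typescript
// Splitting cache reads and writes with the headers above.
// A warming job writes to the cache but never serves stale reads:
const warmingHeaders: Record<string, string> = {
  "Helicone-Cache-Save": "true", // save responses to cache, don't read
};

// Production traffic serves from cache but never overwrites entries:
const servingHeaders: Record<string, string> = {
  "Helicone-Cache-Read": "true", // read from cache, don't save
};
```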
Advanced Options
| Header | Value | Description |
|---|---|---|
| `Cache-Control` | `max-age=3600` | Set cache TTL in seconds (default: 7 days, max: 365 days) |
| `Helicone-Cache-Bucket-Max-Size` | `5` | Number of response variations to cache (1-20) |
| `Helicone-Cache-Seed` | `user-123` | Separate cache keys by custom identifier |
Bucket Caching
Bucket caching stores multiple response variations for the same request, which is useful for non-deterministic outputs.
With bucket caching enabled (for example, with a bucket size of 10):
- The first 10 requests generate new responses and fill the cache bucket
- Subsequent requests randomly select from the 10 cached responses
- Users experience variety without the cost of new API calls
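The scenario above can be configured with the bucket header from the table, as a sketch:

```typescript
// Bucket caching configuration matching the example above:
// a bucket of 10 response variations for the same prompt.
const bucketHeaders: Record<string, string> = {
  "Helicone-Cache-Enabled": "true",
  "Helicone-Cache-Bucket-Max-Size": "10", // cache up to 10 variations
};
```

Once the bucket is full, each cached request randomly returns one of the 10 stored responses.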
Cache Key Generation
Helicone generates cache keys based on:
- Request URL and endpoint
- Request body (excluding ignored keys)
- Authorization headers
- Custom cache headers
- Cache seed (if provided)
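As a conceptual illustration only (this is not Helicone's actual key algorithm), a cache key derived from the inputs listed above might look like:

```typescript
import { createHash } from "node:crypto";

// Conceptual sketch: hash the same inputs the docs list into one key.
// NOT Helicone's actual implementation, just an illustration.
function conceptualCacheKey(input: {
  url: string;
  body: unknown;
  authHeader: string;
  cacheHeaders?: Record<string, string>;
  seed?: string;
}): string {
  const material = JSON.stringify([
    input.url,
    input.body,          // ignored keys would be stripped before this step
    input.authHeader,
    input.cacheHeaders ?? {},
    input.seed ?? "",
  ]);
  return createHash("sha256").update(material).digest("hex");
}
```

Two requests that agree on all of these inputs map to the same key; changing the seed (or any other input) yields a different key and therefore a separate cache entry.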
Ignoring Request Fields
Exclude fields from cache key generation to cache similar requests together.
Cache Seeds
Use cache seeds to segment caches by user, tenant, or environment.
TTL Configuration
Control how long responses stay cached:
- `max-age=3600`: 1 hour
- `max-age=86400`: 1 day
- `max-age=604800`: 1 week (default)
- `max-age=2592000`: 30 days
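Combining the TTL and seed headers from the tables above, a per-user cache with a one-hour TTL could be sketched as follows (`userId` is a placeholder for whatever identifier you segment by):

```typescript
// Per-user cache segment with a 1-hour TTL.
function cacheHeadersFor(userId: string): Record<string, string> {
  return {
    "Helicone-Cache-Enabled": "true",
    "Cache-Control": "max-age=3600",  // expire cached entries after 1 hour
    "Helicone-Cache-Seed": userId,    // one cache segment per user
  };
}
```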
Cache Response Headers
Helicone adds headers to indicate cache status:

| Header | Value | Description |
|---|---|---|
| `Helicone-Cache` | `HIT` or `MISS` | Whether the response was served from cache |
| `Helicone-Cache-Bucket-Idx` | `3` | Which bucket index was used (for bucket caching) |
| `Helicone-Cache-Latency` | `1523` | Original response latency in milliseconds |
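These response headers can be inspected on any `fetch` `Response` object, for example to measure hit rates; a small sketch:

```typescript
// Inspect Helicone's cache response headers on a fetch Response.
function cacheInfo(res: Response): { hit: boolean; bucketIdx: string | null } {
  return {
    hit: res.headers.get("Helicone-Cache") === "HIT",
    bucketIdx: res.headers.get("Helicone-Cache-Bucket-Idx"), // null unless bucket caching
  };
}
```

Logging `cacheInfo(res).hit` over time gives you the cache hit rate discussed under Best Practices below.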
Best Practices
Use appropriate TTLs
Set shorter TTLs for dynamic content and longer TTLs for static responses:
- FAQ responses: 7-30 days
- News summaries: 1-6 hours
- Product descriptions: 7+ days
- Embeddings: 30+ days
Leverage bucket caching for variety
For creative content or varied responses, use bucket caching with higher temperatures to maintain diversity while reducing costs.
Monitor cache hit rates
Track the `Helicone-Cache` header in your logs to measure cache effectiveness and optimize your caching strategy.
Segment caches appropriately
Use cache seeds to separate caches for different users, tenants, or environments when responses should be personalized.
Test cache behavior in development
Use shorter TTLs and smaller bucket sizes during development to iterate quickly on cache configurations.
Limitations
- Maximum cache TTL: 365 days
- Maximum bucket size: 20 responses
- Cache timeout: 2 seconds (falls back to fresh request if cache is slow)
- Streaming responses are cached after completion
Related Features
Rate Limiting
Control API usage with custom rate limits
Cost Tracking
Monitor spending and savings from caching
