POST /v1/request/{requestId}/score

Add Scores to Request
curl --request POST \
  --url https://api.helicone.ai/v1/request/{requestId}/score \
  --header 'Authorization: Bearer <HELICONE_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
  "scores": {}
}'
{
  "data": null,
  "error": null
}

Add evaluation scores to a request to track detailed quality metrics beyond simple thumbs up/down feedback. Scores allow you to measure specific dimensions of LLM outputs like accuracy, relevance, helpfulness, and custom evaluation criteria.
Scores support both integer and boolean values. Integer scores are stored as-is, while boolean values are converted to 1 (true) or 0 (false).

Path Parameters

requestId
string
required
The unique identifier of the request to add scores to. This can be found in the Helicone-Id response header when making requests through Helicone. Example: req_abc123def456
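As a minimal sketch, capturing the id from the response headers looks like this (`getHeliconeId` is an illustrative helper, not part of any SDK; header lookups are case-insensitive):

```typescript
// Illustrative helper: read the request id Helicone attaches to
// proxied responses via the Helicone-Id header.
function getHeliconeId(headers: Headers): string | null {
  // Headers.get is case-insensitive, so "Helicone-Id" and
  // "helicone-id" resolve to the same value.
  return headers.get("helicone-id");
}
```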

Request Body

scores
object
required
An object containing score key-value pairs. Each key is the score name, and each value is either an integer or boolean. Supported value types:
  • number - Must be an integer (floats are not supported)
  • boolean - Converted to 1 (true) or 0 (false)
Example:
{
  "scores": {
    "accuracy": 95,
    "relevance": 87,
    "helpfulness": 92,
    "has_citations": true,
    "is_factual": true
  }
}
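The boolean-to-integer conversion happens server-side, but it can be mirrored client-side if you want uniform numeric values in your own logs. A small sketch (`normalizeScores` is an assumed helper, not a Helicone API):

```typescript
// Mirror the documented rule locally: booleans become 1/0,
// integers pass through unchanged.
function normalizeScores(
  scores: Record<string, number | boolean>
): Record<string, number> {
  const out: Record<string, number> = {};
  for (const [key, value] of Object.entries(scores)) {
    out[key] = typeof value === "boolean" ? (value ? 1 : 0) : value;
  }
  return out;
}
```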

Response

data
null
Returns null on success.
error
string | null
Error message if the request failed.
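Based on the schema above (data is null on success, error is a string on failure), a caller can treat a non-null error as the failure signal. A hedged sketch, with `assertScoreSuccess` as a hypothetical helper:

```typescript
// Shape of the endpoint's response envelope as documented above.
interface ScoreResponse {
  data: null;
  error: string | null;
}

// Throw if the score submission failed; otherwise return normally.
function assertScoreSuccess(body: ScoreResponse): void {
  if (body.error !== null) {
    throw new Error(`Failed to add scores: ${body.error}`);
  }
}
```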

Examples

Add Basic Scores

Add quality scores to a request:
cURL
curl --request POST \
  --url https://api.helicone.ai/v1/request/req_abc123def456/score \
  --header 'Authorization: Bearer <HELICONE_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
  "scores": {
    "accuracy": 95,
    "relevance": 87,
    "helpfulness": 92
  }
}'
TypeScript
const requestId = 'req_abc123def456';

const response = await fetch(
  `https://api.helicone.ai/v1/request/${requestId}/score`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      scores: {
        accuracy: 95,
        relevance: 87,
        helpfulness: 92
      }
    })
  }
);

const result = await response.json();
console.log(result.error ? `Error: ${result.error}` : 'Scores added successfully');
Python
import os
import requests

request_id = "req_abc123def456"

response = requests.post(
    f"https://api.helicone.ai/v1/request/{request_id}/score",
    headers={
        "Authorization": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Content-Type": "application/json"
    },
    json={
        "scores": {
            "accuracy": 95,
            "relevance": 87,
            "helpfulness": 92
        }
    }
)

response.raise_for_status()
result = response.json()
print("Scores added successfully")

Add Mixed Score Types

Combine integer and boolean scores:
cURL
curl --request POST \
  --url https://api.helicone.ai/v1/request/req_abc123def456/score \
  --header 'Authorization: Bearer <HELICONE_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{
  "scores": {
    "overall_quality": 88,
    "coherence": 92,
    "has_citations": true,
    "is_factual": true,
    "contains_errors": false
  }
}'
TypeScript
const response = await fetch(
  `https://api.helicone.ai/v1/request/${requestId}/score`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      scores: {
        overall_quality: 88,
        coherence: 92,
        has_citations: true,
        is_factual: true,
        contains_errors: false
      }
    })
  }
);

Use Cases

Automated LLM-as-Judge Evaluation

Use an LLM to evaluate another LLM’s output:
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'https://gateway.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`
  }
});

// Make the initial request
const { data, response } = await client.chat.completions
  .create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: 'Explain quantum computing' }]
  })
  .withResponse();

const requestId = response.headers.get('helicone-id');
const responseText = data.choices[0].message.content;

// Use GPT-4 to evaluate the response
const evaluation = await client.chat.completions.create({
  model: 'gpt-4',
  messages: [
    {
      role: 'system',
      content: `Evaluate this response on a scale of 0-100 for:
      - Accuracy: How factually correct is the information?
      - Clarity: How easy is it to understand?
      - Completeness: Does it fully answer the question?
      
      Return ONLY a JSON object with these scores.`
    },
    {
      role: 'user',
      content: `Response to evaluate: ${responseText}`
    }
  ],
  response_format: { type: 'json_object' }
});

const scores = JSON.parse(evaluation.choices[0].message.content);

// Add scores to the original request
await fetch(
  `https://api.helicone.ai/v1/request/${requestId}/score`,
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ scores })
  }
);

Human Evaluation Workflow

Collect detailed human evaluations:
interface EvaluationForm {
  accuracy: number;
  relevance: number;
  helpfulness: number;
  tone: number;
  followedInstructions: boolean;
  containedErrors: boolean;
}

const submitEvaluation = async (
  requestId: string,
  evaluation: EvaluationForm
) => {
  await fetch(
    `https://api.helicone.ai/v1/request/${requestId}/score`,
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        scores: {
          accuracy: evaluation.accuracy,
          relevance: evaluation.relevance,
          helpfulness: evaluation.helpfulness,
          tone: evaluation.tone,
          followed_instructions: evaluation.followedInstructions,
          contained_errors: evaluation.containedErrors
        }
      })
    }
  );
};

// Usage in evaluation UI
const handleEvaluationSubmit = async (formData: EvaluationForm) => {
  await submitEvaluation('req_abc123def456', formData);
  console.log('Evaluation submitted successfully');
};

Automated Quality Checks

Implement automated quality scoring:
const evaluateResponse = (responseText: string) => {
  const scores = {
    word_count: responseText.trim().split(/\s+/).length,
    has_citations: /\[\d+\]|\(\d{4}\)/.test(responseText),
    has_code_examples: responseText.includes('```'),
    starts_with_greeting: /^(hello|hi|hey)/i.test(responseText),
    exceeds_min_length: responseText.length > 100,
    contains_markdown: /[#*_`]/.test(responseText)
  };
  
  return scores;
};

const scoreRequest = async (requestId: string, responseText: string) => {
  const scores = evaluateResponse(responseText);
  
  await fetch(
    `https://api.helicone.ai/v1/request/${requestId}/score`,
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ scores })
    }
  );
};

Multi-Criteria RAG Evaluation

Evaluate RAG (Retrieval-Augmented Generation) systems:
const evaluateRAG = async (
  requestId: string,
  response: string,
  context: string[],
  userQuery: string
) => {
  // Evaluate different aspects of RAG quality
  const scores = {
    // Answer quality
    answer_relevance: await scoreAnswerRelevance(response, userQuery),
    answer_completeness: await scoreCompleteness(response, userQuery),
    
    // Context quality
    context_relevance: await scoreContextRelevance(context, userQuery),
    context_precision: await scoreContextPrecision(context, response),
    
    // Faithfulness
    faithfulness: await scoreFaithfulness(response, context),
    has_hallucinations: await detectHallucinations(response, context),
    
    // Additional checks
    uses_all_context: checkContextUsage(response, context),
    citations_provided: response.includes('[') && response.includes(']')
  };
  
  await fetch(
    `https://api.helicone.ai/v1/request/${requestId}/score`,
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ scores })
    }
  );
  
  return scores;
};

Comparative Evaluation (A/B Testing)

Compare different models or prompts:
const runComparison = async (userQuery: string) => {
  // Test variant A
  const { data: dataA, response: responseA } = await client.chat.completions
    .create(
      {
        model: 'gpt-4',
        messages: [{ role: 'user', content: userQuery }],
        temperature: 0.7
      },
      {
        headers: {
          'Helicone-Property-Variant': 'A',
          'Helicone-Property-Temperature': '0.7'
        }
      }
    )
    .withResponse();
  
  const requestIdA = responseA.headers.get('helicone-id');
  
  // Test variant B
  const { data: dataB, response: responseB } = await client.chat.completions
    .create(
      {
        model: 'gpt-4',
        messages: [{ role: 'user', content: userQuery }],
        temperature: 0.3
      },
      {
        headers: {
          'Helicone-Property-Variant': 'B',
          'Helicone-Property-Temperature': '0.3'
        }
      }
    )
    .withResponse();
  
  const requestIdB = responseB.headers.get('helicone-id');
  
  // Evaluate both
  const scoresA = await evaluateLLMResponse(dataA.choices[0].message.content);
  const scoresB = await evaluateLLMResponse(dataB.choices[0].message.content);
  
  // Add scores
  await Promise.all([
    fetch(`https://api.helicone.ai/v1/request/${requestIdA}/score`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ scores: scoresA })
    }),
    fetch(`https://api.helicone.ai/v1/request/${requestIdB}/score`, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ scores: scoresB })
    })
  ]);
};

Custom Evaluation Framework

Build a reusable evaluation framework:
class LLMEvaluator {
  private apiKey: string;
  
  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }
  
  async evaluate(
    requestId: string,
    response: string,
    criteria: string[]
  ): Promise<Record<string, number>> {
    const scores: Record<string, number> = {};
    
    // Run each evaluation criterion
    for (const criterion of criteria) {
      scores[criterion] = await this.evaluateCriterion(
        response,
        criterion
      );
    }
    
    // Submit scores
    await fetch(
      `https://api.helicone.ai/v1/request/${requestId}/score`,
      {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${this.apiKey}`,
          'Content-Type': 'application/json'
        },
        body: JSON.stringify({ scores })
      }
    );
    
    return scores;
  }
  
  private async evaluateCriterion(
    response: string,
    criterion: string
  ): Promise<number> {
    // Implement your evaluation logic here
    // This could use another LLM, heuristics, or external APIs
    return 0;
  }
}

// Usage
const evaluator = new LLMEvaluator(process.env.HELICONE_API_KEY!);
const scores = await evaluator.evaluate(
  'req_abc123def456',
  responseText,
  ['accuracy', 'clarity', 'completeness', 'tone']
);

Querying by Scores

Query requests based on score values:
// Find all high-quality responses (accuracy > 90)
const response = await fetch(
  'https://api.helicone.ai/v1/request/query',
  {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.HELICONE_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      filter: {
        request_response_rmt: {
          scores: {
            accuracy: {
              gte: 90
            }
          }
        }
      },
      limit: 100
    })
  }
);

Score Value Constraints

Important constraints:
  • Only integer values are supported (no decimals/floats)
  • Boolean values are automatically converted to 1 (true) or 0 (false)
  • Score keys should be descriptive and consistent across your application
// ✅ Valid scores
{
  "accuracy": 95,           // Integer
  "has_citations": true,    // Boolean (converted to 1)
  "contains_errors": false  // Boolean (converted to 0)
}

// ❌ Invalid scores
{
  "accuracy": 95.5,         // Float - will cause error
  "score": "high"           // String - will cause error
}
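If you want to catch constraint violations before the request leaves your application, a local pre-flight check is one option. A sketch under the constraints listed above (`validateScores` is an assumed helper, not part of Helicone):

```typescript
// Return a list of problems; an empty list means the scores
// object satisfies the documented constraints (integers or booleans only).
function validateScores(scores: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [key, value] of Object.entries(scores)) {
    if (typeof value === "boolean") continue;
    if (typeof value === "number" && Number.isInteger(value)) continue;
    problems.push(`"${key}" must be an integer or boolean, got ${JSON.stringify(value)}`);
  }
  return problems;
}
```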

Best Practices

  • Consistent Naming: Use consistent score names across your evaluation workflows
  • Integer Values: Always use integers for numeric scores (0-100 scale is common)
  • Boolean Flags: Use booleans for yes/no criteria (presence of citations, factual accuracy, etc.)
  • Multiple Dimensions: Track multiple aspects of quality for comprehensive evaluation
  • Automated + Human: Combine automated scoring with periodic human evaluation
  • Threshold Alerts: Set up monitoring for scores below certain thresholds
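For the threshold-alert practice, the check itself is simple once scores are in hand (for example, pulled back via the query endpoint shown earlier). A minimal sketch; `scoresBelowThreshold` is an illustrative helper, not a Helicone API:

```typescript
// Flag score names whose values fall below a minimum acceptable floor,
// e.g. to trigger an alert or route the request for human review.
function scoresBelowThreshold(
  scores: Record<string, number>,
  threshold: number
): string[] {
  return Object.entries(scores)
    .filter(([, value]) => value < threshold)
    .map(([name]) => name);
}
```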

Related

  • Get Request by ID - Retrieve a request with all its scores
  • Query Requests - Query requests filtered by scores
  • Add Feedback - Add simple thumbs up/down feedback
  • Add Properties - Add custom properties to requests