
Features

Model Evaluation in Foundation Studio provides automated, framework-specific scoring of model outputs using an LLM-as-judge approach — giving you structured quality signals without manual annotation.


1. Evaluation Frameworks

Model Evaluation supports four task-specific frameworks, each scoring outputs against a fixed set of four quality metrics. The evaluator LLM (Llama 3.1 8B Instruct) scores each row in your dataset against the metrics defined by the chosen framework.
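Conceptually, the LLM-as-judge loop can be sketched as below. The `judge` function is a stub standing in for the evaluator LLM call, which Foundation Studio performs server-side; the function and variable names are illustrative, not platform APIs, and the metric names are taken from the Text Summarization framework described in this section.

```python
# Minimal sketch of an LLM-as-judge evaluation loop. judge() is a stub
# standing in for the evaluator LLM (Llama 3.1 8B Instruct), which
# Foundation Studio runs server-side; all names here are illustrative.

SUMMARIZATION_METRICS = ["coherence", "conciseness", "hallucination", "informativeness"]

def judge(source: str, output: str, metric: str) -> float:
    """Placeholder for the evaluator LLM: a real judge would prompt the
    model to rate `output` against `source` on one metric and parse a
    numeric score from the response."""
    return 1.0  # stub score

def evaluate_dataset(rows, metrics=SUMMARIZATION_METRICS):
    """Score every row in the dataset against every metric of the framework."""
    results = []
    for i, row in enumerate(rows):
        scores = {m: judge(row["source"], row["output"], m) for m in metrics}
        results.append({"row_id": i, **scores})
    return results

rows = [{"source": "Full article text...", "output": "A short summary."}]
print(evaluate_dataset(rows))
```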

Text Summarization

Designed for models that generate summaries of source documents.

| Metric | What it measures |
| --- | --- |
| Coherence | Logical flow and clarity of the summary |
| Conciseness | Brevity while retaining core meaning |
| Hallucination | Fabricated or unsupported information not present in the source |
| Informativeness | Relevance and usefulness of captured content |

General Assistant

Designed for open-ended conversational or instruction-following models.

| Metric | What it measures |
| --- | --- |
| Relevance | Contextual appropriateness of the response |
| Consistency | Logical coherence and internal consistency |
| Bias | Skewed, unfair, or one-sided content |
| Toxicity | Offensive or inappropriate content |

Question Answering

Designed for models that answer factual or context-based questions.

| Metric | What it measures |
| --- | --- |
| Completeness | Whether the answer fully addresses the question |
| Correctness | Factual accuracy of the answer |
| Precision | Specificity and exactness of the answer |
| Toxicity | Offensive content in the response |

Text Classification

Designed for models that label or categorize text inputs.

| Metric | What it measures |
| --- | --- |
| Accuracy | Percentage of correctly classified examples |
| Precision | Ratio of true positives among predicted positives |
| Recall | Ability to identify all relevant instances |
| Consistency | Reliability across similar inputs |
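Three of these metrics have standard definitions that can be computed directly from labels; a minimal sketch for a binary task is shown below (Consistency, by contrast, is judged by the evaluator LLM). The function and label names are illustrative only.

```python
# Minimal sketch of the standard classification metrics for a binary task.
# Consistency is not computed from labels; it is judged by the evaluator LLM.

def classification_metrics(y_true, y_pred, positive="spam"):
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))          # true positives
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))  # false negatives
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

y_true = ["spam", "ham", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "ham"]
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5}
```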

2. Model Evaluation Prompt

An optional free-text prompt that provides additional context or instructions to the evaluator model. Use this to:

  • Focus the evaluation on specific quality aspects
  • Add domain context (e.g. "This model is used for medical Q&A")
  • Instruct the evaluator to penalize specific failure modes

Example:

Please ensure that the summarization does not introduce fabricated details.
Penalize heavily for hallucinations even if the summary is otherwise concise.
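To illustrate where this free-text prompt fits, here is a hypothetical sketch of how a custom prompt might be merged into the judge's instructions for a single metric. This is not Foundation Studio's actual prompt template, which is internal; every name and string below is an assumption.

```python
# Hypothetical illustration of combining a custom evaluation prompt with a
# framework metric when querying the judge model. Not the platform's actual
# template; the layout and scoring scale here are assumptions.

CUSTOM_PROMPT = (
    "Please ensure that the summarization does not introduce fabricated details. "
    "Penalize heavily for hallucinations even if the summary is otherwise concise."
)

def build_judge_prompt(metric: str, source: str, output: str, custom: str = CUSTOM_PROMPT) -> str:
    """Assemble one judge query: metric to score, extra user instructions,
    then the source text and the model output under evaluation."""
    return (
        f"You are evaluating a model output on the metric: {metric}.\n"
        f"Additional instructions: {custom}\n\n"
        f"Source:\n{source}\n\n"
        f"Output:\n{output}\n\n"
        "Return a single numeric score."
    )

print(build_judge_prompt("hallucination", "Full article text...", "A short summary."))
```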

3. Results Storage

Evaluation results are automatically written to an EOS bucket of your choice.

  • Results are written to a folder named after your job (e.g. tir-job-12181424077) at the root of the selected dataset bucket.
  • The Evaluation Results tab on the job detail page provides an in-platform view of the scores.
  • Download raw result files directly from your EOS bucket for further analysis.
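Once downloaded, the raw result files can be aggregated locally. The JSON layout in this sketch is hypothetical; inspect your actual result files in the EOS bucket for the real schema before adapting it.

```python
# Sketch of aggregating per-row judge scores into per-metric averages.
# The JSON structure below is a hypothetical example of a result file,
# not Foundation Studio's actual output schema.
import json
from statistics import mean

raw = '[{"row": 0, "coherence": 4, "conciseness": 5},' \
      ' {"row": 1, "coherence": 3, "conciseness": 4}]'

def average_scores(rows, metrics):
    """Return the mean score for each requested metric across all rows."""
    return {m: mean(r[m] for r in rows) for m in metrics}

rows = json.loads(raw)
print(average_scores(rows, ["coherence", "conciseness"]))
# {'coherence': 3.5, 'conciseness': 4.5}
```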

4. Job Monitoring

Overview

Shows job configuration, assigned resources, current status, and metadata.

Events

Tracks pod lifecycle events: scheduling, container initialization, and termination. Useful for diagnosing startup failures.

Logs

Real-time logs from the evaluation job. Use these to monitor progress and diagnose data loading or scoring errors.

Evaluation Results

Displays a structured breakdown of scores across the four metrics of the selected framework. Each metric is scored by the evaluator model for every evaluated row.


5. Job Actions

| Action | When to use | State required |
| --- | --- | --- |
| Retry | Re-run a job that ended in a failed state | Failed |
| Terminate | Stop a job that is currently running | Running |
| Delete | Remove a job and its metadata permanently | Any state |

Retry restarts the job with its original settings; no reconfiguration is needed.

Delete removes the job record from Foundation Studio but does not delete results stored in your EOS bucket.