Features
Model Evaluation in Foundation Studio provides automated, framework-specific scoring of model outputs using an LLM-as-judge approach — giving you structured quality signals without manual annotation.
1. Evaluation Frameworks
Model Evaluation supports four task-specific frameworks, each scoring outputs against a fixed set of four quality metrics. The evaluator LLM (Llama 3.1 8B Instruct) scores each row in your dataset against the metrics defined by the chosen framework.
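The scoring flow described above can be sketched in a few lines. This is an illustrative assumption of how an LLM-as-judge pass is typically structured, not the platform's actual internals: the prompt wording, the 1–5 scale, and the `FRAMEWORK_METRICS` mapping (which mirrors the framework tables below) are all hypothetical.

```python
import re

# Hypothetical mapping of framework -> metrics, mirroring the tables below.
FRAMEWORK_METRICS = {
    "text_summarization": ["Coherence", "Conciseness", "Hallucination", "Informativeness"],
    "general_assistant": ["Relevance", "Consistency", "Bias", "Toxicity"],
    "question_answering": ["Completeness", "Correctness", "Precision", "Toxicity"],
    "text_classification": ["Accuracy", "Precision", "Recall", "Consistency"],
}

def build_judge_prompt(framework: str, source: str, output: str) -> str:
    """Assemble an evaluation prompt asking the judge LLM to score each metric 1-5."""
    metrics = ", ".join(FRAMEWORK_METRICS[framework])
    return (
        f"Score the candidate output on these metrics (1-5 each): {metrics}.\n"
        f"Source:\n{source}\n\nCandidate output:\n{output}\n"
        "Reply with one 'Metric: score' line per metric."
    )

def parse_scores(reply: str) -> dict[str, int]:
    """Extract 'Metric: score' pairs from the judge model's free-text reply."""
    return {m: int(s) for m, s in re.findall(r"(\w+):\s*([1-5])", reply)}

# Parsing a hypothetical judge reply for one summarization row:
scores = parse_scores("Coherence: 4\nConciseness: 5\nHallucination: 1\nInformativeness: 4")
```

In this sketch the judge is called once per dataset row, and the parsed scores for all rows are what the Evaluation Results view would summarize.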
Text Summarization
Designed for models that generate summaries of source documents.
| Metric | What it measures |
|---|---|
| Coherence | Logical flow and clarity of the summary |
| Conciseness | Brevity while retaining core meaning |
| Hallucination | Fabricated or unsupported information not present in the source |
| Informativeness | Relevance and usefulness of captured content |
General Assistant
Designed for open-ended conversational or instruction-following models.
| Metric | What it measures |
|---|---|
| Relevance | Contextual appropriateness of the response |
| Consistency | Logical coherence and internal consistency |
| Bias | Skewed, unfair, or one-sided content |
| Toxicity | Offensive or inappropriate content |
Question Answering
Designed for models that answer factual or context-based questions.
| Metric | What it measures |
|---|---|
| Completeness | Whether the answer fully addresses the question |
| Correctness | Factual accuracy of the answer |
| Precision | Specificity and exactness of the answer |
| Toxicity | Offensive content in the response |
Text Classification
Designed for models that label or categorize text inputs.
| Metric | What it measures |
|---|---|
| Accuracy | Percentage of correctly classified examples |
| Precision | Ratio of true positives among predicted positives |
| Recall | Ability to identify all relevant instances |
| Consistency | Reliability across similar inputs |
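Accuracy, precision, and recall in the Text Classification table have standard definitions. As a reference, this is how they are conventionally computed for a binary labeling task; the code is independent of the platform and shown only to make the definitions concrete.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

m = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
# accuracy 3/4 = 0.75, precision 2/2 = 1.0, recall 2/3
```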
2. Model Evaluation Prompt
An optional free-text prompt that provides additional context or instructions to the evaluator model. Use this to:
- Focus the evaluation on specific quality aspects
- Add domain context (e.g. "This model is used for medical Q&A")
- Instruct the evaluator to penalize specific failure modes
Example:
Please ensure that the summarization does not introduce fabricated details.
Penalize heavily for hallucinations even if the summary is otherwise concise.
3. Results Storage
Evaluation results are stored automatically in an EOS bucket of your choice.
- Results are written to a folder named after your job (e.g. tir-job-12181424077) at the root of the selected dataset bucket.
- The Evaluation Results tab on the job detail page provides an in-platform view of the scores.
- Download raw result files directly from your EOS bucket for further analysis.
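Since EOS exposes an S3-compatible API, the raw result files can be fetched with any S3 client. Below is a sketch using `boto3`; the endpoint URL, environment-variable credential names, and flat destination layout are placeholder assumptions you would replace with your own EOS values.

```python
def result_prefix(job_name: str) -> str:
    """Object-key prefix for a job's result folder at the bucket root."""
    return f"{job_name}/"

def download_results(bucket: str, job_name: str, dest_dir: str = ".") -> list[str]:
    """Download every object under the job's result folder from an EOS bucket.

    Endpoint and credential names below are placeholders, not documented values.
    """
    import os
    import boto3  # third-party: pip install boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://eos.example.net",  # placeholder EOS endpoint
        aws_access_key_id=os.environ["EOS_ACCESS_KEY"],
        aws_secret_access_key=os.environ["EOS_SECRET_KEY"],
    )
    downloaded = []
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=result_prefix(job_name)
    )
    for page in pages:
        for obj in page.get("Contents", []):
            path = os.path.join(dest_dir, os.path.basename(obj["Key"]))
            s3.download_file(bucket, obj["Key"], path)
            downloaded.append(path)
    return downloaded
```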
4. Job Monitoring
Overview
Shows job configuration, assigned resources, current status, and metadata.
Events
Tracks pod lifecycle events: scheduling, container initialization, and termination. Useful for diagnosing startup failures.
Logs
Real-time logs from the evaluation job. Use these to monitor progress and diagnose data loading or scoring errors.
Evaluation Results
Displays a structured breakdown of scores across the four metrics for the selected framework. Each metric is scored by the evaluator model for every row evaluated.
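Once downloaded, per-row scores can be rolled up into per-metric averages for your own analysis. A minimal sketch, assuming the raw result files can be loaded as a list of per-row score dictionaries (the actual file format is not specified here):

```python
from statistics import mean

def aggregate_scores(rows: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across all evaluated rows."""
    metrics = rows[0].keys()
    return {m: round(mean(r[m] for r in rows), 2) for m in metrics}

# Hypothetical per-row scores for a summarization job:
rows = [
    {"Coherence": 4, "Conciseness": 5, "Hallucination": 1, "Informativeness": 4},
    {"Coherence": 3, "Conciseness": 4, "Hallucination": 2, "Informativeness": 5},
]
summary = aggregate_scores(rows)
# -> {"Coherence": 3.5, "Conciseness": 4.5, "Hallucination": 1.5, "Informativeness": 4.5}
```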
5. Job Actions
| Action | When to use | State required |
|---|---|---|
| Retry | Re-run a job that ended in a failed state | Failed |
| Terminate | Stop a job that is currently running | Running |
| Delete | Remove a job and its metadata permanently | Any state |
Retry restarts the job with the original settings — no re-configuration needed.
Delete removes the job record from Foundation Studio but does not delete results stored in your EOS bucket.