Features
Model Evaluation in Foundation Studio provides automated, framework-specific scoring of model outputs using an LLM-as-judge approach — giving you structured quality signals without manual annotation.
1. Evaluation Frameworks
Model Evaluation supports four task-specific frameworks, each scoring outputs against a fixed set of four quality metrics. The evaluator LLM (Llama 3.1 8B Instruct) scores each row in your dataset against the metrics defined by the chosen framework.
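The scoring flow described above can be sketched in a few lines. This is an illustrative assumption of how an LLM-as-judge pass is typically structured, not the platform's actual internals: the prompt wording, the 1–5 scale, and the `FRAMEWORK_METRICS` mapping (which mirrors the framework tables below) are all hypothetical.

```python
import re

# Hypothetical mapping of framework -> metrics, mirroring the tables below.
FRAMEWORK_METRICS = {
    "text_summarization": ["Coherence", "Conciseness", "Hallucination", "Informativeness"],
    "general_assistant": ["Relevance", "Consistency", "Bias", "Toxicity"],
    "question_answering": ["Completeness", "Correctness", "Precision", "Toxicity"],
    "text_classification": ["Accuracy", "Precision", "Recall", "Consistency"],
}

def build_judge_prompt(framework: str, source: str, output: str) -> str:
    """Assemble an evaluation prompt asking the judge LLM to score each metric 1-5."""
    metrics = ", ".join(FRAMEWORK_METRICS[framework])
    return (
        f"Score the candidate output on these metrics (1-5 each): {metrics}.\n"
        f"Source:\n{source}\n\nCandidate output:\n{output}\n"
        "Reply with one 'Metric: score' line per metric."
    )

def parse_scores(reply: str) -> dict[str, int]:
    """Extract 'Metric: score' pairs from the judge model's free-text reply."""
    return {m: int(s) for m, s in re.findall(r"(\w+):\s*([1-5])", reply)}

# Parsing a hypothetical judge reply for one summarization row:
scores = parse_scores("Coherence: 4\nConciseness: 5\nHallucination: 1\nInformativeness: 4")
```

In this sketch the judge is called once per dataset row, and the parsed scores for all rows are what the Evaluation Results view would summarize.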
Text Summarization
Designed for models that generate summaries of source documents.
| Metric | What it measures |
|---|---|
| Coherence | Logical flow and clarity of the summary |
| Conciseness | Brevity while retaining core meaning |
| Hallucination | Fabricated or unsupported information not present in the source |
| Informativeness | Relevance and usefulness of captured content |
General Assistant
Designed for open-ended conversational or instruction-following models.
| Metric | What it measures |
|---|---|
| Relevance | Contextual appropriateness of the response |
| Consistency | Logical coherence and internal consistency |
| Bias | Skewed, unfair, or one-sided content |
| Toxicity | Offensive or inappropriate content |
Question Answering
Designed for models that answer factual or context-based questions.
| Metric | What it measures |
|---|---|
| Completeness | Whether the answer fully addresses the question |
| Correctness | Factual accuracy of the answer |
| Precision | Specificity and exactness of the answer |
| Toxicity | Offensive content in the response |
Text Classification
Designed for models that label or categorize text inputs.
| Metric | What it measures |
|---|---|
| Accuracy | Percentage of correctly classified examples |
| Precision | Ratio of true positives among predicted positives |
| Recall | Ability to identify all relevant instances |
| Consistency | Reliability across similar inputs |
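Accuracy, precision, and recall in the Text Classification table have standard definitions. As a reference, this is how they are conventionally computed for a binary labeling task; the code is independent of the platform and shown only to make the definitions concrete.

```python
def classification_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

m = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
# accuracy 3/4 = 0.75, precision 2/2 = 1.0, recall 2/3
```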
2. Model Evaluation Prompt
An optional free-text prompt that provides additional context or instructions to the evaluator model. Use this to:
- Focus the evaluation on specific quality aspects
- Add domain context (e.g. "This model is used for medical Q&A")
- Instruct the evaluator to penalize specific failure modes
Example:
Please ensure that the summarization does not introduce fabricated details.
Penalize heavily for hallucinations even if the summary is otherwise concise.
3. Results Storage
Evaluation results are stored automatically in an EOS bucket of your choice.
- Results are written to a folder named after your job (e.g. tir-job-12181424077) at the root of the selected dataset bucket.
- The Evaluation Results tab on the job detail page provides an in-platform view of the scores.
- Download raw result files directly from your EOS bucket for further analysis.
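Since EOS exposes an S3-compatible API, the raw result files can be fetched with any S3 client. Below is a sketch using `boto3`; the endpoint URL, environment-variable credential names, and flat destination layout are placeholder assumptions you would replace with your own EOS values.

```python
def result_prefix(job_name: str) -> str:
    """Object-key prefix for a job's result folder at the bucket root."""
    return f"{job_name}/"

def download_results(bucket: str, job_name: str, dest_dir: str = ".") -> list[str]:
    """Download every object under the job's result folder from an EOS bucket.

    Endpoint and credential names below are placeholders, not documented values.
    """
    import os
    import boto3  # third-party: pip install boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://eos.example.net",  # placeholder EOS endpoint
        aws_access_key_id=os.environ["EOS_ACCESS_KEY"],
        aws_secret_access_key=os.environ["EOS_SECRET_KEY"],
    )
    downloaded = []
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=result_prefix(job_name)
    )
    for page in pages:
        for obj in page.get("Contents", []):
            path = os.path.join(dest_dir, os.path.basename(obj["Key"]))
            s3.download_file(bucket, obj["Key"], path)
            downloaded.append(path)
    return downloaded
```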
4. Job Monitoring
Overview
Shows job configuration, assigned resources, current status, and metadata.
Events
Tracks pod lifecycle events: scheduling, container initialization, and termination. Useful for diagnosing startup failures.
Logs
Real-time logs from the evaluation job. Use these to monitor progress and diagnose data loading or scoring errors.
Evaluation Results
Displays a structured breakdown of scores across the four metrics for the selected framework. Each metric is scored by the evaluator model for every row evaluated.
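Once downloaded, per-row scores can be rolled up into per-metric averages for your own analysis. A minimal sketch, assuming the raw result files can be loaded as a list of per-row score dictionaries (the actual file format is not specified here):

```python
from statistics import mean

def aggregate_scores(rows: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric across all evaluated rows."""
    metrics = rows[0].keys()
    return {m: round(mean(r[m] for r in rows), 2) for m in metrics}

# Hypothetical per-row scores for a summarization job:
rows = [
    {"Coherence": 4, "Conciseness": 5, "Hallucination": 1, "Informativeness": 4},
    {"Coherence": 3, "Conciseness": 4, "Hallucination": 2, "Informativeness": 5},
]
summary = aggregate_scores(rows)
# -> {"Coherence": 3.5, "Conciseness": 4.5, "Hallucination": 1.5, "Informativeness": 4.5}
```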
5. Job Actions
| Action | When to use | State required |
|---|---|---|
| Retry | Re-run a job that ended in a failed state | Failed |
| Terminate | Stop a job that is currently running | Running |
| Delete | Remove a job and its metadata permanently | Any state |
Retry restarts the job with the original settings — no re-configuration needed.
Delete removes the job record from Foundation Studio but does not delete results stored in your EOS bucket.