---
title: Features
---

# Features

> Model Evaluation in Foundation Studio provides automated, framework-specific scoring of model outputs using an LLM-as-judge approach, giving you structured quality signals without manual annotation.

---

## 1. Evaluation Frameworks

Model Evaluation supports four task-specific frameworks, each with a fixed set of four quality metrics. The evaluator LLM (**Llama 3.1 8B Instruct**) scores every row in your dataset against the metrics defined by the framework you choose.
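As a quick reference, the framework-to-metric mapping described in the tables that follow can be written as a lookup table. This is only an illustrative sketch: the names `FRAMEWORK_METRICS` and `metrics_for` are hypothetical and are not part of any Foundation Studio API; the framework and metric names are copied from this page.

```python
# Illustrative lookup table; not a platform API.
# Framework and metric names are taken from the tables on this page.
FRAMEWORK_METRICS: dict[str, tuple[str, ...]] = {
    "Text Summarization": ("Coherence", "Conciseness", "Hallucination", "Informativeness"),
    "General Assistant": ("Relevance", "Consistency", "Bias", "Toxicity"),
    "Question Answering": ("Completeness", "Correctness", "Precision", "Toxicity"),
    "Text Classification": ("Accuracy", "Precision", "Recall", "Consistency"),
}


def metrics_for(framework: str) -> tuple[str, ...]:
    """Return the four metrics for a framework, or raise on an unknown name."""
    try:
        return FRAMEWORK_METRICS[framework]
    except KeyError:
        raise ValueError(f"Unknown framework: {framework!r}") from None


if __name__ == "__main__":
    for name, metrics in FRAMEWORK_METRICS.items():
        print(f"{name}: {', '.join(metrics)}")
```

A table like this can be handy when validating a dataset or a results file against the framework selected for a job.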

### Text Summarization

Designed for models that generate summaries of source documents.

| Metric | What it measures |
|--------|-----------------|
| **Coherence** | Logical flow and clarity of the summary |
| **Conciseness** | Brevity while retaining core meaning |
| **Hallucination** | Fabricated or unsupported information not present in the source |
| **Informativeness** | Relevance and usefulness of captured content |

### General Assistant

Designed for open-ended conversational or instruction-following models.

| Metric | What it measures |
|--------|-----------------|
| **Relevance** | Contextual appropriateness of the response |
| **Consistency** | Logical coherence and internal consistency |
| **Bias** | Skewed, unfair, or one-sided content |
| **Toxicity** | Offensive or inappropriate content |

### Question Answering

Designed for models that answer factual or context-based questions.

| Metric | What it measures |
|--------|-----------------|
| **Completeness** | Whether the answer fully addresses the question |
| **Correctness** | Factual accuracy of the answer |
| **Precision** | Specificity and exactness of the answer |
| **Toxicity** | Offensive content in the response |

### Text Classification

Designed for models that label or categorize text inputs.

| Metric | What it measures |
|--------|-----------------|
| **Accuracy** | Percentage of correctly classified examples |
| **Precision** | Ratio of true positives among predicted positives |
| **Recall** | Ability to identify all relevant instances |
| **Consistency** | Reliability across similar inputs |
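
For orientation, the conventional definitions of accuracy, precision, and recall for a single positive class can be sketched as below. This page does not specify how the evaluator LLM derives its scores, so treat this as a reference for the standard formulas rather than a description of the platform's scoring. The function name `classification_metrics` and the sample labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, and recall for one positive class.

    Standard textbook definitions; not the platform's scoring logic.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        # Share of all examples whose predicted label matches the true label.
        "accuracy": correct / len(pairs) if pairs else 0.0,
        # Of everything predicted positive, how much was truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything truly positive, how much was found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical sample labels for illustration only.
    y_true = ["spam", "spam", "ham", "ham", "spam"]
    y_pred = ["spam", "ham", "ham", "spam", "spam"]
    print(classification_metrics(y_true, y_pred, positive="spam"))
```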

---

## 2. Model Evaluation Prompt

An optional free-text prompt that provides additional context or instructions to the evaluator model. Use this to:

- Focus the evaluation on specific quality aspects
- Add domain context (e.g. "This model is used for medical Q&A")
- Instruct the evaluator to penalize specific failure modes

**Example:**
```
Please ensure that the summarization does not introduce fabricated details.
Penalize heavily for hallucinations even if the summary is otherwise concise.
```
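
If you run many evaluations, a prompt like the one above could be assembled from reusable pieces. The sketch below is a hypothetical helper, not a platform feature: `build_evaluation_prompt` and its parameters are assumptions made for illustration, and the output is simply free text to paste into the prompt field.

```python
def build_evaluation_prompt(domain_context="", focus_aspects=(), penalize=()):
    """Assemble an optional evaluator prompt from its three typical ingredients.

    Hypothetical helper: the platform only expects free text.
    """
    parts = []
    if domain_context:
        parts.append(domain_context.strip())
    if focus_aspects:
        parts.append("Focus the evaluation on: " + ", ".join(focus_aspects) + ".")
    if penalize:
        parts.append("Penalize heavily for: " + ", ".join(penalize) + ".")
    return "\n".join(parts)


if __name__ == "__main__":
    print(
        build_evaluation_prompt(
            domain_context="This model is used for medical Q&A.",
            focus_aspects=["factual accuracy"],
            penalize=["hallucinations"],
        )
    )
```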

---

## 3. Results Storage

Evaluation results are stored automatically in an EOS bucket of your choice.

- Results are written to a folder named after your job (e.g. `tir-job-12181424077`) at the root of the selected dataset bucket.
- The **Evaluation Results** tab on the job detail page provides an in-platform view of the scores.
- Download raw result files directly from your EOS bucket for further analysis.
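
Because results live under a folder named after the job, picking out one job's files from a bucket listing comes down to a prefix match. The sketch below assumes you already have a list of object keys from whatever tool you use to browse the bucket; the function name `job_result_keys` and the file names in the example are hypothetical, since this page does not document the result file layout.

```python
def job_result_keys(object_keys, job_name):
    """Return the keys that sit under the job's folder at the bucket root."""
    prefix = job_name.rstrip("/") + "/"
    return [key for key in object_keys if key.startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical bucket listing; real file names may differ.
    keys = [
        "tir-job-12181424077/results.json",
        "tir-job-12181424077/details/rows.csv",
        "tir-job-99999999999/results.json",
        "dataset/train.jsonl",
    ]
    for key in job_result_keys(keys, "tir-job-12181424077"):
        print(key)
```

Appending the trailing slash before matching avoids accidentally including another job whose name merely starts with the same characters.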

---

## 4. Job Monitoring

### Overview

Shows job configuration, assigned resources, current status, and metadata.

### Events

Tracks pod lifecycle events: scheduling, container initialization, and termination. Useful for diagnosing startup failures.

### Logs

Real-time logs from the evaluation job. Use these to monitor progress and diagnose data loading or scoring errors.

### Evaluation Results

Displays a structured breakdown of scores across the 4 metrics for the selected framework. Each metric is scored by the evaluator model for every row evaluated.
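
Since every row receives a score per metric, a per-metric average is a natural first summary when you analyze downloaded results yourself. The sketch below assumes each row has been loaded as a dictionary mapping metric names to numeric scores; the actual result file format and score scale are not documented on this page, so the row shape, the sample values, and the name `summarize_scores` are all assumptions.

```python
from statistics import mean


def summarize_scores(rows, metrics):
    """Average each metric over all rows that contain a score for it.

    Assumes rows are dicts like {"Coherence": 4, ...}; this shape is hypothetical.
    """
    summary = {}
    for metric in metrics:
        values = [row[metric] for row in rows if metric in row]
        # None signals that no row carried a score for this metric.
        summary[metric] = mean(values) if values else None
    return summary


if __name__ == "__main__":
    # Made-up scores for illustration only.
    rows = [
        {"Coherence": 4, "Conciseness": 3},
        {"Coherence": 2, "Conciseness": 5},
    ]
    print(summarize_scores(rows, ["Coherence", "Conciseness", "Hallucination"]))
```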

---

## 5. Job Actions

| Action | When to use | State required |
|--------|------------|----------------|
| **Retry** | Re-run a job that ended in a failed state | Failed |
| **Terminate** | Stop a job that is currently running | Running |
| **Delete** | Remove a job and its metadata permanently | Any state |

**Retry** restarts the job with the original settings; no reconfiguration is needed.

**Delete** removes the job record from Foundation Studio but does not delete results stored in your EOS bucket.
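
The state requirements in the table above can be expressed as a small lookup, which may help if you script around job states. This is an illustrative sketch only: `ACTION_REQUIRED_STATE` and `allowed_actions` are hypothetical names, and the state string `"Completed"` in the example is an assumption, since this page names only the Failed and Running states.

```python
# Required state per action, copied from the table above.
# None means the action is available in any state.
ACTION_REQUIRED_STATE = {
    "Retry": "Failed",
    "Terminate": "Running",
    "Delete": None,
}


def allowed_actions(state):
    """Return the actions the table permits for a job in the given state."""
    return [
        action
        for action, required in ACTION_REQUIRED_STATE.items()
        if required is None or required == state
    ]


if __name__ == "__main__":
    # "Completed" is a hypothetical state name used for illustration.
    for state in ("Failed", "Running", "Completed"):
        print(state, "->", allowed_actions(state))
```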


---