Quick Start Guide
Create your first Model Evaluation job and score your model's outputs using an automated LLM-as-judge approach. This guide walks through the complete setup flow.
What you need before you start
| Requirement | Details |
|---|---|
| Account | Active E2E AI Cloud account with access to Foundation Studio |
| Dataset | A dataset containing model outputs to evaluate, in EOS or on Hugging Face |
| Column names | Know the column names for input, model output, and optionally the reference/ground-truth answer |
| Hugging Face token | Required only if your Hugging Face dataset is private |
Download a sample dataset: here
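If you want to build a small dataset of your own instead, the sketch below shows what such a file might look like, assuming a CSV layout and using the column names that appear later in this guide (`question`, `answer`, `expected_answer`); the file name is a placeholder:

```python
import csv

# Hypothetical sample rows: an input question, the model's output,
# and an optional ground-truth answer for comparison.
rows = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris.",
     "expected_answer": "Paris"},
    {"question": "Who wrote 'Pride and Prejudice'?",
     "answer": "Jane Austen wrote 'Pride and Prejudice'.",
     "expected_answer": "Jane Austen"},
]

with open("sample_eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["question", "answer", "expected_answer"])
    writer.writeheader()
    writer.writerows(rows)
```

Upload the resulting file to an EOS bucket (or a Hugging Face dataset repo) to use it in the steps below.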
Step 1: Navigate to Model Evaluation
- In the TIR Dashboard sidebar, click Foundation Studio.
- From the dropdown, select Model Evaluation.
- You will land on the Manage Evaluation Jobs page.
Step 2: Create an evaluation job
Click the Create Job button (or the Click Here link).

Step 3: Configure the input dataset
On the Input Dataset page, fill in the following:
| Field | Description | Example |
|---|---|---|
| Job Name | A clear, descriptive name for the job | tir-job-12181052011 |
| Input Column | Column containing input prompts or questions | question |
| Output Column | Column containing the model's predicted outputs | answer |
| Reference Answer Column | (Optional) Ground-truth answers for comparison | expected_answer |
| Num Rows Limit | Maximum rows to evaluate. Use -1 for no limit | 500 or -1 |
Dataset type: EOS Dataset
- Select EOS Dataset as the dataset type.
- Click Choose to browse available datasets.
- Select the dataset and the specific file to use.
Dataset type: Hugging Face
- Select Hugging Face as the dataset type.
- Enter the Hugging Face dataset name.
- (Optional) If the dataset is private, select an existing Hugging Face integration or click Click Here to create one. Paste your token and click Create.
Click Next to proceed.
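The Num Rows Limit convention from the table above can be sketched as a simple truncation rule (illustrative only; the service applies this server-side):

```python
def apply_row_limit(rows, limit):
    """Return all rows when limit is -1, otherwise at most `limit` rows."""
    return list(rows) if limit == -1 else list(rows)[:limit]

# A limit of 500 evaluates only the first 500 rows; -1 evaluates everything.
dataset = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(1000)]
print(len(apply_row_limit(dataset, 500)))  # 500
print(len(apply_row_limit(dataset, -1)))   # 1000
```

Capping the row count is a cheap way to dry-run a new job configuration before evaluating a large dataset.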
Step 4: Select the evaluator model
On the Model Selection page, configure the evaluator:
| Field | Description |
|---|---|
| Evaluator Model | The LLM that will judge your model's outputs |
| Temperature | Controls output randomness (range: 0.0–1.0) |
| Top-P | Nucleus sampling probability (range: 0.001–1.0) |
| Max Tokens | Token limit for the evaluator's scoring output |
Available evaluator models:
| Model | Best for |
|---|---|
| Llama 3.1 8B Instruct | General-purpose evaluation with strong instruction-following |
Parameter guidance:
| Parameter | Conservative | Creative |
|---|---|---|
| Temperature | 0.2 (deterministic) | 1.0 (varied) |
| Top-P | 0.1 (focused) | 1.0 (all tokens) |
| Max Tokens | 512 (short) | 1024 (detailed) |
Click Next to proceed.
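The parameter ranges above can be captured in a small validation helper. This is a minimal sketch; the field names are assumptions for illustration, not the platform's API:

```python
def evaluator_config(model, temperature=0.2, top_p=0.1, max_tokens=512):
    """Build an evaluator configuration dict, enforcing the documented
    ranges: temperature 0.0-1.0, top-p 0.001-1.0."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in 0.0-1.0")
    if not 0.001 <= top_p <= 1.0:
        raise ValueError("top_p must be in 0.001-1.0")
    return {"evaluator_model": model, "temperature": temperature,
            "top_p": top_p, "max_tokens": max_tokens}

# Conservative settings favor consistent, repeatable scoring;
# creative settings produce more varied judge output.
conservative = evaluator_config("Llama 3.1 8B Instruct", 0.2, 0.1, 512)
creative = evaluator_config("Llama 3.1 8B Instruct", 1.0, 1.0, 1024)
```

For evaluation work, the conservative end is usually preferable: a judge that scores the same output differently on each run is hard to compare across jobs.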
Step 5: Select the evaluation framework
On the Framework Selection page:
- Choose an evaluation framework that matches your task:

| Framework | Use when |
|---|---|
| Text Summarization | Your model generates summaries of documents |
| General Assistant | Your model handles general conversation or instruction-following |
| Question Answering | Your model answers factual or context-based questions |
| Text Classification | Your model classifies text into predefined categories |

- Model Evaluation Prompt (optional): Provide additional context or instructions for the evaluator. Example: "Please ensure that the summarization does not introduce fabricated details."
- Select a result dataset where scores will be stored. Results are saved to a folder named after your job at the root of the selected EOS bucket.
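The result location follows a predictable pattern, sketched below with placeholder bucket and job names:

```python
def result_prefix(bucket, job_name):
    """Results land in a folder named after the job, at the bucket root."""
    return f"{bucket}/{job_name}/"

print(result_prefix("my-eos-bucket", "tir-job-12181052011"))
# my-eos-bucket/tir-job-12181052011/
```

Knowing this prefix up front makes it easy to script downloads of scores once the job finishes.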
Step 6: Review and launch
Review your configuration on the Summary page, then click Launch.
The job appears in the Manage Evaluation Jobs list. Processing time depends on dataset size and evaluator model selected.
Step 7: Monitor and review results
Click on the job to view details:
| Tab | What it shows |
|---|---|
| Overview | Job configuration, status, and resource details |
| Events | Pod scheduling and container start events |
| Logs | Real-time job logs to monitor progress or diagnose issues |
| Evaluation Results | Scores across the 4 framework-specific metrics |
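Each evaluated row carries a score per metric, and a common next step is averaging them across the dataset. A minimal sketch, assuming hypothetical metric names (the actual four metrics depend on the framework you selected):

```python
import statistics

# Hypothetical per-row scores as an evaluation job might emit them.
results = [
    {"faithfulness": 0.9, "relevance": 0.8, "coherence": 0.95, "fluency": 1.0},
    {"faithfulness": 0.7, "relevance": 0.9, "coherence": 0.85, "fluency": 0.9},
]

# Average each metric across all evaluated rows.
averages = {
    metric: statistics.mean(row[metric] for row in results)
    for metric in results[0]
}
print(averages)
```

Per-metric averages give a quick headline number, but skimming the lowest-scoring individual rows in the Evaluation Results tab is usually more informative for diagnosing model weaknesses.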