
Quick Start Guide

Create your first Model Evaluation job and score your model's outputs using an automated LLM-as-judge approach. This guide walks through the complete setup flow.


What you need before you start

Requirement        | Details
Account            | Active E2E AI Cloud account with access to Foundation Studio
Dataset            | A dataset containing model outputs to evaluate, in EOS or on Hugging Face
Column names       | The column names for input, model output, and (optionally) the reference/ground-truth answer
Hugging Face token | Required only if your Hugging Face dataset is private

A sample dataset is available for download if you want to try the flow before using your own data.


Step 1: Navigate to Model Evaluation

  1. In the TIR Dashboard sidebar, click Foundation Studio.
  2. From the dropdown, select Model Evaluation.
  3. You will land on the Manage Evaluation Jobs page.

Step 2: Create an evaluation job

Click Create Job or the Click Here button.


Step 3: Configure the input dataset

On the Input Dataset page, fill in the following:

Field                   | Description                                      | Example
Job Name                | A clear, descriptive name for the job            | tir-job-12181052011
Input Column            | Column containing input prompts or questions     | question
Output Column           | Column containing the model's predicted outputs  | answer
Reference Answer Column | (Optional) Ground-truth answers for comparison   | expected_answer
Num Rows Limit          | Maximum rows to evaluate; use -1 for no limit    | 500 or -1
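Before filling in the form, it can help to confirm that your dataset file actually contains the column names you plan to enter. The sketch below is a hypothetical pre-flight check (not part of the product); the column names match the examples in the table above.

```python
# Hypothetical pre-flight check: verify that the dataset file contains the
# column names you will enter in the Input Dataset form. "question" and
# "answer" mirror the example values above; adjust to your own schema.
import pandas as pd

REQUIRED_COLUMNS = ["question", "answer"]  # input column and output column

def missing_columns(path):
    """Return the required columns that are absent from the dataset file."""
    df = pd.read_csv(path)
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]
```

Running this on your file before upload catches a misnamed column early, rather than after the job has launched.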

Dataset type: EOS Dataset

  1. Select EOS Dataset as the dataset type.
  2. Click Choose to browse available datasets.
  3. Select the dataset and the specific file to use.

Dataset type: Hugging Face

  1. Select Hugging Face as the dataset type.
  2. Enter the Hugging Face dataset name.
  3. (Optional) If the dataset is private, select an existing Hugging Face integration or click Click Here to create one. Paste your token and click Create.

Click Next to proceed.


Step 4: Select the evaluator model

On the Model Selection page, configure the evaluator:

Field           | Description
Evaluator Model | The LLM that will judge your model's outputs
Temperature     | Controls output randomness (range: 0.0–1.0)
Top-P           | Nucleus sampling probability (range: 0.001–1.0)
Max Tokens      | Token limit for the evaluator's scoring output

Available evaluator models:

Model                 | Best for
Llama 3.1 8B Instruct | General-purpose evaluation with strong instruction-following

Parameter guidance:

Parameter   | Conservative        | Creative
Temperature | 0.2 (deterministic) | 1.0 (varied)
Top-P       | 0.1 (focused)       | 1.0 (all tokens)
Max Tokens  | 512 (short)         | 1024 (detailed)
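To make the presets concrete, here is an illustrative sketch of how Temperature, Top-P, and Max Tokens typically map onto an OpenAI-compatible chat-completions payload. The model id and payload shape are assumptions for illustration, not the platform's documented API.

```python
# Illustrative only: map the conservative/creative presets above onto a
# typical chat-completions request body. The model id is a placeholder.
def judge_payload(prompt, conservative=True):
    """Build an evaluator request using the conservative or creative presets."""
    preset = (
        {"temperature": 0.2, "top_p": 0.1, "max_tokens": 512}
        if conservative
        else {"temperature": 1.0, "top_p": 1.0, "max_tokens": 1024}
    )
    return {
        "model": "llama-3.1-8b-instruct",  # placeholder id for the evaluator
        "messages": [{"role": "user", "content": prompt}],
        **preset,
    }
```

For scoring, the conservative preset is usually the safer default: a low temperature and top-p make the judge's verdicts more repeatable across runs.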

Click Next to proceed.


Step 5: Select the evaluation framework

On the Framework Selection page:

  1. Choose an evaluation framework that matches your task:

    Framework           | Use when
    Text Summarization  | Your model generates summaries of documents
    General Assistant   | Your model handles general conversation or instruction-following
    Question Answering  | Your model answers factual or context-based questions
    Text Classification | Your model classifies text into predefined categories
  2. Model Evaluation Prompt (optional): Provide additional context or instructions for the evaluator.

    • Example: Please ensure that the summarization does not introduce fabricated details.
  3. Select a result dataset where scores will be stored. Results are saved to a folder named after your job at the root of the selected EOS bucket.


Step 6: Review and launch

Review your configuration on the Summary page, then click Launch.

The job appears in the Manage Evaluation Jobs list. Processing time depends on the dataset size and the evaluator model selected.


Step 7: Monitor and review results

Click on the job to view details:

Tab                | What it shows
Overview           | Job configuration, status, and resource details
Events             | Pod scheduling and container start events
Logs               | Real-time job logs to monitor progress or diagnose issues
Evaluation Results | Scores across the four framework-specific metrics
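Once the job finishes, you can also work with the scores offline. The sketch below assumes you have downloaded a result file from the job's folder in your EOS bucket; the CSV layout (one column per metric) and the metric names used in the test are assumptions, not a documented schema.

```python
# Hypothetical post-processing: average each metric column in a downloaded
# results CSV. The per-metric column layout is an assumption for illustration.
import csv

def average_scores(path, metrics):
    """Average each metric column across all evaluated rows."""
    totals = {m: 0.0 for m in metrics}
    n = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for m in metrics:
                totals[m] += float(row[m])
            n += 1
    return {m: totals[m] / n for m in metrics} if n else {}
```

A summary like this is useful for comparing two evaluation jobs side by side, for example before and after a model fine-tune.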

Next steps

  • Features — Explore all evaluation capabilities and framework metrics in detail.
  • FAQs — Troubleshoot common issues.