
Quick Start Guide

Create your first Model Evaluation job and score your model's outputs using an automated LLM-as-judge approach. This guide walks through the complete setup flow.


What you need before you start

Requirement        | Details
Account            | Active E2E AI Cloud account with access to Foundation Studio
Dataset            | A dataset containing model outputs to evaluate, in EOS or on Hugging Face
Column names       | The column names for input, model output, and (optionally) the reference/ground-truth answer
Hugging Face token | Required only if your Hugging Face dataset is private

A sample dataset is available for download if you want to try the flow before using your own data.


Step 1: Navigate to Model Evaluation

  1. In the TIR Dashboard sidebar, click Foundation Studio.
  2. From the dropdown, select Model Evaluation.
  3. You will land on the Manage Evaluation Jobs page.

Step 2: Create an evaluation job

Click Create Job or the Click Here button.


Step 3: Configure the input dataset

On the Input Dataset page, fill in the following:

Field                   | Description                                      | Example
Job Name                | A clear, descriptive name for the job            | tir-job-12181052011
Input Column            | Column containing input prompts or questions     | question
Output Column           | Column containing the model's predicted outputs  | answer
Reference Answer Column | (Optional) Ground-truth answers for comparison   | expected_answer
Num Rows Limit          | Maximum rows to evaluate; use -1 for no limit    | 500 or -1
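Before filling in the form, it can help to confirm that your dataset file actually contains the column names you plan to enter. The sketch below is a hypothetical pre-flight check (not part of the product); the column names match the examples in the table above.

```python
# Hypothetical pre-flight check: verify that the dataset file contains the
# column names you will enter in the Input Dataset form. "question" and
# "answer" mirror the example values above; adjust to your own schema.
import pandas as pd

REQUIRED_COLUMNS = ["question", "answer"]  # input column and output column

def missing_columns(path):
    """Return the required columns that are absent from the dataset file."""
    df = pd.read_csv(path)
    return [col for col in REQUIRED_COLUMNS if col not in df.columns]
```

Running this on your file before upload catches a misnamed column early, rather than after the job has launched.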

Dataset type: EOS Dataset

  1. Select EOS Dataset as the dataset type.
  2. Click Choose to browse available datasets.
  3. Select the dataset and the specific file to use.

Dataset type: Hugging Face

  1. Select Hugging Face as the dataset type.
  2. Enter the Hugging Face dataset name.
  3. (Optional) If the dataset is private, select an existing Hugging Face integration or click Click Here to create one. Paste your token and click Create.

Click Next to proceed.


Step 4: Select the evaluator model

On the Model Selection page, configure the evaluator:

Field           | Description
Evaluator Model | The LLM that will judge your model's outputs
Temperature     | Controls output randomness (range: 0.0–1.0)
Top-P           | Nucleus sampling probability (range: 0.001–1.0)
Max Tokens      | Token limit for the evaluator's scoring output

Available evaluator models:

Model                 | Best for
Llama 3.1 8B Instruct | General-purpose evaluation with strong instruction-following

Parameter guidance:

Parameter   | Conservative        | Creative
Temperature | 0.2 (deterministic) | 1.0 (varied)
Top-P       | 0.1 (focused)       | 1.0 (all tokens)
Max Tokens  | 512 (short)         | 1024 (detailed)
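To make the presets concrete, here is an illustrative sketch of how Temperature, Top-P, and Max Tokens typically map onto an OpenAI-compatible chat-completions payload. The model id and payload shape are assumptions for illustration, not the platform's documented API.

```python
# Illustrative only: map the conservative/creative presets above onto a
# typical chat-completions request body. The model id is a placeholder.
def judge_payload(prompt, conservative=True):
    """Build an evaluator request using the conservative or creative presets."""
    preset = (
        {"temperature": 0.2, "top_p": 0.1, "max_tokens": 512}
        if conservative
        else {"temperature": 1.0, "top_p": 1.0, "max_tokens": 1024}
    )
    return {
        "model": "llama-3.1-8b-instruct",  # placeholder id for the evaluator
        "messages": [{"role": "user", "content": prompt}],
        **preset,
    }
```

For scoring, the conservative preset is usually the safer default: a low temperature and top-p make the judge's verdicts more repeatable across runs.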

Click Next to proceed.


Step 5: Select the evaluation framework

On the Framework Selection page:

  1. Choose an evaluation framework that matches your task:

    Framework           | Use when
    Text Summarization  | Your model generates summaries of documents
    General Assistant   | Your model handles general conversation or instruction-following
    Question Answering  | Your model answers factual or context-based questions
    Text Classification | Your model classifies text into predefined categories
  2. Model Evaluation Prompt (optional): Provide additional context or instructions for the evaluator.

    • Example: Please ensure that the summarization does not introduce fabricated details.
  3. Select a result dataset where scores will be stored. Results are saved to a folder named after your job at the root of the selected EOS bucket.


Step 6: Review and launch

Review your configuration on the Summary page, then click Launch.

The job appears in the Manage Evaluation Jobs list. Processing time depends on the dataset size and the evaluator model selected.


Step 7: Monitor and review results

Click on the job to view details:

Tab                | What it shows
Overview           | Job configuration, status, and resource details
Events             | Pod scheduling and container start events
Logs               | Real-time job logs to monitor progress or diagnose issues
Evaluation Results | Scores across the four framework-specific metrics
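Once the job finishes, you can also work with the scores offline. The sketch below assumes you have downloaded a result file from the job's folder in your EOS bucket; the CSV layout (one column per metric) and the metric names used in the test are assumptions, not a documented schema.

```python
# Hypothetical post-processing: average each metric column in a downloaded
# results CSV. The per-metric column layout is an assumption for illustration.
import csv

def average_scores(path, metrics):
    """Average each metric column across all evaluated rows."""
    totals = {m: 0.0 for m in metrics}
    n = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for m in metrics:
                totals[m] += float(row[m])
            n += 1
    return {m: totals[m] / n for m in metrics} if n else {}
```

A summary like this is useful for comparing two evaluation jobs side by side, for example before and after a model fine-tune.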

Next steps

  • Features — Explore all evaluation capabilities and framework metrics in detail.
  • FAQs — Troubleshoot common issues.