Model Evaluation
Assess the quality of any LLM's outputs using automated LLM-as-judge scoring. Choose an evaluation framework, connect your dataset, and get structured quality metrics — no manual annotation required.
Quick Start
What can you do with Model Evaluation?
Evaluate model outputs from any LLM using an automated LLM-as-judge approach
Score outputs on four framework-specific metrics per evaluation run
Use datasets from EOS storage or Hugging Face as input
Use Llama 3.1 8B Instruct as the evaluator model (LLM-as-judge)
Store structured results in your EOS bucket for download and analysis
Manage jobs with Retry, Terminate, and Delete actions
Key Characteristics
Approach
LLM-as-Judge Scoring
A capable evaluator LLM scores each output against framework-specific criteria — no manual annotation required at scale.
Data
Flexible Dataset Input
Use EOS datasets or Hugging Face datasets. Specify input, output, and an optional reference answer column.
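As an illustration, the column mapping above might be expressed like the sketch below. The field names (`dataset_source`, `input_column`, `output_column`, `reference_column`) are hypothetical, chosen for clarity, and are not the platform's documented schema.

```python
# Hypothetical column mapping for an evaluation job.
# Field names are illustrative, not TIR's actual schema.
column_config = {
    "dataset_source": "huggingface",   # or "eos" for an EOS-stored dataset
    "dataset_id": "squad",             # example Hugging Face dataset id
    "input_column": "question",        # prompt that was given to the model
    "output_column": "model_answer",   # model output to be scored by the judge
    "reference_column": "answers",     # optional ground-truth reference answer
}
```

The reference column can be omitted for frameworks that score outputs without a ground-truth answer.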
Evaluator
Llama 3.1 8B Instruct as Judge
Evaluation jobs use Llama 3.1 8B Instruct as the LLM judge to score model outputs against framework-specific criteria.
Best Practices
Best Practices for Model Evaluation
Choose the evaluation framework that reflects your model's actual use case. Using the wrong framework produces irrelevant scores.
Set Num Rows Limit to 100–500 rows to validate your dataset and column configuration before running the full evaluation.
Set Temperature to 0.0–0.2 for deterministic, reproducible scores across repeated evaluation runs.
Providing ground-truth answers enables comparison-based scoring, which produces more accurate results for QA and classification tasks.
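The practices above can be combined into a smoke-test configuration before committing to a full run. This is a minimal sketch; the parameter names (`framework`, `num_rows_limit`, `temperature`, `use_reference`) are assumptions, not the platform's documented job schema.

```python
# Illustrative smoke-test settings applying the best practices above.
# Keys are assumed names, not TIR's actual job parameters.
smoke_test = {
    "framework": "qa_accuracy",   # hypothetical framework name; pick one matching your use case
    "num_rows_limit": 200,        # 100-500 rows to validate dataset and columns cheaply
    "temperature": 0.0,           # 0.0-0.2 for deterministic, reproducible judge scores
    "use_reference": True,        # comparison-based scoring when ground truth exists
}

# Once the smoke test looks right, lift the row cap for the full evaluation.
full_run = {**smoke_test, "num_rows_limit": None}
```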
API Reference
Model Evaluation API Reference
Programmatically create, list, manage, and delete model evaluation jobs in TIR.
List evaluation jobs: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/
Create an evaluation job: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/
Get evaluation job details: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/{job_id}/
Retry or terminate a job: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/{job_id}/
Delete an evaluation job: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/{job_id}/