Model Evaluation
Assess the quality of any LLM's outputs using automated LLM-as-judge scoring. Choose an evaluation framework, connect your dataset, and get structured quality metrics — no manual annotation required.
Quick Start
What can you do with Model Evaluation?
Evaluate model outputs from any LLM using an automated LLM-as-judge approach
Score outputs on four framework-specific metrics per evaluation run
Use datasets from EOS storage or Hugging Face as input
Use Llama 3.1 8B Instruct as the evaluator model (LLM-as-judge)
Store structured results in your EOS bucket for download and analysis
Manage jobs with Retry, Terminate, and Delete actions
Key Characteristics
Approach
LLM-as-Judge Scoring
A capable evaluator LLM scores each output against framework-specific criteria — no manual annotation required at scale.
Data
Flexible Dataset Input
Use EOS datasets or Hugging Face datasets. Specify input, output, and an optional reference answer column.
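As an illustration, the column mapping above might be expressed like the sketch below. The field names (`dataset_source`, `input_column`, `output_column`, `reference_column`) are hypothetical, chosen for clarity, and are not the platform's documented schema.

```python
# Hypothetical column mapping for an evaluation job.
# Field names are illustrative, not TIR's actual schema.
column_config = {
    "dataset_source": "huggingface",   # or "eos" for an EOS-stored dataset
    "dataset_id": "squad",             # example Hugging Face dataset id
    "input_column": "question",        # prompt that was given to the model
    "output_column": "model_answer",   # model output to be scored by the judge
    "reference_column": "answers",     # optional ground-truth reference answer
}
```

The reference column can be omitted for frameworks that score outputs without a ground-truth answer.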
Evaluator
Llama 3.1 8B Instruct as Judge
Evaluation jobs use Llama 3.1 8B Instruct as the LLM judge to score model outputs against framework-specific criteria.
Best Practices
Best Practices for Model Evaluation
Choose the evaluation framework that reflects your model's actual use case. Using the wrong framework produces irrelevant scores.
Set Num Rows Limit to 100–500 rows to validate your dataset and column configuration before running the full evaluation.
Set Temperature to 0.0–0.2 for deterministic, reproducible scores across repeated evaluation runs.
Providing ground-truth answers enables comparison-based scoring, which produces more accurate results for QA and classification tasks.
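The practices above can be combined into a smoke-test configuration before committing to a full run. This is a minimal sketch; the parameter names (`framework`, `num_rows_limit`, `temperature`, `use_reference`) are assumptions, not the platform's documented job schema.

```python
# Illustrative smoke-test settings applying the best practices above.
# Keys are assumed names, not TIR's actual job parameters.
smoke_test = {
    "framework": "qa_accuracy",   # hypothetical framework name; pick one matching your use case
    "num_rows_limit": 200,        # 100-500 rows to validate dataset and columns cheaply
    "temperature": 0.0,           # 0.0-0.2 for deterministic, reproducible judge scores
    "use_reference": True,        # comparison-based scoring when ground truth exists
}

# Once the smoke test looks right, lift the row cap for the full evaluation.
full_run = {**smoke_test, "num_rows_limit": None}
```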
API Reference
Model Evaluation API Reference
Programmatically create, list, manage, and delete model evaluation jobs in TIR.
List evaluation jobs: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/
Create an evaluation job: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/
Get evaluation job details: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/{job_id}/
Retry or terminate a job: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/{job_id}/
Delete an evaluation job: /teams/{Team_Id}/projects/{Project_Id}/evaluation/jobs/{job_id}/