Frequently Asked Questions
Dataset and Setup
Q: What dataset format does Model Evaluation support?
A: Model Evaluation accepts tabular datasets from EOS or Hugging Face. The dataset must have at least:
- An input column (e.g. questions, prompts)
- An output column (e.g. model-generated answers)
Optionally, a reference answer column can be included for comparison-based scoring.
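Assuming a CSV upload, a minimal compliant dataset can be built like this. The column names (`input`, `output`, `reference`) are illustrative, not required names — use whatever your dataset actually has:

```python
import csv

# Illustrative dataset rows: the column names are examples only.
rows = [
    {"input": "What is the capital of France?",   # input column (prompt)
     "output": "The capital of France is Paris.", # model-generated answer
     "reference": "Paris"},                       # optional reference answer
]
with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "reference"])
    writer.writeheader()
    writer.writerows(rows)
```

The `reference` column can be dropped entirely if you are not using comparison-based scoring.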
Q: Can I evaluate outputs from any model?
A: Yes. Model Evaluation is model-agnostic — it evaluates the outputs in your dataset regardless of which model generated them. You supply the outputs; the evaluator model scores them.
Q: What should I put in the dataset for each evaluation framework?
A: The dataset should contain outputs from your model that match the selected framework:
| Framework | Expected output content |
|---|---|
| Text Summarization | Generated summaries of source documents |
| General Assistant | Conversational or instruction-following responses |
| Question Answering | Answers to factual or context-based questions |
| Text Classification | Category labels or classification outputs |
Q: Can I use a private Hugging Face dataset?
A: Yes. Select Hugging Face as the dataset type and set up a Hugging Face integration with a token that has Read scope. If you don't have an integration yet, click the Click Here link in the dataset step to create one.
Q: How do I set the Num Rows Limit?
A: Enter a positive integer to evaluate only that many rows (e.g. 500). Use -1 to evaluate all rows in the dataset. Setting a limit is useful for quick validation runs before evaluating the full dataset.
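The limit semantics described above can be sketched as follows (a hypothetical helper for reasoning about the setting, not platform code):

```python
def rows_to_evaluate(total_rows, num_rows_limit):
    """-1 means evaluate everything; a positive limit caps the row count."""
    if num_rows_limit == -1:
        return total_rows
    return min(num_rows_limit, total_rows)

# A 10,000-row dataset with a limit of 500 evaluates only 500 rows.
quick_run = rows_to_evaluate(10_000, 500)
```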
Evaluator Models
Q: Which model is used as the evaluator (judge)?
A: Model Evaluation uses Llama 3.1 8B Instruct as the LLM judge. The judge scores each row of your dataset against the metrics defined by the selected evaluation framework.
Q: What does it cost to run an evaluation job?
A: Evaluation jobs are billed based on the tokens processed by the Llama 3.1 8B Instruct evaluator:
| Token type | Rate |
|---|---|
| Input | ₹54.6 per million tokens |
| Output | ₹231 per million tokens |
Longer datasets and higher Max Tokens settings increase the number of tokens processed per job.
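A back-of-the-envelope estimate from these rates (the token counts below are hypothetical; actual counts depend on your dataset size and Max Tokens setting):

```python
# Published per-million-token rates for the Llama 3.1 8B Instruct judge.
INPUT_RATE = 54.6 / 1_000_000   # ₹ per input token
OUTPUT_RATE = 231 / 1_000_000   # ₹ per output token

def estimate_cost(input_tokens, output_tokens):
    """Return the estimated evaluation job cost in ₹."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a job that processes 2M input tokens and 0.5M output tokens:
cost = estimate_cost(2_000_000, 500_000)  # 2 * 54.6 + 0.5 * 231 = ₹224.7
```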
Q: Do the evaluator model parameters (Temperature, Top-P) affect scoring consistency?
A: Yes. For deterministic and reproducible scoring:
- Use a low Temperature (e.g. 0.0–0.2) to minimize randomness in the evaluator's output
- Use a low Top-P (e.g. 0.1) to focus on high-probability tokens
Higher values introduce variability, which can be useful for exploring scoring edge cases but reduces consistency across runs.
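A deterministic scoring configuration might look like the sketch below. The parameter names follow common LLM sampling conventions; the exact field labels in the job creation form may differ:

```python
# Suggested settings for reproducible judge scoring (values from the
# guidance above, not hard requirements).
evaluator_params = {
    "temperature": 0.0,  # 0.0–0.2: minimize randomness in judge output
    "top_p": 0.1,        # restrict sampling to high-probability tokens
}
```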
Evaluation Frameworks
Q: Which evaluation framework should I use?
A: Choose based on your model's primary task:
| If your model does... | Use this framework |
|---|---|
| Document summarization | Text Summarization |
| Open-ended conversation or instruction-following | General Assistant |
| Answering questions from context or knowledge | Question Answering |
| Labeling or categorizing text | Text Classification |
Q: What does the Hallucination metric measure?
A: The Hallucination metric (available in the Text Summarization framework) measures whether the model's output contains fabricated or unsupported information not present in the source. A lower hallucination score indicates the model is staying faithful to the source content.
Q: Can I customize which metrics are evaluated?
A: No. Each framework evaluates a fixed set of 4 metrics. You can influence the evaluator's behavior by providing a Model Evaluation Prompt with additional context or instructions.
Q: What does the Model Evaluation Prompt do?
A: It provides additional context or instructions to the evaluator LLM. Use it to:
- Specify domain context (e.g. "This is a medical Q&A system")
- Emphasize specific quality criteria (e.g. "Penalize heavily for any hallucinated facts")
- Give guidance on scoring edge cases
The prompt is optional but can improve the relevance and accuracy of scores for domain-specific use cases.
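An illustrative prompt combining the points above (the wording is a suggestion, not a required format):

```python
# Example Model Evaluation Prompt for a domain-specific use case.
evaluation_prompt = (
    "This is a medical Q&A system. Answers must be consistent with the "
    "provided context. Penalize heavily for any hallucinated facts, and "
    "score partially correct answers mid-range rather than as failures."
)
```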
Results and Output
Q: Where are evaluation results stored?
A: Results are stored in the EOS bucket associated with the dataset you selected during job creation. They are written to a folder named after your job (e.g. tir-job-12181424077) at the root of the bucket.
Q: Can I download evaluation results?
A: Yes. Results are stored in your EOS bucket and can be downloaded directly from there. You can also view a summary in the Evaluation Results tab on the job detail page.
Q: How are the scores presented?
A: Each evaluated row receives scores across the 4 metrics defined by your selected framework. The Evaluation Results tab shows an aggregated view. Raw per-row scores are available in the JSON result files stored in your EOS bucket.
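The exact JSON schema of the result files is not documented here, so the sketch below assumes a simple per-row record shape with hypothetical metric names — inspect one of the files in your EOS bucket before relying on it:

```python
import json
import statistics

# Hypothetical per-row result records; real field names may differ.
raw = json.dumps([
    {"row": 0, "scores": {"relevance": 4, "coherence": 5}},
    {"row": 1, "scores": {"relevance": 3, "coherence": 4}},
])

records = json.loads(raw)
# Aggregate each metric across rows, mirroring the summary view.
metrics = {}
for rec in records:
    for name, score in rec["scores"].items():
        metrics.setdefault(name, []).append(score)
averages = {name: statistics.mean(vals) for name, vals in metrics.items()}
# averages == {"relevance": 3.5, "coherence": 4.5}
```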
Job Management
Q: Why is my evaluation job in a failed state?
A: Common causes:
- Dataset column names do not match what was specified (e.g. typo in the input column name)
- EOS dataset permissions are not configured correctly
- The Hugging Face token does not have Read access to the specified dataset
- An invalid or empty dataset was selected
Check the Logs tab for the specific error message.
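The first cause — a column-name mismatch — can be caught locally before creating the job. This is a hypothetical pre-flight check for a CSV dataset, not part of the platform:

```python
import csv

def missing_columns(path, required):
    """Return required column names absent from the CSV header."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f)))
    return required - header

# Write a dataset whose output column is misspelled, then check it.
with open("check.csv", "w", newline="") as f:
    csv.writer(f).writerow(["input", "outptu"])  # typo: "outptu"
missing = missing_columns("check.csv", {"input", "output"})
# missing == {"output"}
```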
Q: Can I re-run an evaluation with different settings?
A: Not on the same job — settings are fixed at creation time. Create a new evaluation job with the updated configuration. There is no clone feature for evaluation jobs; use the Create Job flow to start fresh.
Q: Can I terminate an evaluation job mid-run?
A: Yes. Use the Terminate action on a job that is in the Running state. Partial results up to the point of termination may or may not be written to the EOS bucket depending on when the job was stopped.
Q: Does deleting a job also delete my evaluation results?
A: No. Deleting a job removes the job record from Foundation Studio, but results already written to your EOS bucket are not deleted. Manage the result files directly in your EOS bucket.