---
title: FAQs
---

# Frequently Asked Questions

---

## Dataset and Setup

**Q: What dataset format does Model Evaluation support?**

A: Model Evaluation accepts tabular datasets from EOS or Hugging Face. The dataset must have at least:
- An **input column** (e.g. questions, prompts)
- An **output column** (e.g. model-generated answers)

Optionally, a **reference answer column** can be included for comparison-based scoring.
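
The column requirements above can be checked locally before you create a job. The following is a minimal sketch, assuming the dataset is available as CSV text; the helper `missing_columns` and the column names `question`, `answer`, and `reference` are hypothetical placeholders and are not part of Model Evaluation.

```python
import csv
import io


def missing_columns(csv_text, input_col, output_col, reference_col=None):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    required = [input_col, output_col]
    if reference_col is not None:
        required.append(reference_col)
    return [name for name in required if name not in header]


# Hypothetical sample data; replace with the contents of your own file.
sample = "question,answer\nWhat is 2+2?,4\n"
print(missing_columns(sample, "question", "answer"))
print(missing_columns(sample, "question", "answer", "reference"))
```

An empty list means every column you plan to reference exists in the header.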

---

**Q: Can I evaluate outputs from any model?**

A: Yes. Model Evaluation is model-agnostic: it evaluates the outputs in your dataset regardless of which model generated them. You supply the outputs; the evaluator model scores them.

---

**Q: What should I put in the dataset for each evaluation framework?**

A: The dataset should contain outputs from your model that match the selected framework:

| Framework | Expected output content |
|-----------|------------------------|
| Text Summarization | Generated summaries of source documents |
| General Assistant | Conversational or instruction-following responses |
| Question Answering | Answers to factual or context-based questions |
| Text Classification | Category labels or classification outputs |

---

**Q: Can I use a private Hugging Face dataset?**

A: Yes. Select **Hugging Face** as the dataset type and set up a Hugging Face integration with a token that has **Read** scope. If you don't have an integration yet, use the **Click Here** link in the dataset step to create one.

---

**Q: How do I set the Num Rows Limit?**

A: Enter a positive integer to evaluate only that many rows (e.g. `500`). Use `-1` to evaluate all rows in the dataset. Setting a limit is useful for quick validation runs before evaluating the full dataset.
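
The limit semantics described above can be sketched as follows. This is an illustration only: the function `apply_row_limit` is hypothetical, and it assumes the limit keeps the first N rows and that values other than `-1` or a positive integer are rejected, neither of which is confirmed here.

```python
from itertools import islice


def apply_row_limit(rows, num_rows_limit):
    """Mimic the documented semantics: -1 keeps all rows, a positive integer keeps that many."""
    if num_rows_limit == -1:
        return list(rows)
    if num_rows_limit <= 0:
        # Assumption: zero and other negative values are treated as invalid.
        raise ValueError("Num Rows Limit must be a positive integer or -1")
    # Assumption: the limit selects rows from the start of the dataset.
    return list(islice(rows, num_rows_limit))


rows = [{"id": i} for i in range(1000)]  # hypothetical dataset rows
print(len(apply_row_limit(rows, 500)))
print(len(apply_row_limit(rows, -1)))
```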

---

## Evaluator Models

**Q: Which model is used as the evaluator (judge)?**

A: Model Evaluation uses **Llama 3.1 8B Instruct** as the LLM judge for all supported evaluation frameworks.

---

**Q: What does it cost to run an evaluation job?**

A: Evaluation jobs are billed based on the tokens processed by the Llama 3.1 8B Instruct evaluator:

| | Rate |
|--|------|
| **Input** | ₹54.6 per million tokens |
| **Output** | ₹231 per million tokens |

Larger datasets and higher Max Tokens settings increase the number of tokens processed per job.
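
For a rough budget, the rates in the table can be applied to an estimated token count. This is a back-of-the-envelope sketch: the per-row token counts are hypothetical, and actual usage depends on your data, the evaluation prompt, and the Max Tokens setting.

```python
# Rates from the table above, in INR per million tokens.
INPUT_RATE_INR_PER_MILLION = 54.6
OUTPUT_RATE_INR_PER_MILLION = 231.0


def estimate_cost_inr(input_tokens, output_tokens):
    """Estimate the job cost in INR from total input and output token counts."""
    return (
        input_tokens * INPUT_RATE_INR_PER_MILLION
        + output_tokens * OUTPUT_RATE_INR_PER_MILLION
    ) / 1_000_000


# Hypothetical job: 500 rows, about 800 input and 200 output tokens per row.
num_rows = 500
cost = estimate_cost_inr(num_rows * 800, num_rows * 200)
print(f"Estimated cost: ₹{cost:.2f}")  # ₹44.94 for these assumed counts
```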

---

**Q: Do the evaluator model parameters (Temperature, Top-P) affect scoring consistency?**

A: Yes. For more consistent, reproducible scoring:
- Use a low **Temperature** (e.g. `0.0–0.2`) to minimize randomness in the evaluator's output
- Use a low **Top-P** (e.g. `0.1`) to focus on high-probability tokens

Higher values introduce variability, which can be useful for exploring scoring edge cases but reduces consistency across runs.
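
To see why a low Temperature reduces randomness, consider the standard temperature-scaled softmax used in LLM sampling generally. This sketch is not the evaluator's actual implementation, and the logits are made-up values for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing each by the temperature."""
    if temperature <= 0:
        # A temperature of exactly 0 is usually handled as greedy (argmax) decoding.
        raise ValueError("temperature must be greater than 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lower temperature, nearly all probability mass sits on the top-scoring token, so repeated runs are more likely to produce the same judgment.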

---

## Evaluation Frameworks

**Q: Which evaluation framework should I use?**

A: Choose based on your model's primary task:

| If your model does... | Use this framework |
|----------------------|-------------------|
| Document summarization | Text Summarization |
| Open-ended conversation or instruction-following | General Assistant |
| Answering questions from context or knowledge | Question Answering |
| Labeling or categorizing text | Text Classification |

---

**Q: What does the Hallucination metric measure?**

A: The Hallucination metric (available in the Text Summarization framework) measures whether the model's output contains fabricated or unsupported information not present in the source. A lower hallucination score indicates the model is staying faithful to the source content.

---

**Q: Can I customize which metrics are evaluated?**

A: No. Each framework evaluates a fixed set of 4 metrics. You can influence the evaluator's behavior by providing a **Model Evaluation Prompt** with additional context or instructions.

---

**Q: What does the Model Evaluation Prompt do?**

A: It provides additional context or instructions to the evaluator LLM. Use it to:
- Specify domain context (e.g. "This is a medical Q&A system")
- Emphasize specific quality criteria (e.g. "Penalize heavily for any hallucinated facts")
- Give guidance on scoring edge cases

The prompt is optional but can improve the relevance and accuracy of scores for domain-specific use cases.

---

## Results and Output

**Q: Where are evaluation results stored?**

A: Results are stored in the EOS bucket associated with the dataset you selected during job creation. They are written to a folder named after your job (e.g. `tir-job-12181424077`) at the root of the bucket.

---

**Q: Can I download evaluation results?**

A: Yes. Results are stored in your EOS bucket and can be downloaded directly from there. You can also view a summary in the **Evaluation Results** tab on the job detail page.

---

**Q: How are the scores presented?**

A: Each evaluated row receives scores across the 4 metrics defined by your selected framework. The **Evaluation Results** tab shows an aggregated view. Raw per-row scores are available in the JSON result files stored in your EOS bucket.
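
If you download the per-row JSON files, you can compute your own aggregates. The sketch below assumes a hypothetical record shape in which each row has a `scores` object mapping metric names to numbers; the real file schema may differ, so inspect a result file first. The metric name `coherence` is a placeholder.

```python
import json
from collections import defaultdict
from statistics import mean


def average_scores(rows):
    """Average each metric across rows; assumes every row has a 'scores' mapping."""
    per_metric = defaultdict(list)
    for row in rows:
        for metric, score in row["scores"].items():
            per_metric[metric].append(score)
    return {metric: mean(values) for metric, values in per_metric.items()}


# Hypothetical per-row records; real result files may use different keys.
raw = (
    '[{"scores": {"coherence": 4, "hallucination": 1}},'
    ' {"scores": {"coherence": 2, "hallucination": 3}}]'
)
print(average_scores(json.loads(raw)))
```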

---

## Job Management

**Q: Why is my evaluation job in a failed state?**

A: Common causes:
- Dataset column names do not match what was specified (e.g. typo in the input column name)
- EOS dataset permissions are not configured correctly
- The Hugging Face token does not have Read access to the specified dataset
- An invalid or empty dataset was selected

Check the **Logs** tab for the specific error message.

---

**Q: Can I re-run an evaluation with different settings?**

A: Not on the same job, because settings are fixed at creation time. Create a new evaluation job with the updated configuration. There is no clone feature for evaluation jobs; use the **Create Job** flow to start fresh.

---

**Q: Can I terminate an evaluation job mid-run?**

A: Yes. Use the **Terminate** action on a job that is in the **Running** state. Partial results up to the point of termination may or may not be written to the EOS bucket depending on when the job was stopped.

---

**Q: Does deleting a job also delete my evaluation results?**

A: No. Deleting a job removes the job record from Foundation Studio, but results already written to your EOS bucket are not deleted. Manage the result files directly in your EOS bucket.


---