Frequently Asked Questions
Dataset and Setup
Q: What dataset format does Model Evaluation support?
A: Model Evaluation accepts tabular datasets from EOS or Hugging Face. The dataset must have at least:
- An input column (e.g. questions, prompts)
- An output column (e.g. model-generated answers)
Optionally, a reference answer column can be included for comparison-based scoring.
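Assuming a CSV upload, a minimal compliant dataset can be built like this. The column names (`input`, `output`, `reference`) are illustrative, not required names — use whatever your dataset actually has:

```python
import csv

# Illustrative dataset rows: the column names are examples only.
rows = [
    {"input": "What is the capital of France?",   # input column (prompt)
     "output": "The capital of France is Paris.", # model-generated answer
     "reference": "Paris"},                       # optional reference answer
]
with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "reference"])
    writer.writeheader()
    writer.writerows(rows)
```

The `reference` column can be dropped entirely if you are not using comparison-based scoring.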
Q: Can I evaluate outputs from any model?
A: Yes. Model Evaluation is model-agnostic — it evaluates the outputs in your dataset regardless of which model generated them. You supply the outputs; the evaluator model scores them.
Q: What should I put in the dataset for each evaluation framework?
A: The dataset should contain outputs from your model that match the selected framework:
| Framework | Expected output content |
|---|---|
| Text Summarization | Generated summaries of source documents |
| General Assistant | Conversational or instruction-following responses |
| Question Answering | Answers to factual or context-based questions |
| Text Classification | Category labels or classification outputs |
Q: Can I use a private Hugging Face dataset?
A: Yes. Select Hugging Face as the dataset type and set up a Hugging Face integration with a token that has Read scope. If you don't have an integration yet, click the Click Here link in the dataset step to create one.
Q: How do I set the Num Rows Limit?
A: Enter a positive integer to evaluate only that many rows (e.g. 500). Use -1 to evaluate all rows in the dataset. Setting a limit is useful for quick validation runs before evaluating the full dataset.
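The limit semantics described above can be sketched as follows (a hypothetical helper for reasoning about the setting, not platform code):

```python
def rows_to_evaluate(total_rows, num_rows_limit):
    """-1 means evaluate everything; a positive limit caps the row count."""
    if num_rows_limit == -1:
        return total_rows
    return min(num_rows_limit, total_rows)

# A 10,000-row dataset with a limit of 500 evaluates only 500 rows.
quick_run = rows_to_evaluate(10_000, 500)
```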
Evaluator Models
Q: Which model is used as the evaluator (judge)?
A: Model Evaluation uses Llama 3.1 8B Instruct as the LLM judge. The judge scores each row of your dataset against the metrics defined by the selected evaluation framework.
Q: What does it cost to run an evaluation job?
A: Evaluation jobs are billed based on the tokens processed by the Llama 3.1 8B Instruct evaluator:
| Token type | Rate |
|---|---|
| Input | ₹54.6 per million tokens |
| Output | ₹231 per million tokens |
Longer datasets and higher Max Tokens settings increase the number of tokens processed per job.
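A back-of-the-envelope estimate from these rates (the token counts below are hypothetical; actual counts depend on your dataset size and Max Tokens setting):

```python
# Published per-million-token rates for the Llama 3.1 8B Instruct judge.
INPUT_RATE = 54.6 / 1_000_000   # ₹ per input token
OUTPUT_RATE = 231 / 1_000_000   # ₹ per output token

def estimate_cost(input_tokens, output_tokens):
    """Return the estimated evaluation job cost in ₹."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a job that processes 2M input tokens and 0.5M output tokens:
cost = estimate_cost(2_000_000, 500_000)  # 2 * 54.6 + 0.5 * 231 = ₹224.7
```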
Q: Do the evaluator model parameters (Temperature, Top-P) affect scoring consistency?
A: Yes. For deterministic and reproducible scoring:
- Use a low Temperature (e.g. 0.0–0.2) to minimize randomness in the evaluator's output
- Use a low Top-P (e.g. 0.1) to focus on high-probability tokens
Higher values introduce variability, which can be useful for exploring scoring edge cases but reduces consistency across runs.
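A deterministic scoring configuration might look like the sketch below. The parameter names follow common LLM sampling conventions; the exact field labels in the job creation form may differ:

```python
# Suggested settings for reproducible judge scoring (values from the
# guidance above, not hard requirements).
evaluator_params = {
    "temperature": 0.0,  # 0.0–0.2: minimize randomness in judge output
    "top_p": 0.1,        # restrict sampling to high-probability tokens
}
```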
Evaluation Frameworks
Q: Which evaluation framework should I use?
A: Choose based on your model's primary task:
| If your model does... | Use this framework |
|---|---|
| Document summarization | Text Summarization |
| Open-ended conversation or instruction-following | General Assistant |
| Answering questions from context or knowledge | Question Answering |
| Labeling or categorizing text | Text Classification |
Q: What does the Hallucination metric measure?
A: The Hallucination metric (available in the Text Summarization framework) measures whether the model's output contains fabricated or unsupported information not present in the source. A lower hallucination score indicates the model is staying faithful to the source content.
Q: Can I customize which metrics are evaluated?
A: No. Each framework evaluates a fixed set of 4 metrics. You can influence the evaluator's behavior by providing a Model Evaluation Prompt with additional context or instructions.
Q: What does the Model Evaluation Prompt do?
A: It provides additional context or instructions to the evaluator LLM. Use it to:
- Specify domain context (e.g. "This is a medical Q&A system")
- Emphasize specific quality criteria (e.g. "Penalize heavily for any hallucinated facts")
- Give guidance on scoring edge cases
The prompt is optional but can improve the relevance and accuracy of scores for domain-specific use cases.
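An illustrative prompt combining the points above (the wording is a suggestion, not a required format):

```python
# Example Model Evaluation Prompt for a domain-specific use case.
evaluation_prompt = (
    "This is a medical Q&A system. Answers must be consistent with the "
    "provided context. Penalize heavily for any hallucinated facts, and "
    "score partially correct answers mid-range rather than as failures."
)
```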
Results and Output
Q: Where are evaluation results stored?
A: Results are stored in the EOS bucket associated with the dataset you selected during job creation. They are written to a folder named after your job (e.g. tir-job-12181424077) at the root of the bucket.
Q: Can I download evaluation results?
A: Yes. Results are stored in your EOS bucket and can be downloaded directly from there. You can also view a summary in the Evaluation Results tab on the job detail page.
Q: How are the scores presented?
A: Each evaluated row receives scores across the 4 metrics defined by your selected framework. The Evaluation Results tab shows an aggregated view. Raw per-row scores are available in the JSON result files stored in your EOS bucket.
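The exact JSON schema of the result files is not documented here, so the sketch below assumes a simple per-row record shape with hypothetical metric names — inspect one of the files in your EOS bucket before relying on it:

```python
import json
import statistics

# Hypothetical per-row result records; real field names may differ.
raw = json.dumps([
    {"row": 0, "scores": {"relevance": 4, "coherence": 5}},
    {"row": 1, "scores": {"relevance": 3, "coherence": 4}},
])

records = json.loads(raw)
# Aggregate each metric across rows, mirroring the summary view.
metrics = {}
for rec in records:
    for name, score in rec["scores"].items():
        metrics.setdefault(name, []).append(score)
averages = {name: statistics.mean(vals) for name, vals in metrics.items()}
# averages == {"relevance": 3.5, "coherence": 4.5}
```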
Job Management
Q: Why is my evaluation job in a failed state?
A: Common causes:
- Dataset column names do not match what was specified (e.g. typo in the input column name)
- EOS dataset permissions are not configured correctly
- The Hugging Face token does not have Read access to the specified dataset
- An invalid or empty dataset was selected
Check the Logs tab for the specific error message.
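The first cause — a column-name mismatch — can be caught locally before creating the job. This is a hypothetical pre-flight check for a CSV dataset, not part of the platform:

```python
import csv

def missing_columns(path, required):
    """Return required column names absent from the CSV header."""
    with open(path, newline="") as f:
        header = set(next(csv.reader(f)))
    return required - header

# Write a dataset whose output column is misspelled, then check it.
with open("check.csv", "w", newline="") as f:
    csv.writer(f).writerow(["input", "outptu"])  # typo: "outptu"
missing = missing_columns("check.csv", {"input", "output"})
# missing == {"output"}
```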
Q: Can I re-run an evaluation with different settings?
A: Not on the same job — settings are fixed at creation time. Create a new evaluation job with the updated configuration. There is no clone feature for evaluation jobs; use the Create Job flow to start fresh.
Q: Can I terminate an evaluation job mid-run?
A: Yes. Use the Terminate action on a job that is in the Running state. Partial results up to the point of termination may or may not be written to the EOS bucket depending on when the job was stopped.
Q: Does deleting a job also delete my evaluation results?
A: No. Deleting a job removes the job record from Foundation Studio, but results already written to your EOS bucket are not deleted. Manage the result files directly in your EOS bucket.