---
title: FAQs
---

# Frequently Asked Questions

---

## Dataset and Setup

**Q: What dataset format does Model Evaluation support?**

A: Model Evaluation accepts tabular datasets from EOS or Hugging Face. The dataset must have at least:

- An **input column** (e.g. questions, prompts)
- An **output column** (e.g. model-generated answers)

Optionally, a **reference answer column** can be included for comparison-based scoring.

---

**Q: Can I evaluate outputs from any model?**

A: Yes. Model Evaluation is model-agnostic: it evaluates the outputs in your dataset regardless of which model generated them. You supply the outputs; the evaluator model scores them.

---

**Q: What should I put in the dataset for each evaluation framework?**

A: The dataset should contain outputs from your model that match the selected framework:

| Framework | Expected output content |
|-----------|------------------------|
| Text Summarization | Generated summaries of source documents |
| General Assistant | Conversational or instruction-following responses |
| Question Answering | Answers to factual or context-based questions |
| Text Classification | Category labels or classification outputs |

---

**Q: Can I use a private Hugging Face dataset?**

A: Yes. Select **Hugging Face** as the dataset type and set up a Hugging Face integration with a token that has **Read** scope. Click **Click Here** in the dataset step to create a new integration if you don't have one yet.

---

**Q: How do I set the Num Rows Limit?**

A: Enter a positive integer to evaluate only that many rows (e.g. `500`). Use `-1` to evaluate all rows in the dataset. Setting a limit is useful for quick validation runs before evaluating the full dataset.

---

## Evaluator Models

**Q: Which model is used as the evaluator (judge)?**

A: Model Evaluation uses **Llama 3.1 8B Instruct** as the LLM judge.
The same model scores outputs across all supported evaluation frameworks.

---

**Q: What does it cost to run an evaluation job?**

A: Evaluation jobs are billed based on the tokens processed by the Llama 3.1 8B Instruct evaluator:

| Token type | Rate |
|------------|------|
| **Input** | ₹54.6 per million tokens |
| **Output** | ₹231 per million tokens |

Larger datasets and higher Max Tokens settings increase the number of tokens processed per job.

---

**Q: Do the evaluator model parameters (Temperature, Top-P) affect scoring consistency?**

A: Yes. For more consistent, reproducible scoring:

- Use a low **Temperature** (e.g. `0.0–0.2`) to minimize randomness in the evaluator's output
- Use a low **Top-P** (e.g. `0.1`) to focus on high-probability tokens

Higher values introduce variability, which can be useful for exploring scoring edge cases but reduces consistency across runs.

---

## Evaluation Frameworks

**Q: Which evaluation framework should I use?**

A: Choose based on your model's primary task:

| If your model does... | Use this framework |
|----------------------|-------------------|
| Document summarization | Text Summarization |
| Open-ended conversation or instruction-following | General Assistant |
| Answering questions from context or knowledge | Question Answering |
| Labeling or categorizing text | Text Classification |

---

**Q: What does the Hallucination metric measure?**

A: The Hallucination metric (available in the Text Summarization framework) measures whether the model's output contains fabricated or unsupported information not present in the source. A lower hallucination score indicates the model is staying faithful to the source content.

---

**Q: Can I customize which metrics are evaluated?**

A: No. Each framework evaluates a fixed set of 4 metrics. You can influence the evaluator's behavior by providing a **Model Evaluation Prompt** with additional context or instructions.

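
Returning to the billing question under Evaluator Models: the per-token rates quoted there can be turned into a rough budget before you create a job. The sketch below is only an illustration. The function name and the per-row token averages are assumptions you would supply yourself, and this FAQ does not specify exactly how tokens are counted per row (for example, how much the evaluation prompt adds), so treat the result as an estimate rather than a bill.

```python
# Rates quoted in this FAQ for the Llama 3.1 8B Instruct evaluator
# (INR per million tokens).
INPUT_RATE_INR_PER_M = 54.6
OUTPUT_RATE_INR_PER_M = 231.0


def estimate_job_cost_inr(rows: int, avg_input_tokens: int, avg_output_tokens: int) -> float:
    """Rough cost estimate in INR.

    The per-row token averages are hypothetical inputs, not values
    reported by the platform.
    """
    input_tokens = rows * avg_input_tokens
    output_tokens = rows * avg_output_tokens
    cost = (
        input_tokens * INPUT_RATE_INR_PER_M
        + output_tokens * OUTPUT_RATE_INR_PER_M
    ) / 1_000_000
    return round(cost, 2)


if __name__ == "__main__":
    # Hypothetical job: 500 rows, about 800 input and 200 output tokens per row.
    print(estimate_job_cost_inr(500, 800, 200))  # → 44.94
```

Running a small job first (for example, with Num Rows Limit set to a few hundred rows) is one way to replace the assumed averages with observed ones before evaluating the full dataset.
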

---

**Q: What does the Model Evaluation Prompt do?**

A: It provides additional context or instructions to the evaluator LLM. Use it to:

- Specify domain context (e.g. "This is a medical Q&A system")
- Emphasize specific quality criteria (e.g. "Penalize heavily for any hallucinated facts")
- Give guidance on scoring edge cases

The prompt is optional but can improve the relevance and accuracy of scores for domain-specific use cases.

---

## Results and Output

**Q: Where are evaluation results stored?**

A: Results are stored in the EOS bucket associated with the dataset you selected during job creation. They are written to a folder named after your job (e.g. `tir-job-12181424077`) at the root of the bucket.

---

**Q: Can I download evaluation results?**

A: Yes. Results are stored in your EOS bucket and can be downloaded directly from there. You can also view a summary in the **Evaluation Results** tab on the job detail page.

---

**Q: How are the scores presented?**

A: Each evaluated row receives scores across the 4 metrics defined by your selected framework. The **Evaluation Results** tab shows an aggregated view. Raw per-row scores are available in the JSON result files stored in your EOS bucket.

---

## Job Management

**Q: Why is my evaluation job in a failed state?**

A: Common causes:

- Dataset column names do not match what was specified (e.g. typo in the input column name)
- EOS dataset permissions are not configured correctly
- The Hugging Face token does not have Read access to the specified dataset
- An invalid or empty dataset was selected

Check the **Logs** tab for the specific error message.

---

**Q: Can I re-run an evaluation with different settings?**

A: Not on the same job, because settings are fixed at creation time. Create a new evaluation job with the updated configuration. There is no clone feature for evaluation jobs; use the **Create Job** flow to start fresh.

---

**Q: Can I terminate an evaluation job mid-run?**

A: Yes.
Use the **Terminate** action on a job that is in the **Running** state. Partial results up to the point of termination may or may not be written to the EOS bucket depending on when the job was stopped.

---

**Q: Does deleting a job also delete my evaluation results?**

A: No. Deleting a job removes the job record from Foundation Studio, but results already written to your EOS bucket are not deleted. Manage the result files directly in your EOS bucket.

---
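
Two of the failure causes listed earlier in this section (mismatched column names and an empty dataset) can be caught before a job is created. The following is a minimal sketch, assuming you have a local CSV copy of the dataset. The CSV format, the `check_columns` helper, and the column names are illustrative assumptions and not part of Model Evaluation; the file formats the service actually accepts are not listed in this FAQ.

```python
import csv
import io
from typing import List, Optional


def check_columns(
    csv_text: str,
    input_col: str,
    output_col: str,
    reference_col: Optional[str] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # The input and output columns are required; the reference column is optional.
    required = [input_col, output_col]
    if reference_col:
        required.append(reference_col)

    problems = [f"missing column: {name!r}" for name in required if name not in header]

    # Flag a dataset that has a header but no data rows (or nothing at all).
    if not any(True for _ in reader):
        problems.append("dataset has no rows")
    return problems


if __name__ == "__main__":
    sample = "question,answer\nWhat is 2+2?,4\n"
    # A typo in the output column name is reported before any job is created.
    print(check_columns(sample, "question", "response"))
```

A check like this covers only the column-name and empty-dataset causes; permission and token problems still surface in the **Logs** tab.
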