---
title: Features
---

# Features

> Model Evaluation in Foundation Studio provides automated, framework-specific scoring of model outputs using an LLM-as-judge approach, giving you structured quality signals without manual annotation.

---

## 1. Evaluation Frameworks

Model Evaluation supports four task-specific frameworks, each scoring outputs across a fixed set of 4 quality metrics. The evaluator LLM (**Llama 3.1 8B Instruct**) scores each row in your dataset against the metrics defined by the chosen framework.

### Text Summarization

Designed for models that generate summaries of source documents.

| Metric | What it measures |
|--------|-----------------|
| **Coherence** | Logical flow and clarity of the summary |
| **Conciseness** | Brevity while retaining core meaning |
| **Hallucination** | Fabricated or unsupported information not present in the source |
| **Informativeness** | Relevance and usefulness of captured content |

### General Assistant

Designed for open-ended conversational or instruction-following models.

| Metric | What it measures |
|--------|-----------------|
| **Relevance** | Contextual appropriateness of the response |
| **Consistency** | Logical coherence and internal consistency |
| **Bias** | Skewed, unfair, or one-sided content |
| **Toxicity** | Offensive or inappropriate content |

### Question Answering

Designed for models that answer factual or context-based questions.

| Metric | What it measures |
|--------|-----------------|
| **Completeness** | Whether the answer fully addresses the question |
| **Correctness** | Factual accuracy of the answer |
| **Precision** | Specificity and exactness of the answer |
| **Toxicity** | Offensive content in the response |

### Text Classification

Designed for models that label or categorize text inputs.

| Metric | What it measures |
|--------|-----------------|
| **Accuracy** | Percentage of correctly classified examples |
| **Precision** | Ratio of true positives among predicted positives |
| **Recall** | Ability to identify all relevant instances |
| **Consistency** | Reliability across similar inputs |

---

## 2. Model Evaluation Prompt

An optional free-text prompt that provides additional context or instructions to the evaluator model. Use this to:

- Focus the evaluation on specific quality aspects
- Add domain context (e.g. "This model is used for medical Q&A")
- Instruct the evaluator to penalize specific failure modes

**Example:**

```
Please ensure that the summarization does not introduce fabricated details.
Penalize heavily for hallucinations even if the summary is otherwise concise.
```

---

## 3. Results Storage

Evaluation results are stored automatically to an EOS bucket of your choice.

- Results are written to a folder named after your job (e.g. `tir-job-12181424077`) at the root of the selected dataset bucket.
- The **Evaluation Results** tab on the job detail page provides an in-platform view of the scores.
- Download raw result files directly from your EOS bucket for further analysis.

---

## 4. Job Monitoring

### Overview

Shows job configuration, assigned resources, current status, and metadata.

### Events

Tracks pod lifecycle events: scheduling, container initialization, and termination. Useful for diagnosing startup failures.

### Logs

Real-time logs from the evaluation job. Use these to monitor progress and diagnose data loading or scoring errors.

### Evaluation Results

Displays a structured breakdown of scores across the 4 metrics for the selected framework. Each metric is scored by the evaluator model for every row evaluated.

---

## 5. Job Actions

| Action | When to use | State required |
|--------|------------|----------------|
| **Retry** | Re-run a job that ended in a failed state | Failed |
| **Terminate** | Stop a job that is currently running | Running |
| **Delete** | Remove a job and its metadata permanently | Any state |

**Retry** restarts the job with the original settings; no re-configuration is needed. **Delete** removes the job record from Foundation Studio but does not delete results stored in your EOS bucket.

---
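
The state requirements in the Job Actions table can be read as a small lookup. The sketch below is illustrative only: `ACTION_REQUIRED_STATE` and `allowed_actions` are hypothetical names, not part of any Foundation Studio API, and the state strings are taken from the table as written. `"Succeeded"` in the usage loop is an assumed example of a state other than Failed or Running.

```python
# Sketch of the action/state rules from the Job Actions table.
# Names and state strings are illustrative, not a platform API.

ACTION_REQUIRED_STATE = {
    "Retry": "Failed",       # re-run a job that ended in a failed state
    "Terminate": "Running",  # stop a job that is currently running
    "Delete": None,          # None means the action is allowed in any state
}


def allowed_actions(state: str) -> list[str]:
    """Return, in alphabetical order, the actions the table permits for a state."""
    return sorted(
        action
        for action, required in ACTION_REQUIRED_STATE.items()
        if required is None or required == state
    )


if __name__ == "__main__":
    # "Succeeded" is a hypothetical state used only to show the "Any state" case.
    for state in ("Failed", "Running", "Succeeded"):
        print(state, "->", allowed_actions(state))
```

Under these assumptions, a failed job can be retried or deleted, a running job can be terminated or deleted, and any other state permits only deletion.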