Model Evaluation
Model evaluation is the process of assessing the performance of a machine learning model to ensure it meets the required objectives. It helps determine whether the model generalizes well to unseen data and satisfies the desired business or technical goals.
How to Create a Model Evaluation Job?
To initiate the Model Evaluation process, first navigate to the sidebar section and select Foundation Studio. Once selected, a dropdown menu will appear with an option labeled Model Evaluation.
Upon clicking the Model Evaluation option, you will be directed to the Manage Evaluation Jobs page.
On the Manage Evaluation Jobs page, locate and click on the Create Job button or the Click Here button to proceed with creating a Model Evaluation Job.
You will be directed to the Input Dataset page. Here, you need to give the job a name, choose the dataset type, provide the input and output column names, and set the row limit.
- Job Name: Identify the evaluation job with a clear and descriptive name.
  - Example: tir-job-12181052011
- Input Column: Specify the column in your dataset that contains the input features for evaluation.
  - Example: A column named question.
- Output Column: Specify the column in your dataset that contains the model's predicted outputs.
  - Example: A column named answer.
- Reference Answer Column (Optional): Specify the column that contains the ground truth answers for comparison. This column is not required but is useful for benchmarking the results.
- Num Rows Limit: Limit the number of rows in the dataset used for evaluation. Use -1 for no limit (see the sanity-check sketch after this list).
- Dataset Type: Choose one of the following options.
  - EOS Dataset: Use an EOS-specific dataset.
    - Action: If you select EOS Dataset, you will need to choose a dataset from the available storage. Click on the Choose option to select the EOS Dataset, then pick the specific dataset from the list of available data and select the dataset checkpoint from which you want to evaluate. You can download a sample dataset from here.
  - Hugging Face: Use datasets from the Hugging Face library.
    - Action: If you select Hugging Face, you will need to provide a dataset from the Hugging Face library.
    - Hugging Face Dataset: Enter the name of the Hugging Face dataset you want to use, or search for a sample dataset.
    - Hugging Face Integration (Optional): If the dataset is private, select an existing integration or click Click Here to create a new one. This integration is only required when the dataset is private and needs authorization for access. If no HF token has been integrated yet, click Click Here, paste the Hugging Face token, and click Create.
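Before creating the job, it can help to sanity-check the dataset locally and confirm that the input, output, and optional reference answer columns exist and that the row limit behaves as expected. The sketch below is only illustrative: the file name eval_dataset.csv and the column names question, answer, and reference_answer are placeholders for whatever your own dataset uses.

```python
import pandas as pd

# Hypothetical evaluation dataset; adjust the path and column names to match your data.
df = pd.read_csv("eval_dataset.csv")

required_columns = {"question", "answer"}   # Input Column and Output Column
optional_columns = {"reference_answer"}     # Reference Answer Column (optional)

missing = required_columns - set(df.columns)
if missing:
    raise ValueError(f"Dataset is missing required columns: {missing}")

# Mirror the Num Rows Limit field: -1 means evaluate every row.
num_rows_limit = -1
if num_rows_limit != -1:
    df = df.head(num_rows_limit)

print(f"Rows to evaluate: {len(df)}")
print(df[["question", "answer"]].head())
```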
After providing all the information, click on the Next button to proceed to the Model Selection page. Here, you will select the Evaluator Model, and provide values for Temperature, Top-P, and Max Tokens.
- Evaluator Model: Select the LLM to act as the evaluator for your job.
  - Llama 3.1 8B Instruct: A state-of-the-art LLM from Meta, offering advanced natural language processing capabilities.
  - Llama 3.2 3B Instruct: This low-latency model is perfect for text summarization, classification, and language translation, making it ideal for mobile writing assistants and customer service apps.
- Top-P: Controls token sampling by restricting outputs to the top-p probability distribution.
  - Range: 0.001 - 1.0
  - Examples: 0.1 focuses on the most likely tokens; 1.0 includes all token options.
- Temperature: Adjusts the randomness of the output for creativity or determinism.
  - Range: 0.0 - 1.0
  - Examples: 0.2 produces deterministic outputs; 1.0 generates more creative outputs.
- Max Tokens: Limits the number of tokens in the generated output (these three settings are illustrated in the sketch after this list).
  - Examples: 512 for short responses; 1024 for detailed responses.
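For intuition on how the evaluator consumes these settings, the following sketch passes Temperature, Top-P, and Max Tokens to an OpenAI-compatible chat completions call. The base URL, API key, model name, and prompt text are placeholders, not values provided by the platform.

```python
from openai import OpenAI

# Hypothetical OpenAI-compatible endpoint and credentials; substitute the
# base URL, API key, and model name that apply to your own deployment.
client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="llama-3-1-8b-instruct",   # Evaluator Model (placeholder name)
    temperature=0.2,                 # lower values -> more deterministic output
    top_p=0.1,                       # sample only from the most likely tokens
    max_tokens=512,                  # cap the length of the generated judgement
    messages=[
        {"role": "system", "content": "You are an impartial evaluator."},
        {"role": "user", "content": "Rate the following answer for correctness: ..."},
    ],
)
print(response.choices[0].message.content)
```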
After providing all the required information, click on the Next button to go to the Framework Selection page. Here, you will select the Evaluation Framework (e.g., Text Summarization, General Assistant, Question Answering, or Text Classification) and enter a Model Evaluation Prompt.
- Evaluation Framework
  - Text Summarization: Evaluate the model's ability to summarize text effectively.
    - Coherence: Measures the logical flow and clarity of the summary.
    - Conciseness: Assesses brevity while retaining core meaning.
    - Hallucination: Identifies fabricated or false information.
    - Informativeness: Evaluates relevance and usefulness of information.
  - General Assistant: Assess the general assistant capabilities of the model.
    - Relevance: Measures contextual relevance.
    - Consistency: Evaluates logical coherence.
    - Bias: Detects skewed or unfair content.
    - Toxicity: Assesses offensive or inappropriate content.
  - Question Answering: Evaluate the model's ability to provide accurate and complete answers.
    - Completeness: Checks if the answer fully addresses the question.
    - Correctness: Assesses factual accuracy.
    - Precision: Measures specificity and exactness.
    - Toxicity: Identifies offensive content in responses.
  - Text Classification: Assess the model's classification performance.
    - Accuracy: Measures the correctness of predictions.
    - Precision: Evaluates true positives among predicted positives.
    - Recall: Assesses the ability to capture relevant instances.
    - Consistency: Checks reliability across similar inputs.
- Model Evaluation Prompt (Optional): Provide additional instructions or context for evaluation (see the sketch after this list).
  - Example: Please ensure that the summarization does not introduce fabricated details.
- Evaluation Result: This dataset will be used for storing model evaluation results. The results will be stored in a folder such as tir-job-12181424077, located at the root directory of the EOS bucket associated with the selected dataset. Select an existing dataset or click Click Here to create a new dataset.
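To see how the framework metrics and the optional Model Evaluation Prompt fit together conceptually, here is a minimal sketch of an LLM-as-a-judge style prompt. It does not reproduce the platform's internal evaluation template; the metric lists are taken from the frameworks above, while the scoring scale and wording are assumptions for illustration.

```python
# Illustrative only: a hand-rolled judge prompt, not the platform's internal template.
FRAMEWORK_METRICS = {
    "Text Summarization": ["Coherence", "Conciseness", "Hallucination", "Informativeness"],
    "Question Answering": ["Completeness", "Correctness", "Precision", "Toxicity"],
}

def build_judge_prompt(framework, question, answer, extra_instructions=""):
    """Assemble an evaluation prompt for the chosen framework (assumed 1-5 scale)."""
    metrics = ", ".join(FRAMEWORK_METRICS[framework])
    return (
        f"Evaluate the response below on these metrics: {metrics}. "
        f"Score each metric from 1 to 5 and justify each score briefly.\n"
        f"{extra_instructions}\n\n"
        f"Input: {question}\nResponse: {answer}"
    )

prompt = build_judge_prompt(
    "Text Summarization",
    question="Summarize the article about renewable energy.",
    answer="The article argues that solar costs fell sharply over the last decade.",
    extra_instructions="Please ensure that the summarization does not introduce fabricated details.",
)
print(prompt)
```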
After addressing the optional debug setting, review your selections on the summary page before submitting the job.
Model Evaluation Actions
Retry Job
If a Model Evaluation job ends up in the "Failed" state, it can be reattempted by using the Retry action. This allows the system to reprocess the job without recreating it from scratch.
Terminate Job
A Model Evaluation job in the "Running" state can be terminated at any time if required. This action allows users to stop the ongoing process, providing flexibility to manage resources or address any unexpected issues.
Delete Job
The Model Evaluation job can be deleted regardless of its current state. This ensures that users have the flexibility to remove the job at any point, whether it is in progress, failed, or completed.
Model Evaluation Details
Overview
In the Overview section, you can review the details of the Model Evaluation job.
Events
In the Events section, you can monitor recent pod activities such as scheduling and container start events.
Logs
Model Evaluation logs provide detailed information about the job, allowing users to monitor progress, diagnose issues, and optimize performance. They serve as a comprehensive record of the entire run.
Evaluation Results
The Evaluation Results section provides a detailed analysis of your model's performance based on the selected evaluation framework. The results include four key metrics tailored to the specific framework chosen during the evaluation setup.
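If you prefer to work with the results outside the console, they can also be fetched from the results folder (for example, tir-job-12181424077) at the root of the EOS bucket tied to the selected dataset. The sketch below assumes the bucket is reachable through an S3-compatible API; the endpoint URL, credentials, bucket name, and job folder are placeholders, and the result file format may differ from what your job produces.

```python
import boto3

# Hypothetical EOS (S3-compatible) connection details; replace the endpoint,
# credentials, bucket name, and job folder with your own values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

bucket = "my-eval-bucket"
prefix = "tir-job-12181424077/"   # results folder created by the evaluation job

# List the result files and download them for local inspection.
for obj in s3.list_objects_v2(Bucket=bucket, Prefix=prefix).get("Contents", []):
    key = obj["Key"]
    print("Found result file:", key)
    s3.download_file(bucket, key, key.split("/")[-1])
```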