Frequently Asked Questions


Dataset and Data Preparation

Q: Can I use a private Hugging Face dataset?

A: Yes. Select HUGGING FACE as the dataset type, then add your Hugging Face token via External Integration. The token needs Read scope. If the dataset belongs to an organization, the token must belong to an account with access to that organization's datasets.


Q: Why did my job fail with a dataset error?

A: Common causes:

  • Incorrect .jsonl format (missing required fields, invalid JSON syntax, or trailing commas)
  • Wrong dataset type for the model (e.g. Stable Diffusion format used with a text model)
  • Empty dataset file
  • EOS bucket permissions blocking access

Validate your .jsonl file locally before uploading. Each line must be valid JSON with the expected fields.
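A local pre-check like the following catches most of the failures above before you upload. This is an illustrative sketch: the `required_fields` default (`prompt`/`completion`) is an assumption — substitute whatever fields your chosen dataset format actually requires.

```python
import json

def validate_jsonl(path, required_fields=("prompt", "completion")):
    """Return a list of error strings, one per bad line.
    An empty list means every line is valid JSON with the expected fields.
    `required_fields` is a placeholder -- match it to your dataset type."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                errors.append(f"line {lineno}: empty line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            missing = [k for k in required_fields if k not in record]
            if missing:
                errors.append(f"line {lineno}: missing fields {missing}")
    return errors
```

Run it on your file before uploading; fix every reported line, including trailing commas (which make a line invalid JSON) and blank lines at the end of the file.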


Q: Where is my custom dataset stored?

A: Custom datasets are stored in EOS (E2E Object Storage) Buckets. You can manage datasets directly from the dataset selection step during job creation.


Q: How do I upload a dataset to EOS?

A: During job creation, click CHOOSE to select an existing dataset, or use the link provided in that step to create a new EOS dataset. After creating the dataset, use the UPLOAD DATASET button to add your .jsonl files.


Hugging Face Integration

Q: Why can't I access a Hugging Face model even with a valid token?

A: For gated models (e.g. Llama 3, Mistral), you must:

  1. Visit the model page on Hugging Face.
  2. Accept the model license using the same account that owns your token.
  3. Ensure the token has Read scope.

Downloads will fail regardless of token validity if the license has not been accepted.


Q: Can I use multiple Hugging Face integrations?

A: Yes. You can create multiple integrations (one per token or organization) and select the appropriate one when creating each fine-tuning job.


Q: Do I always need a Hugging Face token?

A: Only if you are using:

  • A gated model (e.g. Llama, some Mistral variants) that requires license acceptance
  • A private Hugging Face dataset

For public models and datasets that are not gated, a token is not required.


Training and Configuration

Q: Can I resume training from a checkpoint?

A: Yes. When creating a job, select Continue training from previous checkpoint and choose the repository and checkpoint to resume from. This is useful for extending training or recovering from an interrupted run.


Q: What is quantization and should I use it?

A: Quantization reduces model precision (e.g. to 4-bit) to lower GPU memory requirements during training. Use it when:

  • Your model is too large for the available GPU in full precision
  • You want to reduce training cost by using a smaller or less expensive GPU

Quantization may slightly reduce model quality compared to full-precision training.
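A back-of-envelope calculation makes the memory trade-off concrete. The sketch below counts weight storage only — optimizer states, gradients, and activations add substantially more on top, so treat these numbers as a lower bound, not a sizing guarantee.

```python
def weight_memory_gb(num_params_billion, bits):
    """Approximate memory (GB) for model weights alone at a given precision.
    Ignores optimizer states, gradients, and activations, which can
    multiply the real footprint several times over during training."""
    total_bytes = num_params_billion * 1e9 * bits / 8
    return total_bytes / 1e9

# A 7B-parameter model, weights only:
fp16_gb = weight_memory_gb(7, 16)  # ~14 GB in 16-bit precision
int4_gb = weight_memory_gb(7, 4)   # ~3.5 GB when quantized to 4-bit
```

The 4x reduction in weight memory is what lets a quantized 7B model fit on GPUs that could not hold it in full precision.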


Q: How many epochs should I train for?

A: This depends on dataset size and task. General guidelines:

Dataset size                     Suggested epochs
Small (< 1,000 samples)          3–10
Medium (1,000–50,000 samples)    1–5
Large (50,000+ samples)          1–3

Monitor validation loss in the Training Metrics tab to detect overfitting early.
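The table above can be expressed as a simple starting-point helper. This is purely illustrative — the returned range is a first guess, and the actual stopping point should always come from watching validation loss, not from dataset size alone.

```python
def suggested_epoch_range(num_samples):
    """Map dataset size to a suggested (min, max) epoch range,
    following the guideline table. Treat the result as a starting
    point and adjust based on validation loss."""
    if num_samples < 1_000:
        return (3, 10)   # small dataset
    if num_samples <= 50_000:
        return (1, 5)    # medium dataset
    return (1, 3)        # large dataset
```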


Q: What is gradient accumulation and when should I use it?

A: Gradient accumulation simulates larger batch sizes by accumulating gradients across multiple steps before updating model weights. Use it when GPU memory limits your per-step batch size: accumulating over N steps with a per-step batch size of b approximates training with an effective batch size of N × b, without the memory cost of holding the full batch at once.
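The mechanic can be sketched in a few lines of framework-free Python. `grad_fn` is a hypothetical helper standing in for a real backward pass; the key point is that gradients from each micro-batch are averaged, and the weights are updated only once per accumulation cycle.

```python
def accumulated_update(weights, micro_batches, grad_fn, lr, accum_steps):
    """One optimizer step over `accum_steps` micro-batches.
    `grad_fn(weights, batch)` is a hypothetical helper returning the
    gradient list for one micro-batch. Averaging the per-micro-batch
    gradients approximates a single large batch of accum_steps * b
    samples, while only one micro-batch is in memory at a time."""
    accum = [0.0] * len(weights)
    for batch in micro_batches[:accum_steps]:
        grads = grad_fn(weights, batch)
        # accumulate the averaged gradient; no weight update yet
        accum = [a + g / accum_steps for a, g in zip(accum, grads)]
    # single weight update after the full accumulation cycle
    return [w - lr * a for w, a in zip(weights, accum)]
```

For a loss like mean((w − x)²), two micro-batches of one sample each produce exactly the same update as one batch of two samples, which is why the technique is a drop-in substitute for a larger batch size.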


Job Management

Q: Can I change hyperparameters after a job has been created?

A: No. Hyperparameters are fixed at job creation time. Use the Clone feature to create a copy of an existing job and modify the desired parameters.


Q: How long does fine-tuning take?

A: Training time depends on model size, dataset size, GPU type, and the number of epochs configured.
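A rough estimate is possible once you know your throughput, since total work scales linearly with dataset size and epoch count. The sketch below is a back-of-envelope formula, not a guarantee; `tokens_per_sec` is an assumed input you should read off the first few minutes of a real run on your chosen GPU.

```python
def estimated_hours(num_samples, avg_tokens_per_sample, epochs, tokens_per_sec):
    """Back-of-envelope training time in hours.
    `tokens_per_sec` is the measured throughput of your specific
    model + GPU combination (an assumption here, not a fixed constant)."""
    total_tokens = num_samples * avg_tokens_per_sample * epochs
    return total_tokens / tokens_per_sec / 3600

# e.g. 10,000 samples x 512 tokens x 3 epochs at 2,000 tokens/sec
hours = estimated_hours(10_000, 512, 3, 2_000)  # ~2.1 hours
```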


Q: What does the Clone action do?

A: Clone creates a new fine-tuning job pre-filled with the same configuration as the selected job. You can modify any parameters before launching. Useful for hyperparameter experiments, different dataset versions, or continuing from a new checkpoint.


Model Output

Q: Where are the fine-tuned model files stored?

A: In the model repository associated with the job. After training completes, go to the Models tab on the job detail page to access all checkpoints and adapters.


Q: Can I deploy my fine-tuned model as an inference endpoint?

A: Yes. Navigate from the model repository to Inference → Model Endpoints and select your fine-tuned model repository as the model source. See the Inference documentation for full deployment instructions.


Q: What happens to my model if I delete the fine-tuning job?

A: Deleting a job removes the job record and metadata. The model repository and checkpoints may persist depending on your storage configuration. Verify the state of your model repository before deleting a job if you need to retain the trained weights.