Track the history of your model finetuning runs under the Models tab in the left-hand navigation bar.

Finetuning Run Page

Click on a finetuning run to see detailed information about the training run, including the datasets used for training, hyperparameter settings, and evaluation metrics.

Some details about the various sections on this page:

Availability:

Shows the availability status of the model. If training is still in progress or has failed, the model is marked as unavailable. Once training completes successfully, the model is marked as available.

Workflow Steps:

Any finetuning run will go through the following steps:

  1. Initialization: Initial setup phase for the model fine-tuning.
  2. Dataset Preparation: During this step, the training dataset is collected and prepared with the correct formatting and few-shot examples.
  3. Resource Allocation: Infrastructure (GPU) resources are provisioned for the training run.
  4. Training: During this step, the model weights are tuned using the training dataset. For most finetuning tasks, this is the step that takes the longest.
  5. Deployment: Once the training completes, the new model weights are deployed online for use in your tasks.
  6. Evaluation: During this step, the model’s output quality is evaluated on a held-out evaluation dataset.
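
You can also track these steps programmatically by polling the run until it reaches a terminal state. The sketch below is only an illustration: the endpoint path, authentication header, and response fields (`current_step`, `status`) are assumptions, not the documented Refuel API — check the API reference for the actual schema.

```python
import time
import requests

# Hypothetical endpoint and response fields -- placeholders, not the real Refuel API.
API_BASE = "https://api.example.com"          # placeholder base URL
RUN_ID = "your-finetuning-run-id"             # placeholder run identifier
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def wait_for_deployment(poll_seconds: int = 60) -> None:
    """Poll a finetuning run until it is available (or has failed)."""
    while True:
        resp = requests.get(f"{API_BASE}/finetuning-runs/{RUN_ID}", headers=HEADERS)
        resp.raise_for_status()
        run = resp.json()
        # current_step mirrors the workflow above: initialization, dataset preparation,
        # resource allocation, training, deployment, evaluation.
        print(f"step={run['current_step']} status={run['status']}")
        if run["status"] in ("available", "failed"):
            break
        time.sleep(poll_seconds)

wait_for_deployment()
```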

Training Progress:

Visual representation of the training and evaluation loss over time, showcasing the model’s learning curve and performance improvement.

Refuel automatically saves model checkpoints at regular intervals during training. At the end of the run, we select the best checkpoint (i.e., the one with the lowest loss on the validation set) for deployment.
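
To make the checkpoint-selection rule concrete, here is a minimal sketch of "pick the checkpoint with the lowest validation loss." The checkpoint records below are made-up example data, not Refuel's internal checkpoint format.

```python
# Each checkpoint records the training step and its loss on the validation set.
checkpoints = [
    {"step": 100, "val_loss": 1.92},
    {"step": 200, "val_loss": 1.41},
    {"step": 300, "val_loss": 1.37},  # lowest validation loss
    {"step": 400, "val_loss": 1.45},
]

# The checkpoint with the lowest validation loss is the one selected for deployment.
best = min(checkpoints, key=lambda ckpt: ckpt["val_loss"])
print(f"Deploying checkpoint from step {best['step']} (val_loss={best['val_loss']})")
```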

Evaluation Summary:

A graphical representation (bar chart) of the model’s evaluation metrics:

  • Accuracy: Overall fraction of correct predictions (not displayed in the graph).
  • F1 Score: The harmonic mean of precision and recall, balancing the two.
  • Precision: Indicates the fraction of instances the model identified as relevant that are actually relevant.
  • Recall: Measures the model’s ability to capture all relevant instances.

The chart also compares the performance of the fine-tuned model against a baseline model (e.g., gpt-4-1106-preview).
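
To make the metric definitions above concrete, the sketch below computes precision, recall, and F1 for a single class from predicted and ground-truth labels. It is a generic illustration of the standard formulas, not the exact procedure Refuel uses to produce the numbers shown in the chart.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall, and F1 for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive_label and t == positive_label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive_label and t != positive_label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive_label and t == positive_label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # predicted-relevant items that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # relevant items the model captured
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example with binary labels
print(precision_recall_f1(["spam", "ham", "spam", "spam"],
                          ["spam", "spam", "spam", "ham"],
                          positive_label="spam"))
```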