Training Option : Checkpoint Strategy

This section introduces the various checkpoint strategy options for training.

The checkpoint strategy controls how often the platform evaluates the model and how it decides whether to save an artifact during training for that workflow. Each evaluation point is called a checkpoint. The evaluation interval sets how many training steps pass between checks on whether to save a new artifact. For example, an evaluation interval size of 100 means that after every 100 steps, the platform checks whether to save a new artifact.
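
To make the evaluation interval concrete, here is a minimal sketch that lists the training steps at which a checkpoint would be evaluated. The function name `checkpoint_steps` and its parameters are hypothetical and are not part of Nexus; the sketch only illustrates the arithmetic described above.

```python
from typing import List


def checkpoint_steps(total_steps: int, evaluation_interval: int) -> List[int]:
    """Return the training steps at which an evaluation (checkpoint) occurs.

    Hypothetical helper for illustration only; not a Nexus API.
    """
    if evaluation_interval <= 0:
        raise ValueError("evaluation_interval must be a positive integer")
    # The first evaluation happens after `evaluation_interval` steps,
    # then again after every further `evaluation_interval` steps.
    return list(range(evaluation_interval, total_steps + 1, evaluation_interval))


if __name__ == "__main__":
    # A 1000-step training run with an evaluation interval of 250.
    print(checkpoint_steps(1000, 250))  # [250, 500, 750, 1000]
```

Under these assumptions, a 1000-step run with an interval of 250 produces four checkpoints, while an interval of 100 would produce ten.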

📘

A training step occurs each time the model weights are updated, which happens after every batch of training data is run through the model. For those more familiar with epochs, the number of training steps can be computed as: Number of Steps = Number of Epochs * (Dataset Size / Batch Size)
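
The epoch-to-step conversion in the note above can be sketched as follows. The function name `epochs_to_steps` is hypothetical, and rounding the steps per epoch up (so that a partial final batch still counts as one step) is an assumption on our part; the formula above does not say how a partial batch is handled.

```python
import math


def epochs_to_steps(num_epochs: int, dataset_size: int, batch_size: int) -> int:
    """Convert a number of epochs into a number of training steps.

    Hypothetical helper for illustration only; not a Nexus API.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    # Assumption: a partial final batch still triggers one weight update,
    # so the number of steps per epoch is rounded up.
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return num_epochs * steps_per_epoch


if __name__ == "__main__":
    # 10 epochs over 1000 images with a batch size of 8.
    print(epochs_to_steps(10, 1000, 8))  # 1250
```

With an evaluation interval size of 250, a 1250-step run like this one would be evaluated five times.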

Why do I have to change the checkpoint strategy?

A smaller evaluation interval size gives you more points during training at which you can check your model's progress. A naive strategy, such as always saving the latest checkpoint, also produces more training artifacts to choose from.

However, Datature limits the number of artifacts that you can store. To learn more about these limits, go to Plans and Pricing and Usage Quota.

To help you stay within this quota, Nexus provides other strategies that help you avoid saving artifacts that fall below the quality standard you want. Currently, the following strategies are available. The metric-based strategies compare the evaluation performance metrics of the current checkpoint with those of previous checkpoints.

| Strategy | Evaluation Interval Size | Description |
| --- | --- | --- |
| Always Saves Latest Checkpoint | 100, 250, 500, 1000, 1500, 2000, 2500, 5000 | This saves an artifact after every evaluation interval, so it is the least economical option for the artifact quota. |
| Lowest Validation Loss | 100, 250, 500, 1000, 1500, 2000, 2500, 5000 | This only saves an artifact if the artifact has a lower validation loss than the previously saved artifact. |
| Highest Accuracy | 100, 250, 500, 1000, 1500, 2000, 2500, 5000 | This only saves an artifact if the artifact has a higher accuracy than the previously saved artifact. |
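
To compare how the three strategies use the artifact quota, here is a minimal simulation. It is not the Nexus implementation: the function `saved_steps`, the strategy keys, and the metric values are all hypothetical, and we assume that the first checkpoint is always saved because there is no earlier artifact to compare against.

```python
from typing import Dict, List, Optional

# Hypothetical evaluation results, one entry per checkpoint.
EVALUATIONS: List[Dict[str, float]] = [
    {"step": 100, "val_loss": 0.90, "accuracy": 0.60},
    {"step": 200, "val_loss": 0.70, "accuracy": 0.72},
    {"step": 300, "val_loss": 0.75, "accuracy": 0.74},
    {"step": 400, "val_loss": 0.65, "accuracy": 0.73},
]


def saved_steps(evaluations: List[Dict[str, float]], strategy: str) -> List[int]:
    """Return the steps at which an artifact would be saved (illustrative only)."""
    saved: List[int] = []
    last_saved_metric: Optional[float] = None  # metric of the previously saved artifact
    for ev in evaluations:
        if strategy == "latest":
            keep = True  # save after every evaluation interval
        elif strategy == "lowest_validation_loss":
            keep = last_saved_metric is None or ev["val_loss"] < last_saved_metric
            if keep:
                last_saved_metric = ev["val_loss"]
        elif strategy == "highest_accuracy":
            keep = last_saved_metric is None or ev["accuracy"] > last_saved_metric
            if keep:
                last_saved_metric = ev["accuracy"]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        if keep:
            saved.append(int(ev["step"]))
    return saved


if __name__ == "__main__":
    for name in ("latest", "lowest_validation_loss", "highest_accuracy"):
        print(name, saved_steps(EVALUATIONS, name))
```

On this sample data, always saving the latest checkpoint stores four artifacts, while each metric-based strategy stores three, skipping the checkpoint that did not improve on the previously saved one.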

You can see the effect of your checkpoint strategy in Monitoring Training Process, which shows how often evaluations occur and the statistics at each evaluation as training progresses.

To see more ways to change your workflow, go to Training Option : Hardware Acceleration and Training Option : Advanced Evaluation.