Monitoring Training Process

In this section, we outline the training statuses and alerts that we provide, as well as other tools to help you monitor your training.

Statuses

Model training has 3 training stages and 3 possible error types, indicated by the status on the trainings page.

Training Stages

Status: Initializing

Once you start model training, the process will take several minutes to initialize. This includes steps such as setting up the instance and preprocessing the images.

Status: Initializing Example

Status: Training in Progress

Upon successful initialization, model training will commence. You can monitor training performance through the graphs on the page, explained in the following sections.

Status: Training in Progress Example

Status: Training Completed

Model training is completed. Model performance is displayed on the graphs, explained in the following sections.

Status: Training Completed Example

Training Errors

🚧

If you see Status: Error Occurred, please contact us!

Status: Out of Memory

Status: Out of Memory Example

This error occurs when the GPU has insufficient memory to support model training. Ways to prevent this error include:

  • Reducing batch size
  • Selecting a GPU with higher RAM
  • Selecting multiple GPUs
  • Choosing a smaller model
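As an illustration of the first option, the sketch below lists progressively smaller batch sizes to try after an Out of Memory error. Halving at each step is a common rule of thumb rather than something the platform does for you, and `batch_size_fallbacks` is a hypothetical helper, not part of any platform API.

```python
from __future__ import annotations


def batch_size_fallbacks(start: int, minimum: int = 1) -> list[int]:
    """Return batch sizes to try, halving from `start` down to `minimum`.

    Hypothetical helper for illustration only; halving is a rule of thumb.
    """
    if minimum < 1 or start < minimum:
        raise ValueError("require start >= minimum >= 1")
    sizes = []
    size = start
    while True:
        sizes.append(size)
        if size == minimum:
            break
        # Halve the batch size, but never go below the minimum.
        size = max(size // 2, minimum)
    return sizes


# Candidate batch sizes to enter in the training options, largest first.
print(batch_size_fallbacks(16))
```

If the smallest batch size still runs out of memory, the remaining options in the list above (a GPU with more memory, multiple GPUs, or a smaller model) are the next things to try.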

📘

In general, the default options for each model will not result in model training errors.

Status: Out of Quota

Status: Out of Quota Example

Model training uses your quota for Compute Minutes. Once you hit the quota, model training will stop, even if your model is still training. In addition, a new training will not be allowed to start if your current usage plus the compute minutes estimated for that training exceeds the quota. Check your compute minutes before starting a training and make sure your usage is not close to the quota. If a training is stopped or blocked even though your usage is not near the quota, please contact us!
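The start-up check described above is simple arithmetic, and can be sketched as follows. This is a minimal illustration under stated assumptions: `ComputeQuota` and its fields are hypothetical names, not a platform API, and the figures are made up.

```python
from dataclasses import dataclass


@dataclass
class ComputeQuota:
    """Hypothetical view of a Compute Minutes quota (not a platform API)."""

    used_minutes: float
    quota_minutes: float

    def remaining(self) -> float:
        # Minutes left before the quota is hit; never negative.
        return max(self.quota_minutes - self.used_minutes, 0.0)

    def can_start(self, estimated_minutes: float) -> bool:
        # A training is blocked when used + estimated exceeds the quota.
        return self.used_minutes + estimated_minutes <= self.quota_minutes


quota = ComputeQuota(used_minutes=450, quota_minutes=500)
print(quota.remaining())                       # minutes still available
print(quota.can_start(estimated_minutes=40))   # fits within the quota
print(quota.can_start(estimated_minutes=60))   # would exceed the quota
```

In this example, a training estimated at 60 minutes would be refused even though 50 minutes remain, which is why it helps to keep some headroom below the quota.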

Saving trained models uses your quota for Artifacts Stored. Once you hit the quota, future trained models will no longer be saved as artifacts, even if your model is still training.

Click here for information on current usage and quota.


Common Questions

How do I find out more about the data on the graphs?

If you want to better understand the metrics that are being displayed during training, go to Evaluating Model Performance.

How do I change how frequently evaluations are being made?

Go to Training Option: Checkpoint Strategy to see how to change the evaluation interval.

What if I want to change something in my workflow after training has started?

You cannot change settings for your workflow mid-training. If you are certain that the change is necessary, you should delete the training and go to your workflow to change it.

Are trainings ongoing if I close the platform?

Yes, training continues even if you leave the platform. To find the dashboard for your training again, go to the Training tab on the Project Overview sidebar. There you can see the current status of your training, when it started, the general model framework it is using, and the number and type of GPUs being used. Click the ... button on the bottom right to go to the workflow from which the training is running or to Delete Training.

Deleting Trainings Example