Video Tracking

Video Recap

  1. Select the Tracking Mode toggle to begin - This will put you in Tracking Mode.
  2. Annotate the objects that you want to track on the frame and then select Proceed - This will confirm how many objects you want to track at once.
  3. Use the video bar to select the last frame you would like to track to, then select Proceed again - This sets the number of frames for which the object tracker will create tracked annotations. While tracking is in progress, a progress bar shows how far the tracker has got and the Annotator is locked.
  4. To refine the tracked annotations, navigate to one of the predicted frames and right-click to re-annotate, which provides another ground-truth annotation for the tracker - The tracker also marks four suggested keyframes in red to indicate where it thinks the masks are likely to be most lossy.
  5. Once you are happy with the annotations and do not wish to keep refining, you can commit the annotations!

What is Video Interpolation?

Video annotation interpolation techniques use the similarity of visual features between frames to efficiently construct annotations from just a couple of manual annotations. Our tools were designed to provide annotation suggestions in other frames based on a user’s manual annotation. Because users and use cases require different levels of annotation accuracy, the tools also let users improve the quality of the predicted annotations by adding further annotations.

Broadly speaking, video interpolation techniques can be split into computer vision model-based and model-free approaches. Model-free approaches use the polygon coordinates of the manual annotations to construct a mathematical interpolation for the polygons in the frames between the start and end frames. Model-based approaches use machine-learning-based computer vision models to extract features within the manual annotations and search for similar features in the other frames to produce annotations automatically. Model-free approaches are generally computationally cheap, while model-based approaches carry some computational overhead but are much more capable of analyzing visual features for better predictions.

Model-free interpolation can be considered a practical application of polygon morphing, a very common topic in graphics. Because the goal of interpolation in our case is to produce polygons that most naturally represent how object shapes change in the view over time, we designed our interpolation tool to reduce visual anomalies and frequent, large changes in polygon shape throughout the interpolation.
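As an illustration of the model-free idea, the sketch below linearly interpolates polygon vertices between two manually annotated frames. It is a minimal example under stated assumptions, not a description of our interpolation tool: it assumes both polygons have the same number of vertices listed in corresponding order, and the names `interpolate_polygon` and `interpolate_frames` are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def interpolate_polygon(start: Polygon, end: Polygon, t: float) -> Polygon:
    """Blend two polygons vertex by vertex; t=0 gives start, t=1 gives end."""
    # Assumption: vertices are already in one-to-one correspondence.
    if len(start) != len(end):
        raise ValueError("polygons must have the same number of vertices")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    return [
        ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
        for (x0, y0), (x1, y1) in zip(start, end)
    ]


def interpolate_frames(
    start: Polygon, end: Polygon, start_frame: int, end_frame: int
) -> Dict[int, Polygon]:
    """Return one interpolated polygon per frame, both endpoints included."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must come after start_frame")
    span = end_frame - start_frame
    return {
        frame: interpolate_polygon(start, end, (frame - start_frame) / span)
        for frame in range(start_frame, end_frame + 1)
    }


# A square annotated at frame 0 and again, shifted right, at frame 4.
square_a = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
square_b = [(4.0, 0.0), (6.0, 0.0), (6.0, 2.0), (4.0, 2.0)]
frames = interpolate_frames(square_a, square_b, 0, 4)
```

Here `frames[2]` is the square halfway between the two annotations. Straight-line vertex blending like this is cheap, but it only follows roughly linear motion and shape change.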

What is Video Tracking?

Because coordinate-based interpolation has difficulty matching non-linear, atypical movements, an AI-assisted tool is a much stronger alternative: it leverages visual features rather than relying on polygon coordinate values.

Our AI-assisted tool is a video tracker that takes an initial annotation on one frame and uses a computer vision model to match the features of that annotation in other frames and reconstruct annotation masks around them. Visual features can evolve throughout the video, so users are able to re-annotate in other frames. These additional ground-truth labels provide more features, which are used in conjunction with each other to improve the annotations in the remaining frames. Notably, the tool is semantic in nature, so multiple polygons can be associated with the same class. When users are annotating hundreds of frames, it can be difficult to tell which frames should be corrected to improve the predicted annotation quality. As such, the tool also suggests keyframes for correction, chosen as the frames the model evaluates as the most lossy. The annotation process for hundreds of frames is therefore reduced to annotating a few frames and making corrections where needed or suggested.
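The keyframe suggestion step can be pictured as a simple ranking. The sketch below is illustrative only and does not describe the tracker's internals: it assumes a hypothetical per-frame loss score is available (higher meaning a less reliable mask), skips frames the user has already annotated, and returns the four worst frames, matching the four suggestions shown in the Annotator. The function name `suggest_keyframes` and the score values are made up for the example.

```python
import heapq
from typing import Dict, Iterable, List


def suggest_keyframes(
    frame_loss: Dict[int, float], annotated: Iterable[int] = (), k: int = 4
) -> List[int]:
    """Return the k frames with the highest loss, in playback order."""
    # Frames that already have a ground-truth annotation need no suggestion.
    skip = set(annotated)
    candidates = [
        (loss, frame) for frame, loss in frame_loss.items() if frame not in skip
    ]
    # nlargest compares (loss, frame) tuples, so the loss decides first.
    worst = heapq.nlargest(k, candidates)
    return sorted(frame for _, frame in worst)


# Hypothetical loss scores for a seven-frame clip; frame 0 was annotated by hand.
scores = {0: 0.0, 1: 0.1, 2: 0.5, 3: 0.2, 4: 0.9, 5: 0.4, 6: 0.7}
suggestions = suggest_keyframes(scores, annotated=[0])
```

With these scores, frames 2, 4, 5 and 6 are suggested. Re-annotating one of them and tracking again would, in this picture, produce new scores and a new set of suggestions.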

Video Tracking Capabilities

One benefit of video tracking is its ability to track any number of objects from any number of classes at once. We do not limit how many annotations you make in the first frame that you provide as input for tracking. It can therefore help with multi-class, multi-instance tracking while maintaining highly accurate mask annotation predictions.

Additionally, no AI assistance tool is perfect, so we provide the capability to continuously edit the tracker's predicted masks by inputting annotations from keyframes. This is not bounded in any way, so users can feel free to keep improving the quality in their annotation session until they are satisfied.

Common Questions

Can I input more than one frame before I start tracking?

No. Currently, video tracking only supports mask input from one frame, as the tracker adjusts the quality of the mask outputs one at a time.

Can I add an additional object on a subsequent round of tracking?

No. Make sure that the first frame you input with masks contains all of the objects that you are interested in tracking. The tracker will not be able to adjust to additional objects after the first round of inputs.