Module : Model

Model Selection and Options

We understand that your projects may have different accuracy requirements, and you may (or may not) be willing to trade accuracy for computational complexity. We therefore provide 28 model architectures (with more to come), ranging from lightweight to highly complex models. Selecting the appropriate model for your use case can help you reduce training time and computational cost while maintaining high accuracy. The impact is further detailed in Improving Model Performance.

Figure: Model Module Options

Each model also has options for training parameters.

| Option | Input | Description |
| --- | --- | --- |
| Batch Size | Any non-negative power of 2, e.g. 8 | Number of images (or other pieces of visual data) that your model trains on at each training step. Your dataset is split into batches of this size, and the model trains on one batch at a time. |
| Training Steps | Any non-negative integer; at least 500 is recommended | Number of training iterations. Each step corresponds to training on a single batch, whose size is set by Batch Size. The number of training epochs, shown in green, indicates how many times your model trains over your whole dataset. |
| Max Detections Per Class | Any non-negative integer | Upper bound on the number of instances per class that the model can predict, which limits the list of possible output predictions. |
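
As a rough illustration of how training steps relate to epochs, the sketch below assumes that one step consumes exactly one full batch and that an epoch is one complete pass over the dataset. The function name and the example dataset size of 1,000 images are hypothetical, and the epoch count displayed in the interface may be computed differently.

```python
def estimate_epochs(training_steps: int, batch_size: int, dataset_size: int) -> float:
    """Approximate number of full passes over the dataset (assumed formula)."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    # Each step is assumed to train on one batch of `batch_size` images.
    images_seen = training_steps * batch_size
    return images_seen / dataset_size


if __name__ == "__main__":
    # Hypothetical project: 1,000 images, batch size 8, 500 training steps.
    print(estimate_epochs(training_steps=500, batch_size=8, dataset_size=1000))
```

Under these assumptions, doubling the batch size while keeping the number of steps fixed doubles the number of epochs.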

Advanced Options

There are also advanced options available for tuning specific hyperparameters.

| Option | Input | Description |
| --- | --- | --- |
| Solver / Optimizer | Choice of Momentum, SGD, and Adam, depending on the model architecture | Algorithm designed to efficiently update the weights of a model during training, typically using gradient descent. |
| Learning Rate | Any real number between 0.0001 (1e-4) and 0.1 (1e-1) | Step size at which a model's weights are updated during training, which controls how quickly or slowly a model learns from its training data. Larger values may speed up training but risk non-convergence. Smaller values converge more slowly, and results may be sub-optimal if the optimization gets stuck in a local minimum. |
| Momentum | Any real number between 0 and 1 | Technique used to accelerate convergence by smoothing out the variations in the gradient updates over time. |
| Scheduler | WarmCos (more options coming soon!) | Technique used to dynamically adjust the learning rate during training. It can help to avoid issues like slow convergence, oscillations, and overshooting the optimal parameter values. |
| Checkpoint Selection | Pretrained weights by default, with previously trained valid checkpoints listed below | Allows previously trained weights of the same model type to be used as the initial weights for a new training. The new training then starts from a stronger baseline, which helps the model make smoother adjustments to the new dataset. |
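
To make the idea of a learning rate scheduler concrete, here is a minimal sketch of a warmup-plus-cosine schedule, which is what the name WarmCos suggests. The exact behaviour of the platform's scheduler is not documented here, so the linear warmup, the `warmup_steps` and `min_lr` parameters, and the function name are all assumptions made for illustration.

```python
import math


def warm_cos_lr(step: int, total_steps: int, base_lr: float,
                warmup_steps: int = 100, min_lr: float = 0.0) -> float:
    """Learning rate at `step` for an assumed linear-warmup, cosine-decay schedule."""
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from base_lr down to min_lr.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 50, 100, 550, 1000):
        print(s, warm_cos_lr(s, total_steps=1000, base_lr=0.01))
```

The small initial rate avoids large, destabilising updates at the start of training, and the gradual decay reduces overshooting near the end.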

For Video Classification Model Architectures

| Option | Input | Description |
| --- | --- | --- |
| Frame Size | Any integer between 1 and 120 | The total number of frames in the frame group that is fed to the model during each step. |
| Frame Stride | Any integer between 1 and 100 | The sampling interval of the video when creating frame groups. |
| Discard Threshold | Any real number between 0.1 and 1.0 | Frame groups with a total number of frames smaller than (Frame Size * Discard Threshold) are discarded; those above have their last frame duplicated to meet size requirements. |
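
The interaction between the three options can be sketched as follows. This is a simplified illustration rather than the platform's implementation: it assumes frames are sampled every Frame Stride frames starting from frame 0 and that frame groups do not overlap. The function name is hypothetical.

```python
def make_frame_groups(num_frames: int, frame_size: int,
                      frame_stride: int, discard_threshold: float) -> list[list[int]]:
    """Return lists of frame indices, one list per frame group."""
    # Sample every `frame_stride`-th frame of the video.
    sampled = list(range(0, num_frames, frame_stride))
    groups = []
    # Chunk the sampled frames into consecutive, non-overlapping groups.
    for start in range(0, len(sampled), frame_size):
        group = sampled[start:start + frame_size]
        # Groups that are too short are discarded.
        if len(group) < frame_size * discard_threshold:
            continue
        # Remaining short groups are padded by duplicating the last frame.
        while len(group) < frame_size:
            group.append(group[-1])
        groups.append(group)
    return groups


if __name__ == "__main__":
    # A 20-frame video, Frame Size 4, Frame Stride 2.
    print(make_frame_groups(20, 4, 2, discard_threshold=0.5))
    print(make_frame_groups(20, 4, 2, discard_threshold=0.75))
```

In this example the final group contains only two sampled frames. With a threshold of 0.5 it is kept and padded to four frames; with a threshold of 0.75 it is discarded.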

General Model Selection Tips

In general, larger dimensions at the end of a model name indicate a more robust, complex model that takes in more data and can therefore learn more complex features for prediction. If your use case needs higher accuracy and can afford more complexity, opt for larger dimensions. If compactness and quicker training and inference are more important to you, consider smaller dimensions.
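
One way to get a feel for this trade-off is to compare how many pixels each resolution feeds into the model. The helper below is a hypothetical illustration; it only does arithmetic on resolution strings such as `640x640`. Treating pixel count as a proxy for per-image compute is a rule-of-thumb assumption, not a measurement of any particular architecture.

```python
def relative_pixel_count(resolution: str, baseline: str = "640x640") -> float:
    """Ratio of input pixels at `resolution` to input pixels at `baseline`."""
    def pixels(res: str) -> int:
        width, height = (int(part) for part in res.lower().split("x"))
        return width * height

    return pixels(resolution) / pixels(baseline)


if __name__ == "__main__":
    for res in ("320x320", "640x640", "1280x1280"):
        print(res, relative_pixel_count(res))
```

Doubling the side length quadruples the number of input pixels, which is why the larger variants tend to be noticeably slower to train and run.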

Names such as ResNet, MobileNet, or InceptionV2 in a model name refer to the backbone, the part of the model that extracts image features for the rest of the model to utilise. The same trade-off applies to backbones: MobileNet is the most compact, ResNet sits in the middle, and InceptionV2 is the most complex and robust. The number after a backbone name, such as 50 in ResNet50, indicates how many layers the backbone has, so a higher number means a more complex model.

Model outputs differ based on the task that they are designed to solve. Datature currently offers models for the following tasks:

| Task | Subtype | Description | Output |
| --- | --- | --- | --- |
| Classification | Image | Classifies images with tags. | Outputs class tags. |
| Classification | Video | Classifies videos with tags. | Outputs class tags. |
| Object Detection | | Identifies objects in an image with bounding boxes and class tags. | Outputs bounding box coordinates and a class tag for each detected instance. |
| Semantic Segmentation | | Used to describe which regions of pixels correspond to specific classes. | Outputs a mask array where each pixel has a value that is associated to a class. |
| Instance Segmentation | | Used to describe which regions of pixels correspond to individual class instances. | Outputs a list of polygons with their associated class. |
| Keypoint Detection | | Describes the pose and structure of an object using groups of keypoints joined together to form a skeleton. | Outputs a list of keypoints for each object with their associated class. |
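
As an example of working with one of these outputs, the sketch below summarises a semantic segmentation mask by counting the pixels assigned to each class. The tiny mask, the class IDs, and the class names are made up for illustration, and the exact output schema returned by the platform is not specified here.

```python
from collections import Counter


def pixels_per_class(mask: list[list[int]], class_names: dict[int, str]) -> dict[str, int]:
    """Count how many pixels in a 2D mask belong to each class."""
    counts = Counter(value for row in mask for value in row)
    return {class_names[class_id]: n for class_id, n in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical 3x3 mask: each value is the class ID of that pixel.
    mask = [
        [0, 0, 1],
        [0, 1, 1],
        [2, 2, 1],
    ]
    names = {0: "background", 1: "cat", 2: "dog"}
    print(pixels_per_class(mask, names))
```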

Models

Classification

YOLOv8-CLS

YOLOv8 extends and improves upon previous versions of the YOLO family of algorithms, which are known for their real-time object detection capabilities. It builds upon the concepts of the original YOLO algorithm, aiming to improve both accuracy and speed, and incorporates advancements such as feature pyramid networks, spatial attention modules, and other architectural improvements to enhance detection performance. YOLOv8-CLS contains a classification head used for image classification tasks.

| Architecture | Resolution |
| --- | --- |
| YOLOv8-CLS Nano | 80x80, 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-CLS Small | 80x80, 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-CLS Medium | 80x80, 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-CLS Large | 80x80, 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-CLS Xtra | 80x80, 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |

MoViNet

MoViNet is a family of CNN model architectures focused on efficient video recognition and particularly suited to mobile devices. Its design prioritizes computational efficiency while maintaining high accuracy, making it well suited to tasks like real-time video analysis on smartphones and tablets.

| Architecture | Resolution |
| --- | --- |
| MoViNet A0 | 172x172 |
| MoViNet A1 | 172x172 |
| MoViNet A2 | 224x224 |
| MoViNet A3 | 256x256 |
| MoViNet A4 | 290x290 |
| MoViNet A5 | 320x320 |

Object Detection

YOLOv9 [New!]

YOLOv9 extends and improves upon previous versions of the YOLO family of algorithms, which are known for their real-time object detection capabilities. It builds upon the concepts of the original YOLO algorithm, aiming to improve both accuracy and speed, and incorporates advancements such as Programmable Gradient Information and the Generalized Efficient Layer Aggregation Network to enhance detection performance.

| Architecture | Resolution |
| --- | --- |
| YOLOv9 Compact | |
| YOLOv9 Extended | |

YOLOv8

YOLOv8 extends and improves upon previous versions of the YOLO family of algorithms, which are known for their real-time object detection capabilities. It builds upon the concepts of the original YOLO algorithm, aiming to improve both accuracy and speed, and incorporates advancements such as feature pyramid networks, spatial attention modules, and other architectural improvements to enhance detection performance.

| Architecture | Resolution |
| --- | --- |
| YOLOv8 Nano | 320x320, 640x640, 1280x1280, 2048x2048 |
| YOLOv8 Small | 320x320, 640x640, 1280x1280, 2048x2048 |
| YOLOv8 Medium | 320x320, 640x640, 1280x1280 |
| YOLOv8 Large | 320x320, 640x640, 1280x1280 |
| YOLOv8 Xtra | 320x320, 640x640, 1280x1280 |

RetinaNet

RetinaNet is a one-stage object detection model that utilises a focal loss function to address class imbalance in the training dataset. It performs well on dense and small-scale objects.
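
For readers unfamiliar with focal loss, the sketch below computes the binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), for a single prediction. The defaults alpha = 0.25 and gamma = 2 are the values commonly quoted for RetinaNet; whether the models on this platform use the same settings is an assumption.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class.
    y: ground-truth label, 1 (positive) or 0 (negative).
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-12), 1.0)
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    print("easy positive:", focal_loss(0.9, 1))
    print("hard positive:", focal_loss(0.1, 1))
```

Because confidently correct predictions contribute very little loss, the many easy background examples no longer dominate training, which is how the loss addresses class imbalance.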

| Architecture | Resolution |
| --- | --- |
| RetinaResNet50 | 640x640, 1024x1024 |
| RetinaResNet101 | 640x640, 1024x1024 |
| RetinaResNet152 | 640x640, 1024x1024 |
| Retina MobileNetV2 | 320x320, 640x640 |

FasterRCNN

Faster R-CNN introduces a Region Proposal Network (RPN) that shares convolutional features with the detection network, enabling low-cost region proposals. The RPN is then merged with Fast R-CNN (itself a single, end-to-end unified object detection network built for speed) to achieve high-quality, rapid object detection results.

| Architecture | Resolution |
| --- | --- |
| FasterRCNN ResNet50 | 640x640, 1024x1024 |
| FasterRCNN ResNet101 | 640x640, 1024x1024 |
| FasterRCNN ResNet152 | 640x640, 1024x1024 |
| FasterRCNN InceptionV2 | 640x640, 1024x1024 |

EfficientDet

EfficientDet is an object detection model that improves detection through optimizations and scalable tweaks rather than additional modules. Its main advantages are model efficiency and the ability to scale adaptively.

| Architecture | Resolution |
| --- | --- |
| EfficientDetD1 | 640x640 |
| EfficientDetD2 | 768x768 |
| EfficientDetD3 | 896x896 |
| EfficientDetD4 | 1024x1024 |
| EfficientDetD5 | 1280x1280 |
| EfficientDetD6 | 1408x1408 |
| EfficientDetD7 | 1536x1536 |

YOLOv4 [DEPRECATED]

YOLOv4 is a one-stage object detection model running on DarkNet that improves the trade-off between detection speed and accuracy.

| Architecture | Resolution |
| --- | --- |
| YOLOv4 DarkNet | 320x320, 640x640 |

YOLOX [DEPRECATED]

YOLOX is an anchor-free version of YOLO that makes several modifications to YOLOv3, resulting in a simpler design with better performance.

| Architecture | Resolution |
| --- | --- |
| YOLOX Small | 320x320, 640x640 |
| YOLOX Medium | 320x320, 640x640 |
| YOLOX Large | 320x320, 640x640 |

Semantic Segmentation

DeepLabV3 Semantic Segmentation

DeepLabv3 is a semantic segmentation architecture with improvements to handle the problem of segmenting objects at multiple scales.

| Architecture | Resolution |
| --- | --- |
| DeepLabV3 ResNet50 | 320x320, 640x640, 1024x1024, 1600x1600, 1920x1920 |
| DeepLabV3 ResNet101 | 320x320, 640x640, 1024x1024, 1600x1600, 1920x1920 |
| DeepLabV3 MobileNetV3 | 320x320, 640x640, 1024x1024, 1600x1600, 1920x1920 |

UNet Semantic Segmentation

U-Net is a semantic segmentation architecture. It consists of a contracting path and an expansive path, combining typical features from a convolutional network with a progressively upsampled feature map to improve detail.

| Architecture | Resolution |
| --- | --- |
| UNet ResNet50 | 320x320, 640x640, 1024x1024, 1600x1600, 1920x1920 |

FCN Semantic Segmentation

Fully Convolutional Network is a semantic segmentation architecture. It exclusively uses locally connected layers, such as convolution, pooling, and upsampling, and avoids the use of dense layers. This makes it faster to train and reduces parameter size.

| Architecture | Resolution |
| --- | --- |
| FCN ResNet50 | 320x320, 640x640, 960x960, 1280x1280, 1600x1600, 1920x1920 |
| FCN ResNet101 | 320x320, 640x640, 960x960, 1280x1280, 1600x1600, 1920x1920 |

Instance Segmentation

YOLOv8-SEG

YOLOv8 extends and improves upon previous versions of the YOLO family of algorithms, which are known for their real-time object detection capabilities. It builds upon the concepts of the original YOLO algorithm, aiming to improve both accuracy and speed, and incorporates advancements such as feature pyramid networks, spatial attention modules, and other architectural improvements to enhance detection performance.

| Architecture | Resolution |
| --- | --- |
| YOLOv8-SEG Nano | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-SEG Small | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-SEG Medium | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-SEG Large | 320x320, 640x640, 1280x1280, 1600x1600 |
| YOLOv8-SEG Xtra | 320x320, 640x640, 1280x1280, 1920x1920 |

MaskRCNN

MaskRCNN is Datature's instance segmentation model designed for predicting segmentation masks using RCNN as the base model.

| Architecture | Resolution |
| --- | --- |
| MaskRCNN InceptionV2 | 1024x1024 |

Keypoint Detection

YOLOv8-Pose

| Architecture | Resolution |
| --- | --- |
| YOLOv8-Pose Nano | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-Pose Small | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-Pose Medium | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-Pose Large | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |
| YOLOv8-Pose Xtra | 320x320, 640x640, 1280x1280, 1600x1600, 1920x1920 |