Tracking with Deep SORT

Ramesh Pokhrel
4 min read · Aug 1


Let’s go over some common terms used in deep-learning-based object tracking.

Object detection

The process of identifying and localizing objects of interest within an image or a video using a detection model, e.g. YOLO, a TFLite model, MobileNet, or Faster R-CNN.

I am using YOLOv7 with ncnn, a deep learning inference framework, on Android with Vulkan. YOLO is a grid-based detector: it divides the image into a grid and predicts bounding boxes and classes for each grid cell.

Kalman filter

A Kalman filter is a way to predict the next position of an object. It links one frame to the next by tracking the object's position and velocity and predicting where the object is likely to be in the next frame, which works better than plain centroid-based tracking. The concept is straightforward: it is a mathematical tool for estimating the state of a dynamic system from a series of noisy measurements over time. It also keeps an estimate of its own uncertainty. The standard filter assumes linear motion, however, which is not realistic in many real-life situations.
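To make the predict and update cycle concrete, here is a minimal sketch of a constant-velocity Kalman filter for a single coordinate, written in plain Python. The function names, the noise values q and r, and the simplified process noise (q added on the diagonal) are my own assumptions for illustration. A real tracker runs the same idea on a larger state that covers the whole bounding box.

```python
def predict(x, P, dt=1.0, q=0.01):
    """Project the state [position, velocity] and its covariance one frame ahead."""
    pos, vel = x
    x_pred = [pos + dt * vel, vel]
    # P' = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I (assumed noise model)
    p00, p01, p10, p11 = P[0][0], P[0][1], P[1][0], P[1][1]
    P_pred = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q, p01 + dt * p11],
        [p10 + dt * p11, p11 + q],
    ]
    return x_pred, P_pred


def update(x, P, z, r=1.0):
    """Correct the prediction with a measured position z (measurement noise r)."""
    y = z - x[0]                           # innovation: measurement minus prediction
    S = P[0][0] + r                        # innovation variance
    k0, k1 = P[0][0] / S, P[1][0] / S      # Kalman gain for position and velocity
    x_new = [x[0] + k0 * y, x[1] + k1 * y]
    # P = (I - K H) P with H = [1, 0]
    P_new = [
        [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
        [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
    ]
    return x_new, P_new


# An object moving 2 pixels per frame; the filter starts with no idea of the velocity.
x, P = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for frame in range(1, 21):
    x, P = predict(x, P)
    x, P = update(x, P, z=2.0 * frame)
```

After a few frames the velocity estimate settles close to the true 2 pixels per frame, so the predicted position for the next frame lands near the object even before the detector runs.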

Hungarian (Munkres assignment)

The Hungarian algorithm solves the assignment problem, which is also known as minimum-cost bipartite graph matching.

A graph G(V,E) is bipartite if the nodes can be partitioned into two independent sets A and B such that every edge in the graph connects a node in the set A and a node in the set B.

It is used to find the optimal assignment that minimizes the total cost or maximizes the total profit. We have an n×n cost matrix between the detections and the tracks. We will not use the brute-force approach here, which checks every possible pairing and computes its cost. Instead, we build a cost matrix and let the Hungarian algorithm find the best assignment.

We can define the cost using the Euclidean distance between the centers of the boxes, the IoU of the boxes (higher overlap means lower cost), or visual similarity.
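For readers who want to see the assignment step in code, below is a compact pure-Python sketch of the Hungarian method (the well-known O(n³) potentials variant). The function name and the example cost matrix are made up for illustration, with rows as tracks and columns as detections. In a real project you would more likely call an existing library routine than write this yourself.

```python
def hungarian(cost):
    """Minimum-cost assignment for a cost matrix with rows <= columns.

    Returns (assignment, total) where assignment[i] is the column chosen for row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j] = row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: augmenting path found
                break
        while True:              # flip the matches along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total


# Hypothetical costs: 3 tracks (rows) against 3 detections (columns).
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
assignment, total = hungarian(cost)
```

Here track 0 gets detection 1, track 1 gets detection 0, and track 2 gets detection 2, for a total cost of 5. Picking the cheapest cell greedily (track 1 with detection 1, cost 0) would force a worse overall result.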

Euclidean Distance

The distance between the centroid of the detected bounding box and the centroid of the track's box: d = √[(x2 − x1)² + (y2 − y1)²]

What if the object changes shape or size, for example from smaller to bigger? Centroid distance ignores the size of the box, so it fails in this case.
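As a small sketch, the centroid distance and the resulting cost matrix could be computed like this. I am assuming boxes are given as (x1, y1, x2, y2) corner coordinates, and the helper names are my own.

```python
import math


def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def centroid_distance(box_a, box_b):
    """Euclidean distance between the centers of two boxes."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(bx - ax, by - ay)


def distance_cost_matrix(tracks, detections):
    """cost[t][d] = centroid distance between track t and detection d."""
    return [[centroid_distance(t, d) for d in detections] for t in tracks]


# A box shifted by (3, 4) is 5 pixels away.
shifted = centroid_distance((0, 0, 10, 10), (3, 4, 13, 14))

# A tiny box and a huge box around the same center have distance 0,
# which shows that this cost says nothing about size or shape.
same_center = centroid_distance((4, 4, 6, 6), (0, 0, 10, 10))
```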

IOU (Intersection Over Union)

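It is also known as the Jaccard index and is widely used to evaluate detection and segmentation models. It measures the overlap between the predicted bounding box (or segmented region) and the ground truth bounding box (or actual region) of an object. In tracking, the same measure compares a track's predicted box with a newly detected box.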

IoU = (Area of Intersection) / (Area of Union)
[Figure: example boxes with IoU = 0.1]
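The formula translates directly into a few lines of Python. This is a minimal sketch that again assumes boxes in (x1, y1, x2, y2) form. Using 1 − IoU as the matching cost is a common convention that I am assuming here, so that a larger overlap gives a smaller cost.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_cost(box_a, box_b):
    """Cost for the assignment step: 0 for a perfect overlap, 1 for no overlap."""
    return 1.0 - iou(box_a, box_b)


# Two 10x10 boxes overlapping in a 5x5 corner: 25 / (100 + 100 - 25)
partial = iou((0, 0, 10, 10), (5, 5, 15, 15))
```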

Convolutional Cost (feature extraction)

It looks inside the bounding box and extracts features from it, which are then used in Deep SORT tracking. We compare the features of a detection with the features of a track and compute a cost from how similar or dissimilar they are. I am using an ONNX (Open Neural Network Exchange) feature extraction model with the ONNX Runtime framework.

ONNX Runtime is an open-source, high-performance inference engine developed by Microsoft for running models in the Open Neural Network Exchange (ONNX) format. It loads the ONNX model, initializes the inference session, takes the image crop of the detected bounding box as input, performs inference, and returns the output, here the feature vector.
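Once the feature vectors exist, the similarity check is simple. The sketch below uses cosine distance, which is a common choice for appearance features, and compares a detection against a small gallery of features stored for a track, keeping the smallest distance. The function names and the toy 3-dimensional vectors are assumptions for illustration. Real embeddings come from the feature extraction model and are much longer.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty feature as "not similar"
    return 1.0 - dot / (norm_a * norm_b)


def appearance_cost(track_gallery, detection_feature):
    """Smallest cosine distance between a detection and a track's stored features."""
    return min(cosine_distance(f, detection_feature) for f in track_gallery)


# Toy features: the detection looks almost like the second stored feature.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
detection = [0.0, 0.9, 0.1]
cost = appearance_cost(gallery, detection)
```

Keeping several recent features per track, rather than only the latest one, helps when the object briefly turns or is partly hidden.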

Simple Online and Realtime Tracking (SORT)

It is one of the early algorithms for online object tracking. It has three main steps:

1. Detection

Objects are detected by a detection model such as YOLO, a TensorFlow model, MobileNet, or Faster R-CNN. As above, I am using YOLOv7 with the ncnn deep learning framework.

2. Estimation

It uses the Kalman filter to predict the next position of each track.

3. Association

The Hungarian algorithm is used for the ID association problem. It assigns new detections to existing tracks while minimizing the global cost. The cost might be Euclidean distance or Intersection over Union (IoU).
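After the assignment is solved, a tracker still has to decide which pairs to trust. The sketch below shows one way to do that, under my own assumptions: pairs whose IoU falls below a threshold are rejected, unmatched detections become candidates for new IDs, and unmatched tracks are candidates for removal. The function name and the 0.3 threshold are illustrative, not taken from a specific implementation.

```python
def split_matches(iou_matrix, assignment, n_dets, min_iou=0.3):
    """Split an assignment into accepted matches and leftovers.

    iou_matrix[t][d] is the IoU between track t and detection d.
    assignment is a list of (track_index, detection_index) pairs.
    n_dets is the number of detections in the current frame.
    """
    n_tracks = len(iou_matrix)
    # Keep only pairs that overlap enough to be the same object
    matches = [(t, d) for t, d in assignment if iou_matrix[t][d] >= min_iou]
    matched_tracks = {t for t, _ in matches}
    matched_dets = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n_tracks) if t not in matched_tracks]
    unmatched_dets = [d for d in range(n_dets) if d not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets


# Two tracks, three detections; the second pairing overlaps too little to keep.
iou_matrix = [[0.8, 0.0, 0.1],
              [0.05, 0.2, 0.0]]
matches, lost_tracks, new_dets = split_matches(iou_matrix, [(0, 0), (1, 1)], n_dets=3)
```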

Why Deep SORT?

Deep SORT is the successor of the SORT algorithm. SORT produces too many ID switches when objects pass behind obstacles or are otherwise blocked from view. Deep SORT combines a deep learning appearance model with a linear Kalman filter to reduce the impact of common tracking challenges, including occlusion, appearance changes, and real-time processing requirements.

Deep SORT adds the convolutional cost (feature extraction) to the SORT algorithm to enhance tracking. The cost matrix is now a combination of the motion model (Kalman filter) and visual similarity (deep neural network). With this combined cost, the tracker can handle noisier and more ambiguous object detections.
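One simple way to combine the two costs is a weighted sum with optional gating, sketched below. The weight lam, the gate thresholds, and the large GATED constant are assumptions I chose for illustration, and both input matrices are expected to be precomputed with tracks as rows and detections as columns.

```python
GATED = 1e5  # large cost that effectively forbids a pairing


def combine_costs(motion, appearance, lam=0.5, max_motion=None, max_appearance=None):
    """Blend motion and appearance costs: lam * motion + (1 - lam) * appearance.

    A pairing that exceeds either gate (if given) gets the GATED cost instead.
    """
    combined = []
    for motion_row, appearance_row in zip(motion, appearance):
        row = []
        for m, a in zip(motion_row, appearance_row):
            too_far = max_motion is not None and m > max_motion
            too_different = max_appearance is not None and a > max_appearance
            if too_far or too_different:
                row.append(GATED)
            else:
                row.append(lam * m + (1.0 - lam) * a)
        combined.append(row)
    return combined


# One track against two detections (hypothetical values).
motion_cost = [[0.2, 0.9]]
appearance_cost_matrix = [[0.4, 0.1]]
cost = combine_costs(motion_cost, appearance_cost_matrix, lam=0.5, max_motion=0.8)
```

In this example the second detection looks very similar to the track but is too far from where the Kalman filter expects the object, so the gate blocks it and the first detection wins.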

Track Objects

In each frame, compare the existing Kalman tracks with the newly detected objects and solve the ID assignment, using the features from the feature extractor, so that each object keeps one ID and the tracks can be counted.

Count Objects

We can simply count the number of tracked objects in the Kalman track list. By setting a time to live (TTL) for each track, we can remove tracks that have not been matched to a detection within that period.
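The bookkeeping for tracking and counting can be sketched as a small track list with a time to live. The class and method names below are hypothetical, and the TTL is measured in frames. The matches and unmatched detections are assumed to come from the assignment step.

```python
class Track:
    def __init__(self, track_id, box):
        self.track_id = track_id
        self.box = box
        self.missed = 0  # consecutive frames without a matched detection


class TrackList:
    def __init__(self, ttl=30):
        self.ttl = ttl        # frames a track may stay unmatched before removal
        self.tracks = []
        self.next_id = 1

    def step(self, matches, unmatched_detections):
        """Update for one frame.

        matches: dict of track_id -> matched detection box
        unmatched_detections: list of boxes that matched no track
        """
        for track in self.tracks:
            if track.track_id in matches:
                track.box = matches[track.track_id]
                track.missed = 0
            else:
                track.missed += 1
        # Drop tracks whose time to live has run out
        self.tracks = [t for t in self.tracks if t.missed <= self.ttl]
        # Every unmatched detection starts a new track with a fresh ID
        for box in unmatched_detections:
            self.tracks.append(Track(self.next_id, box))
            self.next_id += 1

    def active_count(self):
        """Number of objects currently being tracked."""
        return len(self.tracks)

    def total_count(self):
        """Number of distinct IDs handed out so far."""
        return self.next_id - 1


tracker = TrackList(ttl=2)
tracker.step({}, [(0, 0, 1, 1), (5, 5, 6, 6)])   # two new objects appear
for _ in range(3):
    tracker.step({1: (1, 1, 2, 2)}, [])           # only object 1 keeps being detected
```

At the end of this run, one track is still active while the total count stays at two, because the second object was seen once and then expired after its TTL.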

Thank you :)


