Object Detection and Tracking in Android (Native C++)- Part 1

Ramesh Pokhrel
13 min read · Jun 9, 2023

This will be a series of discussions on detection and tracking in Android. I will cover various topics and points that I have encountered throughout the journey, starting from being a non-C++ developer to becoming a proficient C++ developer. I have delved into several subjects of concern and have successfully achieved my project goals.

While conducting my research, I have come across various machine learning terms and concepts.

  1. Object detection: Object detection refers to the task of identifying and localizing objects of interest within an image or a video frame.
  2. Tracking: Object tracking involves following and estimating the position of an object over multiple frames in a video. It typically relies on motion information and the appearance of the object.
  3. Bounding box: A bounding box is a rectangular box that surrounds an object of interest within an image. It is typically represented by four coordinates: the top-left corner (x1, y1) and the bottom-right corner (x2, y2).
  4. Non-maximum suppression (NMS): NMS is a post-processing technique used to eliminate redundant and overlapping bounding box detections. It helps select the most accurate and distinct bounding boxes for each object.
  5. Image classification: Image classification is the task of assigning a label or a class to an entire image. It differs from object detection, which involves localizing and classifying multiple objects within an image.
  6. Feature extraction: Feature extraction involves capturing meaningful information or features from the input data, such as images. In object detection, CNNs are commonly used to extract relevant features from images.
  7. Intersection over Union (IoU): IoU is a measure used to evaluate the accuracy of object detection algorithms. It calculates the overlap between the predicted bounding box and the ground truth bounding box. IoU is commonly used in NMS and evaluation metrics like mean Average Precision (mAP).
  8. Benchmark: It involves evaluating the performance of different algorithms, models, or systems by running them on standardized datasets and measuring their performance against established metrics. The results obtained from benchmarking help researchers and developers compare and analyze the effectiveness and efficiency of different approaches.
  9. Blob: A blob is a group of connected pixels in an image that share some common property (e.g., grayscale value).
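
To make the bounding box, IoU, and NMS terms above more concrete, here is a minimal C++ sketch. The names Box, iou, and nms are my own illustration and do not come from any library; the greedy strategy shown is one common way to do NMS, not the only one.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Axis-aligned box given by its top-left (x1, y1) and bottom-right (x2, y2) corners.
// Hypothetical type used only for this illustration.
struct Box {
    float x1, y1, x2, y2;
    float score;  // detector confidence
};

// Intersection over Union of two boxes; returns 0 when they do not overlap.
float iou(const Box& a, const Box& b) {
    const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    const float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    const float uni = areaA + areaB - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy NMS: visit boxes from highest to lowest score and keep a box only if
// it does not overlap an already kept box by more than iouThreshold.
std::vector<Box> nms(std::vector<Box> boxes, float iouThreshold) {
    assert(iouThreshold >= 0.0f && iouThreshold <= 1.0f);
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });
    std::vector<Box> kept;
    for (const Box& candidate : boxes) {
        bool suppressed = false;
        for (const Box& k : kept) {
            if (iou(candidate, k) > iouThreshold) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) {
            kept.push_back(candidate);
        }
    }
    return kept;
}
```

In a detector, nms would be called on the raw boxes of one class with a threshold such as 0.5, so that two heavily overlapping boxes for the same object collapse into the one with the higher score.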

When developers think about object detection, their initial research often revolves around Android Gradle APIs. They typically explore popular libraries such as OpenCV, TensorFlow, and Google ML Kit. In this discussion, I will highlight the preliminary findings from my research on native C++, emphasizing the general concepts that developers should be familiar with.

Git Submodule

Git submodules are a feature in Git that allows you to include another Git repository as a subdirectory within your own repository. This is particularly useful when you want to manage external dependencies as submodules to avoid bloating your repository size.

To add a submodule, you can use the following command:

git submodule add --name <name-of-the-module> <repository-url> <path>

For example, if you want to add the android-vulkan NCNN repository as a submodule within your project, you can use the command:

git submodule add --name android-vulkan https://github.com/kanxoramesh/ncnn-20230517-android-vulkan.git app/jni/ncnn-android-vulkan-2023

Java Native Interface (JNI)

The Java Native Interface (JNI) is a programming interface that allows Java code to call and be called by native applications or libraries written in languages such as C or C++. It provides a bridge between the Java virtual machine (JVM) and the native system libraries, enabling Java applications to access system-specific functionality and services that are not available in the Java platform.

The JNI enables Java programs to interact with native code in several ways, including:

  1. Invoking native functions: The JNI allows Java code to call functions written in C or C++ by mapping Java method calls to native function calls.
  2. Accessing native data structures: The JNI enables Java programs to access and manipulate data structures in native code.
  3. Calling back into Java: Native code can look up Java methods through the JNI environment and invoke them, for example to notify Java code when a certain event occurs.

Add C and C++ code to your Android project by placing the code into a cpp directory in your project module. When you build your project, this code is compiled into a native library that Gradle can package with your app. Your Java or Kotlin code can then call functions in your native library through the Java Native Interface (JNI). To learn more about using the JNI framework, read JNI tips for Android.

Android Studio supports CMake, an external build tool that works alongside Gradle to compile C and C++ code for your application and is useful for cross-platform projects. Here are the basic steps to use CMake for building Android projects:

  1. Set up the Android NDK
  2. Create a CMakeLists.txt file
  3. Configure CMake
  4. Add source files
  5. Set build options: target_compile_options() and target_link_libraries()
  6. Link with Android libraries: target_link_libraries()


cmake_minimum_required(VERSION 3.10)

set(target target_name)
project(${target} CXX)

set(ncnn_DIR ${CMAKE_SOURCE_DIR}/ncnn-android-vulkan-2023/${ANDROID_ABI}/lib/cmake/ncnn)
find_package(ncnn REQUIRED)


file(GLOB srcs *.cpp *.c)
file(GLOB hdrs *.hpp *.h)


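add_library(${target} SHARED ${srcs} ${hdrs})

find_library(log-lib log) # Logging library required by the NDK.
find_library(android-lib android) # For AssetManager functionality.
find_library(jnigraphics-lib jnigraphics) # For Bitmap access from native code.

# Link the libraries located above together with ncnn and the NDK camera/media libraries.
target_link_libraries(${target} ${ANDROID_OPENCV_COMPONENTS} ncnn camera2ndk mediandk ${log-lib} ${android-lib} ${jnigraphics-lib})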

NDK Camera

I am using the Camera2 API in combination with the NDK (Native Development Kit) in Android. The NDK allows you to write native code in C/C++, which can be more efficient than Java code in certain scenarios. By implementing camera-related functionality in native code, you can get faster image processing, lower latency, and a higher frame rate. Native code written using the NDK can also take advantage of hardware acceleration features provided by the device’s CPU or GPU.

In CMakeLists.txt, I need to add camera2ndk as a link library:

target_link_libraries(${target} ${ANDROID_OPENCV_COMPONENTS} ncnn camera2ndk mediandk)

Object Detection Models

1. YOLO (You Only Look Once)

YOLO is a convolutional neural network that is trained to detect objects in images or videos. The system divides an image or video frame into a grid of cells and predicts the presence and location of objects within each cell. This approach allows YOLO to achieve real-time performance, making it useful for applications such as autonomous vehicles, surveillance systems, and robotics. I am using yolo-tiny for mobile. It is a compressed version of YOLO designed for machines that have less computing power.
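
The grid idea can be sketched in a few lines of C++. This is only an illustration of the original single-scale formulation, where the cell containing an object's center is responsible for predicting it; later YOLO versions add anchors and multiple scales. The names CellAssignment and assignToCell are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Which grid cell "owns" an object, plus where the center sits inside that cell.
// Hypothetical type used only for this illustration.
struct CellAssignment {
    int row, col;            // grid cell that contains the object center
    float offsetX, offsetY;  // center position inside that cell (0..1 for centers inside the image)
};

// Maps an object center in pixel coordinates to a cell of a gridSize x gridSize grid.
CellAssignment assignToCell(float centerX, float centerY,
                            float imageW, float imageH, int gridSize) {
    assert(gridSize > 0 && imageW > 0.0f && imageH > 0.0f);
    // Scale pixel coordinates to grid units.
    const float gx = centerX / imageW * gridSize;
    const float gy = centerY / imageH * gridSize;
    // Clamp so that a center on the right or bottom border stays in the last cell.
    const int col = std::min(gridSize - 1, std::max(0, static_cast<int>(gx)));
    const int row = std::min(gridSize - 1, std::max(0, static_cast<int>(gy)));
    return {row, col, gx - col, gy - row};
}
```

For a 416 x 416 input and a 13 x 13 grid, each cell covers 32 x 32 pixels, so an object centered at (208, 104) falls in column 6, row 3.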

2. MobileNetSSD (MobileNet Single Shot MultiBox Detector)

It is an object detection model that combines the MobileNet architecture as a feature extractor with the Single Shot MultiBox Detector (SSD) framework for object detection. It is designed to provide a good trade-off between accuracy and efficiency, making it well-suited for deployment on resource-constrained devices such as mobile phones and embedded systems. I also tried it to test the performance.

Object Detection Frameworks

You can use either TensorFlow Lite or NCNN. I will be using NCNN for better performance.

1. TensorFlow Lite

TensorFlow Lite lets you run TensorFlow machine learning (ML) models in your Android apps.

It is a lightweight version of the TensorFlow framework designed for running machine learning models on mobile and embedded devices. It enables developers to deploy TensorFlow models on devices with limited computational resources, such as smartphones, IoT devices, and microcontrollers.

It includes two main components: the TensorFlow Lite Converter and the TensorFlow Lite Interpreter. The converter is used to convert a TensorFlow model into a format that can be deployed on mobile and embedded devices, while the interpreter is used to run the converted model on the device.

  • TensorFlow Lite runtime from Google Play services
implementation 'org.tensorflow:tensorflow-lite-task-vision-play-services:0.4.2'
  • Stand-alone TensorFlow Lite runtime environment
implementation 'org.tensorflow:tensorflow-lite-task-vision:0.4.0'

TensorFlow Interpreter

An Interpreter encapsulates a pre-trained TensorFlow Lite model, in which operations are executed for model inference.

For example, if a model takes only one input and returns only one output:

try (Interpreter interpreter = new Interpreter(file_of_a_tensorflowlite_model)) {
  interpreter.run(input, output);
}

More details can be found here.

There are some examples already available for object detection in Android provided here.

2. NCNN

NCNN is an open-source, high-performance neural network inference framework developed by Tencent and optimized for mobile platforms. It focuses on efficient implementation and maximum performance on resource-constrained devices, and it is widely used for computer vision tasks. Here’s a general overview of how NCNN works:

  1. Model Conversion: NCNN uses its own model format: a “.param” file that describes the network structure and a “.bin” file that stores the weights. To use a pre-trained model from other frameworks, such as TensorFlow or Caffe, you need to convert it to the NCNN format using the provided model conversion tools. These tools help translate the model structure, parameters, and weights into the NCNN format.
  2. Model Optimization: Once the model is converted, NCNN provides several optimization techniques to improve performance. This includes model pruning, weight quantization (reducing the precision of weights), and layer fusion (combining multiple layers into a single layer). These optimizations aim to reduce the model size and computation requirements while maintaining accuracy.
  3. Network Definition and Input Preparation: In NCNN, the network is normally not built layer by layer in C++ code. Instead, you create a network object (ncnn::Net) and load the structure from the “.param” file and the weights from the “.bin” file. NCNN supports various layer types, including convolution, pooling, fully connected, activation, and normalization layers. The input image is then converted to an ncnn::Mat and normalized before inference:
ncnn::Mat in = ncnn::Mat::from_android_bitmap_resize(env, bitmap, ncnn::Mat::PIXEL_BGR, 300, 300);
const float mean_vals[3] = {127.5f, 127.5f, 127.5f};
const float norm_vals[3] = {1.0/127.5,1.0/127.5,1.0/127.5};
in.substract_mean_normalize(mean_vals, norm_vals);
  4. Memory Management: NCNN manages memory efficiently to minimize memory consumption. It uses a memory pool mechanism where memory buffers are preallocated and reused for intermediate results during inference. This reduces the overhead of memory allocation and deallocation.
  5. Inference: Once the network is loaded and memory is allocated, you can perform inference on input data. You pass the input data to the network, and NCNN takes care of propagating it through the layers, performing computations using optimized algorithms. The output is obtained after the forward pass through the network.
ncnn::Extractor ex = yolo.create_extractor();
ex.input("in0", in_pad);
ncnn::Mat out;
ex.extract("out0", out);
  6. Performance Optimization: NCNN employs various performance optimization techniques to make efficient use of hardware resources. It leverages multi-threading to parallelize computations across multiple CPU cores. Additionally, NCNN takes advantage of SIMD (Single Instruction Multiple Data) instructions available on modern CPUs to perform operations on multiple data elements simultaneously, further improving performance. These options are set on the network before the model is loaded:

yolo.opt = ncnn::Option();
yolo.opt.use_vulkan_compute = use_gpu;
yolo.opt.num_threads = ncnn::get_big_cpu_count();
yolo.opt.blob_allocator = &blob_pool_allocator;
yolo.opt.workspace_allocator = &workspace_pool_allocator;
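
The mean and norm values passed to substract_mean_normalize earlier in this section (127.5 and 1/127.5) map pixel values from the range 0..255 to roughly -1..1, because each value is transformed as (pixel - mean) * norm. The following plain C++ sketch shows only that arithmetic. It does not use NCNN, the function name meanNormalize is hypothetical, and it assumes interleaved channel data for simplicity, which is not how ncnn::Mat stores channels internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-channel normalization of interleaved pixel data:
//   out = (in - meanVals[c]) * normVals[c]
// Hypothetical helper that mirrors the arithmetic only, not NCNN's memory layout.
std::vector<float> meanNormalize(const std::vector<float>& pixels,
                                 std::size_t channels,
                                 const std::vector<float>& meanVals,
                                 const std::vector<float>& normVals) {
    assert(channels > 0);
    assert(meanVals.size() == channels && normVals.size() == channels);
    assert(pixels.size() % channels == 0);
    std::vector<float> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        const std::size_t c = i % channels;  // channel index of this value
        out[i] = (pixels[i] - meanVals[c]) * normVals[c];
    }
    return out;
}
```

With a mean of 127.5 and a norm of 1/127.5 on every channel, a pixel value of 0 becomes about -1, 127.5 becomes 0, and 255 becomes about 1. The values have to match what the model was trained with, so they differ from model to model.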

Object Tracking Frameworks

OpenCV is a computer vision and machine learning software library. It can be used for both detection and tracking; I am using it as a tracker. However, the latest official OpenCV release does not ship extra modules such as tracking, so we need to build OpenCV together with the extra modules.

OpenCV: https://github.com/opencv/opencv

OpenCV extra modules: https://github.com/opencv/opencv_contrib

We need to set up some configuration to integrate the contrib modules into OpenCV. Please follow these guidelines: https://github.com/Mainvooid/opencv-android-sdk-with-contrib/wiki/build-opencv3.4.1-android-sdk-contrib-on-windows

My final opencv build project with opencv-contrib is here.

Now we can use tracking:

#include <string>
#include <opencv2/tracking.hpp>
#include <opencv2/tracking/tracking_legacy.hpp>

cv::Ptr<cv::Tracker> createTrackerByName() {
    const std::string trackerTypes[8] = {"BOOSTING", "MIL", "KCF", "TLD", "MEDIANFLOW", "GOTURN", "MOSSE", "CSRT"};

    // Index 4 selects MEDIANFLOW.
    const std::string trackerType = trackerTypes[4];
    cv::Ptr<cv::Tracker> tracker;
    cv::Ptr<cv::legacy::Tracker> trackerLegacy;
    bool isLegacy = false;
    if (trackerType == "BOOSTING") {
        isLegacy = true;
        trackerLegacy = cv::legacy::TrackerBoosting::create();
    }
    if (trackerType == "MIL")
        tracker = cv::TrackerMIL::create();
    if (trackerType == "KCF")
        tracker = cv::TrackerKCF::create();
    if (trackerType == "TLD") {
        isLegacy = true;
        trackerLegacy = cv::legacy::TrackerTLD::create();
    }
    if (trackerType == "MEDIANFLOW") {
        isLegacy = true;
        trackerLegacy = cv::legacy::TrackerMedianFlow::create();
    }
    if (trackerType == "GOTURN")
        tracker = cv::TrackerGOTURN::create();
    if (trackerType == "MOSSE") {
        isLegacy = true;
        trackerLegacy = cv::legacy::TrackerMOSSE::create();
    }
    if (trackerType == "CSRT")
        tracker = cv::TrackerCSRT::create();

    if (isLegacy) {
        // Wrap the legacy tracker so that it exposes the current cv::Tracker interface.
        tracker = cv::legacy::upgradeTrackingAPI(trackerLegacy);
    }
    return tracker;
}

// Usage: create the tracker and initialize it with the first frame and bounding box.
auto currentTracker = createTrackerByName();
currentTracker->init(frame, bbox);

OpenCV provides several different trackers.

1. BOOSTING

Boosting is an algorithm used for object detection and recognition. It combines a set of weak classifiers to obtain a strong classifier. The algorithm trains the classifiers in a sequential manner, and each classifier is trained to focus on the misclassified samples of the previous classifier. Boosting has been widely used in face detection and pedestrian detection applications.

Performance: Boosting has been shown to achieve high accuracy rates in object detection and recognition tasks. However, it is computationally expensive and may not be suitable for real-time applications. In my tests it ran at only 3 to 4 fps, which noticeably hurts the app's performance.

2. MIL (Multiple Instance Learning)

MIL is a technique used in object tracking where an object is represented by a set of instances rather than a single instance. This technique is useful when the object undergoes occlusion or partial occlusion.

Performance: MIL has been shown to be effective in tracking objects in challenging scenarios, such as occlusion and appearance changes. However, like BOOSTING, it is computationally expensive and may not be suitable for real-time applications. In my tests it ran at only 3 to 4 fps.

3. KCF (Kernelized Correlation Filter)

KCF is an object tracking algorithm that uses correlation filters and a kernel function to track objects. It is a popular algorithm due to its high speed and accuracy.

Performance: KCF has been shown to be highly accurate and fast, making it suitable for real-time object tracking applications. It has been used in various applications, including human tracking and vehicle tracking.

4. TLD (Tracking-Learning-Detection)

TLD is an object tracking algorithm that combines tracking, learning, and detection to track an object. It is useful when the object undergoes drastic appearance changes.

Performance: TLD has been shown to be effective in tracking objects with significant appearance changes. However, it may not perform well in scenarios with heavy occlusion or when the object moves out of the frame.


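5. MedianFlow

MedianFlow is a tracking algorithm that follows a set of points inside the bounding box forward and then backward in time, measures the discrepancy between the two trajectories (the forward-backward error), and moves the box by the median displacement of the most reliable points. It is a popular algorithm due to its simplicity and robustness.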

Performance: MedianFlow is a fast and efficient tracking algorithm that works best when the motion is small and predictable. However, it may not perform well in scenarios with fast or abrupt motion, heavy camera motion, or when the object is occluded.
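
As a rough illustration of where the "median" in MedianFlow comes from, the sketch below estimates the box shift from a set of already tracked points. It assumes the usual formulation in which each point has a displacement and a forward-backward error, the less reliable half of the points is discarded, and the median displacement of the rest moves the box. The names TrackedPoint, Shift, median, and estimateShift are hypothetical; this is not OpenCV's implementation, and it leaves out the optical-flow step and the scale estimate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One point tracked between two consecutive frames (hypothetical type).
struct TrackedPoint {
    float dx, dy;   // displacement of the point between the two frames
    float fbError;  // forward-backward error: how far the point lands from its
                    // start when tracked forward and then backward again
};

struct Shift {
    float dx, dy;
};

// Median of a non-empty list (upper median for even sizes).
float median(std::vector<float> values) {
    assert(!values.empty());
    const std::size_t mid = values.size() / 2;
    std::nth_element(values.begin(),
                     values.begin() + static_cast<std::ptrdiff_t>(mid),
                     values.end());
    return values[mid];
}

// Keeps the points whose forward-backward error is at most the median error,
// then returns the median displacement of the points that remain.
Shift estimateShift(const std::vector<TrackedPoint>& points) {
    assert(!points.empty());
    std::vector<float> errors;
    for (const TrackedPoint& p : points) {
        errors.push_back(p.fbError);
    }
    const float errorLimit = median(errors);

    std::vector<float> dxs, dys;
    for (const TrackedPoint& p : points) {
        if (p.fbError <= errorLimit) {
            dxs.push_back(p.dx);
            dys.push_back(p.dy);
        }
    }
    return {median(dxs), median(dys)};
}
```

Because the median ignores outliers, a few badly tracked points do not drag the box away, which is one reason the approach is robust when motion is smooth.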

6. GOTURN (Generic Object Tracking Using Regression Networks)

GOTURN is a deep learning-based object tracking algorithm that uses a convolutional neural network to learn the object features and track it in real-time.

Performance: GOTURN has been shown to achieve strong results in object tracking tasks. It can handle appearance changes and scale variations. However, it is trained offline on a large amount of data, so it needs its pretrained model files at runtime, and it may not perform well in scenarios with heavy camera motion.

7. MOSSE (Minimum Output Sum of Squared Error)

MOSSE is a correlation-based object tracking algorithm that uses adaptive correlation filters to track an object. It is a popular algorithm due to its high speed and efficiency.

Performance: MOSSE is a very fast and efficient tracking algorithm. However, it is generally less accurate than KCF or CSRT, and it may not perform well in scenarios with heavy camera motion or when the object is occluded.

8. CSRT (Channel and Spatial Reliability Tracking)

CSRT is an object tracking algorithm that uses a combination of correlation filters and spatial reliability to track an object. It is an extension of the KCF algorithm and provides better accuracy and robustness.

Performance: CSRT has been shown to achieve high accuracy rates in object tracking tasks, especially in scenarios with heavy occlusion and appearance changes. It is also faster than some other state-of-the-art algorithms. However, it may require more computational resources than other algorithms.

We are using the MEDIANFLOW tracker in this sample project. It seems to be the most convenient tracker for the app.

Centroid Based Tracker

Centroid-based object tracking is a technique used in computer vision and image processing to track the movement of an object within a video stream. It involves calculating the centroid, or center of mass, of an object in each frame of the video and using that information to track its movement over time.

To use centroid-based object tracking, the first step is to identify the object of interest within the video stream. Once the object is identified, the centroid of the object is calculated by finding the average x and y coordinates of all the pixels that make up the object.

In subsequent frames of the video, the centroid of the object is calculated again, and the movement of the object is tracked by comparing the centroid of the current frame to the centroid of the previous frame. By analyzing the change in the position of the centroid over time, the direction and speed of the object’s movement can be determined.

  • Detect objects in the frame at time t-1; each detection is a bounding box. Assign a unique ID to each object.
  • Calculate the centroid of each object detected in the frame at time t-1.
  • Detect objects, again as bounding boxes, in the frame at time t.
  • Calculate the centroid of each object detected in the frame at time t.
  • Calculate the Euclidean distance between the centroids of all the objects detected in frames t-1 and t.
  • If the distance between a centroid at time t-1 and a centroid at time t is less than the threshold, it is the same object in motion. Hence, keep the existing object ID and update the bounding box coordinates of the object to the new bounding box value.
  • If a centroid at time t is farther than the threshold from every centroid at time t-1, treat it as a new object and assign it a new object ID.
  • When an object from the previous frame cannot be matched to any detection in the current frame, remove its object ID from tracking.
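
The steps above can be sketched as a small C++ class. This is a simplified illustration, not the code used in the sample project: the names Rect, Point, centroidOf, and CentroidTracker are hypothetical, matching is greedy (each tracked object takes its nearest unused detection), and an unmatched object is dropped immediately, whereas practical trackers often keep it alive for a few frames.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct Rect { float x1, y1, x2, y2; };  // top-left and bottom-right corners
struct Point { float x, y; };

// Center of a bounding box.
Point centroidOf(const Rect& r) {
    return {(r.x1 + r.x2) * 0.5f, (r.y1 + r.y2) * 0.5f};
}

class CentroidTracker {
public:
    explicit CentroidTracker(float maxDistance) : maxDistance_(maxDistance) {
        assert(maxDistance > 0.0f);
    }

    // Takes the detections of one frame and returns the tracked objects (ID -> box).
    const std::map<int, Rect>& update(const std::vector<Rect>& detections) {
        std::map<int, Rect> next;
        std::vector<bool> used(detections.size(), false);

        // Match every known object to its nearest unused detection within the threshold.
        for (const auto& entry : objects_) {
            const Point prev = centroidOf(entry.second);
            int best = -1;
            float bestDist = maxDistance_;
            for (std::size_t i = 0; i < detections.size(); ++i) {
                if (used[i]) continue;
                const Point cur = centroidOf(detections[i]);
                const float d = std::hypot(cur.x - prev.x, cur.y - prev.y);
                if (d < bestDist) {
                    bestDist = d;
                    best = static_cast<int>(i);
                }
            }
            if (best >= 0) {
                used[static_cast<std::size_t>(best)] = true;
                next[entry.first] = detections[static_cast<std::size_t>(best)];  // same ID, new box
            }
            // No match: the object is not copied to `next`, so its ID is removed.
        }

        // Every detection that was not matched becomes a new object with a new ID.
        for (std::size_t i = 0; i < detections.size(); ++i) {
            if (!used[i]) {
                next[nextId_++] = detections[i];
            }
        }

        objects_ = std::move(next);
        return objects_;
    }

private:
    float maxDistance_;
    int nextId_ = 0;
    std::map<int, Rect> objects_;
};
```

In use, update would be called once per frame with the detector's boxes, and the returned map gives the stable ID for each box. The threshold depends on frame rate and object speed, so it needs tuning per application.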

Some drawbacks of the centroid tracker:

  • Occlusion: when an object is hidden behind something else, such as a person standing behind a lamp post, it is not detected and its track is lost.
  • ID switching: when two similar objects overlap or blend, their identities can be switched.
  • Missed detection: when the detector (for example, YOLOv4) fails to detect an object, for instance because of background distortion, the tracker has nothing to match the object against.

This section focuses primarily on the theoretical aspects that are essential to understand before diving into object detection and tracking. In Part II, I will delve into the practical implementation techniques, covering various approaches and methods.