COMPARATIVE STUDY OF LIGHTWEIGHT OBJECT DETECTION MODELS FOR MOBILE PEDESTRIAN DETECTION
ABSTRACT
Pedestrian detection is a key task in computer vision that supports road safety, especially in smart mobility and driver-assistance systems. Running such systems on mobile or low-power devices is challenging due to limited resources. This paper presents a comparative study of three lightweight object detection models (YOLOv8n, MobileNet-SSD, and NanoDet) for real-time pedestrian detection on mobile platforms. The models are evaluated on the Caltech Pedestrian Dataset, which contains varied real-world pedestrian images. Performance is measured by mean Average Precision (mAP), inference speed (FPS), model size, and error rate. The goal is to analyze the trade-offs between accuracy, speed, and efficiency, and to identify the most suitable model for edge devices. The study offers useful insights for researchers and developers building pedestrian detection systems in real-time, resource-constrained environments such as smart vehicles, surveillance, and robotics.
Keywords: pedestrian detection, lightweight neural networks, YOLOv8n, MobileNet-SSD, NanoDet, real-time inference.
Introduction
Pedestrian safety remains a serious problem in modern transportation systems. According to the World Health Organization (WHO), more than 270,000 pedestrians die on roads worldwide every year, about 22% of all traffic deaths [1]. The risk is even higher in cities, especially in low- and middle-income countries, often because of poor infrastructure and a large number of people on foot. Detecting pedestrians early is therefore critical for road safety, particularly in intelligent transportation systems (ITS), advanced driver-assistance systems (ADAS), and mobile safety apps.
In recent years, computer vision and deep learning have advanced considerably, making object detection far more accurate. Well-known models such as Faster R-CNN, YOLO (You Only Look Once), and SSD (Single Shot MultiBox Detector) have shown very good results in detecting pedestrians [2]. However, deploying these models on mobile devices remains difficult, because mobile systems often have limited memory and processing power while requiring real-time performance.
To address this problem, lightweight object detection models have been developed that aim to balance accuracy and speed. Popular examples include MobileNet-SSD, NanoDet, and YOLOv8n, the nano variant of YOLOv8, the latest version of the YOLO family. These models are often deployed on devices with limited resources. Although many papers discuss improving such models or applying them to pedestrian detection, few compare them directly under the same conditions, especially for mobile use [2; 3].
This paper fills that gap by comparing YOLOv8n, MobileNet-SSD, and NanoDet on the Caltech Pedestrian Dataset. We examine key factors such as mean Average Precision (mAP), frames per second (FPS), model size, and detection errors, with the goal of understanding the trade-offs between speed, size, and accuracy. The results can help practitioners choose the best lightweight model for real-time pedestrian detection on mobile devices and can inform future work on safety applications in real-world systems.
Pedestrian detection has been an important part of computer vision for many years, with applications in self-driving cars, smart city systems, and real-time safety tools. Early methods relied on hand-designed features such as Histograms of Oriented Gradients (HOG) and Haar cascades, but these often struggled in complex scenes, especially under poor lighting or partial occlusion. With the rise of deep learning, convolutional neural networks (CNNs) greatly improved the accuracy and reliability of object detection [4; 5].
Later, models such as Faster R-CNN, SSD, and YOLO marked a major shift by using learned features instead of hand-designed ones. SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) became popular because they offer a good balance between accuracy and speed, which makes them useful for real-time applications. One example is MobileNet-SSD, which was designed for small devices: it uses depthwise separable convolutions to lower computation costs, has been applied to traffic and pedestrian detection, and works well with TensorFlow Lite for mobile use [6].
As the demand for real-time, on-device detection grew, newer lightweight models appeared. NanoDet is one of them: an anchor-free model that runs fast and uses little memory, which suits edge devices. YOLO also continued to improve across several versions; the latest, YOLOv8, introduced features such as better feature fusion and flexible design options, and its smallest variant, YOLOv8n, is well-suited for mobile deployment [2; 12].
Some researchers have worked on improving YOLOv8 for pedestrian detection. One paper added RepViT modules and a new detection head to improve detection in crowded scenes; another used FasterNet as the backbone and applied Group Normalization to make the model both faster and more accurate [7; 8].
There is also an older system based on YOLOv3, built with TensorFlow and designed for in-vehicle use, which showed that such models can run on real mobile devices [9]. Although it is older, the idea of deploying detectors in real-time, low-power systems remains relevant.
Many of these studies focus on a single model and do not compare different lightweight models side by side. One study examined general-purpose models for pedestrian detection and pointed out differences in accuracy and speed [3], but it did not test the models in mobile environments and did not include newer models such as YOLOv8n or NanoDet. A recent paper evaluated YOLOv8 for pedestrian detection and reported good accuracy and speed, but did not compare it with other models [10].
In short, many papers have improved lightweight models for pedestrian detection, but fair comparisons focused on mobile use are still lacking. This study fills that gap by comparing YOLOv8n, MobileNet-SSD, and NanoDet on the Caltech Pedestrian Dataset, focusing on the metrics that matter for mobile systems: mAP, speed, and model size.
Materials and methods
Dataset. To benchmark lightweight object detection models in real-world pedestrian detection scenarios, we utilize the Caltech Pedestrian Dataset [11]. This dataset is a well-known benchmark for pedestrian detection in urban environments, collected with a vehicle-mounted camera at 640×480 resolution and 30 frames per second. It includes approximately 10 hours of video, 250,000 frames, and over 350,000 labeled pedestrian bounding boxes. The annotations also include temporal correspondence and occlusion indicators, making the dataset suitable for evaluating models under challenging real-world conditions such as motion blur, crowd density, partial occlusion, and varying lighting.
For our experiment, we specifically extract frames from the test portion of the dataset, which corresponds to sets 06 through 10. We sample every 30th frame, yielding a total of 1174 images in .jpg format. This sampling approach ensures temporal diversity in pedestrian poses and densities while keeping the dataset lightweight enough for repeated benchmarking runs on CPU.
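For reproducibility, the sketch below illustrates this sampling step in Python. The directory layout and file names are assumptions for illustration; it presumes the Caltech .seq videos for sets 06-10 have already been decoded into per-frame .jpg files.

```python
from pathlib import Path
import shutil

# Illustrative sketch of the sampling step, assuming the Caltech test videos
# (sets 06-10) have already been decoded into per-frame .jpg files laid out
# as frames/setXX/VYYY/NNNN.jpg. Paths and naming are assumptions, not the
# exact structure of our extraction pipeline.
SRC = Path("frames")        # decoded frames
DST = Path("eval_frames")   # destination for the evaluation subset
DST.mkdir(exist_ok=True)

kept = 0
for set_dir in sorted(SRC.glob("set0[6-9]")) + sorted(SRC.glob("set10")):
    for video_dir in sorted(p for p in set_dir.iterdir() if p.is_dir()):
        for frame in sorted(video_dir.glob("*.jpg"))[::30]:  # every 30th frame
            shutil.copy(frame, DST / f"{set_dir.name}_{video_dir.name}_{frame.name}")
            kept += 1
print(f"Sampled {kept} frames")
```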
Model selection and configuration. In this study, we aim to evaluate and compare the performance of lightweight object detection models suitable for deployment in mobile or resource-constrained environments. The goal is to balance detection accuracy, inference speed, and model size when applied to pedestrian detection in realistic traffic scenarios.
Figure 1. Sample images from the Caltech Pedestrian Dataset
We selected three representative models that are widely recognized in the research and industry communities for their efficiency:
YOLOv8n. YOLOv8n is the nano version of the YOLOv8 family developed by Ultralytics. It is a modern anchor-free object detection model that integrates advanced architectural enhancements, such as decoupled heads, feature fusion, and flexible backbone scaling. Operating at an input resolution of 640×640, it offers a strong balance between speed and accuracy, making it an ideal candidate for real-time pedestrian detection. The model is implemented in PyTorch and is evaluated using CPU-only inference to simulate deployment on embedded or mobile systems.
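A minimal sketch of this setup using the Ultralytics Python API is shown below. The weight file name and frame directory are illustrative; predictions are restricted to the COCO "person" class to approximate pedestrian-only detection without retraining.

```python
from ultralytics import YOLO

# Sketch of the YOLOv8n evaluation setup via the Ultralytics API.
# COCO class 0 is "person", so filtering on it keeps only pedestrian-like boxes.
model = YOLO("yolov8n.pt")        # pretrained nano weights (~6 MB)

results = model.predict(
    source="eval_frames",         # the sampled Caltech frames (assumed path)
    imgsz=640,                    # input resolution used in our runs
    device="cpu",                 # CPU-only inference, as in the study
    classes=[0],                  # keep only the "person" class
    verbose=False,
)
for r in results:
    print(r.path, len(r.boxes), "detections")
```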
NanoDet-Plus-m-1.5x. NanoDet is a compact, real-time, anchor-free detection model specifically optimized for edge devices [12]. The Plus-m-1.5x variant employs a RepVGG backbone and Group Normalization, along with improved multi-scale feature fusion. It uses a 320×320 input resolution and has a small model size (~4.7 MB), which makes it well-suited for memory-constrained devices. The model was tested using its original PyTorch .ckpt weights on CPU, and inference was conducted using the default NanoDet demo pipeline. While designed for speed, its use of attention modules and deep architecture introduces a measurable performance cost on CPU.
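For reference, invoking that default pipeline looks roughly like the call below, following the usage documented in the NanoDet repository (RangiLyu/nanodet); the config and checkpoint file names are assumptions based on that repository's conventions.

```python
import subprocess

# Hedged sketch: run NanoDet's bundled demo script on the sampled frames.
# File names below are assumptions; check the repo's config/ directory for
# the exact nanodet-plus-m-1.5x_320 config and matching .ckpt weights.
subprocess.run([
    "python", "demo/demo.py", "image",
    "--config", "config/nanodet-plus-m-1.5x_320.yml",
    "--model", "nanodet-plus-m-1.5x_320.ckpt",
    "--path", "eval_frames",
], check=True)
```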
MobileNet-SSD. MobileNet-SSD is a classic single-shot object detector that combines the MobileNet backbone with the SSD detection head [13]. It is implemented using OpenCV’s DNN module, which offers highly optimized C++ inference, making it extremely fast even on CPU. The model uses a 300×300 input resolution and is known for its fast performance and minimal memory footprint. However, MobileNet-SSD is trained on the VOC dataset and not specialized for pedestrian detection. In our experiments, it failed to generalize well to the Caltech Pedestrian Dataset, resulting in frequent false positives (e.g., misclassifying trees as pedestrians) and very limited actual pedestrian detections.
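A minimal sketch of this pipeline with OpenCV's DNN module is shown below, assuming the commonly distributed VOC-trained Caffe deploy files; in the VOC label map, class id 15 corresponds to "person".

```python
import cv2
import numpy as np

# Sketch of the OpenCV DNN pipeline for the Caffe MobileNet-SSD, assuming
# the widely used VOC-trained deploy files (names are illustrative).
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

img = cv2.imread("eval_frames/set06_V000_00000.jpg")  # assumed file name
h, w = img.shape[:2]

# 300x300 input, scaled to roughly [-1, 1] as this model expects
blob = cv2.dnn.blobFromImage(img, scalefactor=1 / 127.5,
                             size=(300, 300), mean=127.5)
net.setInput(blob)
detections = net.forward()  # shape (1, 1, N, 7): id, class, conf, x1, y1, x2, y2

for i in range(detections.shape[2]):
    class_id = int(detections[0, 0, i, 1])
    conf = float(detections[0, 0, i, 2])
    if class_id == 15 and conf > 0.5:   # keep confident "person" boxes only
        x1, y1, x2, y2 = (detections[0, 0, i, 3:7] *
                          np.array([w, h, w, h])).astype(int)
        print("person", conf, (x1, y1, x2, y2))
```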
To ensure a fair comparison, all models were evaluated on the same 1174 frames extracted from the Caltech Pedestrian Dataset, using CPU-only inference. No fine-tuning or retraining was performed. Each model was evaluated for its inference speed (FPS), average latency per image, model size, and detection output, while accuracy metrics (mAP) were reported where available.
Evaluation and metrics. To compare the three models, we used several standard evaluation metrics that capture both how accurate each model is and how fast it runs. Both accuracy and speed matter for real-world pedestrian detection systems, especially on mobile devices where memory and computing power are limited.
The first metric is mean Average Precision at an Intersection-over-Union (IoU) threshold of 0.5, written mAP@0.5. A prediction counts as correct when the predicted box overlaps the ground-truth box by at least 50% (IoU ≥ 0.5). A higher mAP value means the model detects pedestrians more accurately.
The second metric is mAP@0.5:0.95, a stricter test that averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. It shows how stable the model is in harder situations, such as small or occluded pedestrians.
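To make the matching rule concrete, the helper below computes IoU for two axis-aligned boxes; it is a generic sketch, not our full evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction is a true positive at mAP@0.5 when IoU >= 0.5; mAP@0.5:0.95
# repeats the matching at thresholds 0.5, 0.55, ..., 0.95 and averages.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333... (one-third overlap)
```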
We also measured how fast each model runs. Frames per second (FPS) is the number of images a model can process in one second; a higher FPS is better for real-time applications such as driver assistance or smartphone apps. A related metric is the average inference time per image, the time the model takes to process a single image, which indicates how much delay the system would add in real-time use.
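The sketch below shows how such timing can be measured. Here `run_inference` is a hypothetical stand-in for a single-image forward pass of any of the three models, and images are preloaded so disk I/O is excluded from the measured latency.

```python
import time
from pathlib import Path
import cv2

def benchmark(run_inference, image_dir="eval_frames"):
    """Average per-image latency and FPS for any model's inference callable.

    `run_inference` is a hypothetical wrapper around a single-image forward
    pass (e.g. the YOLOv8n or OpenCV DNN calls sketched above).
    """
    paths = sorted(Path(image_dir).glob("*.jpg"))
    images = [cv2.imread(str(p)) for p in paths]  # preload to exclude disk I/O

    start = time.perf_counter()
    for img in images:
        run_inference(img)
    elapsed = time.perf_counter() - start

    per_image = elapsed / len(images)
    print(f"{per_image:.3f} s/image, {1 / per_image:.1f} FPS")
```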
Finally, we looked at the model size in megabytes. A smaller model is easier to run on mobile phones or small devices with limited memory and storage.
All models were tested on the same computer using only the CPU. This gives a fair comparison and simulates how these models would work on real devices.
Results and discussions
All three models were tested on 1174 images from the Caltech Pedestrian Dataset. These images contain real scenes from city traffic, including different weather, lighting, and pedestrian positions. The goal was to see how well each model detects pedestrians and how fast it can do that.
The table below shows the accuracy and speed of each model:
Table 1.
Accuracy and speed comparison of detection models
| Model          | mAP@0.5 | mAP@0.5:0.95 | FPS  |
|----------------|---------|--------------|------|
| YOLOv8n        | 0.404   | 0.158        | 5.9  |
| NanoDet-Plus-m | 0.299   | 0.153        | 3.1  |
| MobileNet-SSD  | 0.08    | 0.04         | 70.6 |
We also measured how long it took each model to process one image and how large the model file is:
Table 2.
Average inference time and model size for each evaluated model
| Model          | Time/Image (s) | Size (MB) |
|----------------|----------------|-----------|
| YOLOv8n        | 0.169          | 6.2       |
| NanoDet-Plus-m | 0.321          | 4.7       |
| MobileNet-SSD  | 0.014          | 5.0       |
As the tables show, YOLOv8n gave the best results overall: it had the highest accuracy and, at 5.9 FPS on CPU, was fast enough for real-time use, making it a good choice when both speed and precision are needed. NanoDet-Plus-m was less accurate (0.299 vs. 0.404 mAP@0.5) and roughly half as fast, but it had the smallest model file, which helps in mobile or embedded systems where memory is limited; it is a good option when saving space matters more than peak accuracy. MobileNet-SSD was the fastest model by a wide margin, processing images more than ten times faster than the others, but its accuracy was very low: it missed most pedestrians and produced many false detections, such as confusing trees with people. Because of this, it is not a good choice unless it is retrained on more suitable data.
These results show that there is always a trade-off between accuracy, speed, and size. Choosing the right model depends on the needs of the application.
Conclusion
In this study, we compared three lightweight object detection models to find out which one works best for pedestrian detection on mobile or edge devices. The models we tested were YOLOv8n, NanoDet-Plus-m, and MobileNet-SSD. We used the Caltech Pedestrian Dataset to evaluate how accurate and fast each model is. All tests were done using CPU to simulate real-world usage on low-power devices.
YOLOv8n gave the best results overall. It had the highest accuracy and worked fast enough to be used in real time. This makes it a good choice for mobile applications where both speed and accuracy are important.
NanoDet-Plus-m also performed reasonably well. Its accuracy was lower than YOLOv8n's and it ran more slowly, but it had the smallest model size, which is helpful for systems with limited memory. It can be a good choice when storage space matters more than raw speed or peak accuracy.
MobileNet-SSD was the fastest model. It processed images much faster than the other two. But it had very low accuracy and often gave wrong results, such as detecting trees instead of people. Because of this, it is not suitable for pedestrian detection unless retrained with better data.
Overall, our results show that YOLOv8n is the most balanced model. It gives high accuracy and good speed, making it a strong option for real-time pedestrian detection on mobile devices. NanoDet-Plus-m is a good second choice for memory-constrained environments. MobileNet-SSD is not recommended for this task in its current form.
References:
1. World Health Organization. (2023). Global status report on road safety 2023.
2. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. IEEE Conference on Computer Vision and Pattern Recognition, 779–788.
3. Tiron, G. Z., & Poboroniuc, M. S. (2020). Benchmarking general-purpose neural networks for real-time pedestrian detection. IEEE, 277–280.
4. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition, 886–893.
5. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 580–587.
6. Benhamida, A., Varkonyi-Koczy, A. R., & Kozlovszky, M. (2020). Traffic signs recognition in a mobile-based application using TensorFlow and transfer learning technics. IEEE, 537–541.
7. Mao, J., Wang, H., & Li, D. (2024). Optimized and improved YOLOv8 dense pedestrian detection algorithm. IEEE.
8. Hu, C., Wei, Y., & Tao, X. (2024). Lightweight YOLOv8 pedestrian detection model based on FasterNet. IEEE, 787–790.
9. Zadobrischi, M. N. E. (2020). Pedestrian detection based on TensorFlow YOLOv3 embedded in a portable system adaptable to vehicles. IEEE.
10. Dixit, I. A., & Bhoite, S. (2024). Analysis of performance of YOLOv8 algorithm for pedestrian detection. IEEE, 1918–1924.
11. Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2009). Pedestrian detection: A benchmark. IEEE Conference on Computer Vision and Pattern Recognition, 304–311.
12. RangiLyu. (2021). NanoDet-Plus: Super fast and high accuracy lightweight anchor-free object detection model. Retrieved from https://github.com/RangiLyu/nanodet
13. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Retrieved from https://arxiv.org/abs/1704.04861