Master's student, School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan, Almaty
EFFICIENT IMAGE RECOGNITION ON MOBILE DEVICES: ALGORITHMS, CHALLENGES AND SOLUTIONS
UDC 004.6
ABSTRACT
The increasing popularity of intelligent mobile applications has resulted in a high demand for image recognition capabilities on mobile devices. Nevertheless, integrating sophisticated computer vision architectures into mobile platforms continues to pose substantial engineering challenges, primarily attributed to limited computational capacity, power consumption constraints, and the stringent demands of real-time operation. This work presents a systematic comparative study of contemporary image recognition methodologies adapted for mobile deployment. The investigation encompasses lightweight convolutional neural network (CNN) architectures, notably MobileNet and EfficientNet family models, in conjunction with established model compression strategies, including quantization and weight pruning. Hardware-aware optimization techniques exploiting mobile graphics processing units (GPUs) and dedicated neural processing units (NPUs) are further examined.
For empirical validation, a proof-of-concept mobile application was implemented and evaluated across a heterogeneous set of devices representing diverse hardware configurations. Performance characterization was conducted along three principal dimensions: inference latency, classification accuracy, and energy consumption. The experimental outcomes indicate that applying model optimization strategies yields latency and energy reductions in the range of 40–60%, while incurring only marginal accuracy loss of approximately 1–3%, with variation subject to the specific model architecture and target hardware platform.
ANNOTATION
The rapid spread of intelligent mobile applications has substantially increased the demand for image recognition systems that run directly on the device. Nevertheless, integrating complex computer vision architectures into mobile platforms still poses significant engineering difficulties, caused primarily by limited computing power, energy consumption constraints, and strict real-time requirements.
This paper presents a systematic comparative study of modern image recognition methods adapted for deployment on mobile devices. The study considers lightweight convolutional neural network (CNN) architectures, in particular models of the MobileNet and EfficientNet families, in combination with proven model compression methods, including quantization and weight pruning. In addition, hardware-oriented optimization methods that use mobile graphics processing units (GPUs) and specialized neural processing units (NPUs) are analyzed.
For empirical verification, an experimental mobile version of the application was implemented and tested on a heterogeneous set of devices with different hardware configurations. Performance was evaluated along three key parameters: inference latency, classification accuracy, and energy consumption. The experimental results show that applying model optimization methods reduces latency and energy consumption by 40–60%, while accuracy losses are only about 1–3% and vary depending on the specific model architecture and target hardware platform.
Keywords: Image recognition, Lightweight CNN, Model Quantization, Edge Computing, MobileNet, EfficientNet
Keywords: Image recognition, Convolutional neural networks, Model quantization, Weight pruning, Neural processing unit (NPU), Inference latency, Energy efficiency
Introduction
With mobile devices accounting for more than 90% of global internet access, the need for efficient on-device processing, particularly for image recognition, has become critical [1]. However, limited computational resources and energy constraints pose significant challenges to deploying deep learning models on mobile platforms. High-performing image recognition algorithms are designed for high-performance computing environments and require considerable processing capacity, which makes them impractical for mobile phones with limited computational resources, battery constraints, and real-time processing requirements. Although deep learning models such as convolutional neural networks (CNNs) have attained remarkable accuracy [2], using them on mobile devices tends to cause high memory consumption, high latency, and excessive energy use. Optimization methods such as model quantization, pruning, and edge computing have been proposed, but a gap remains in balancing resource efficiency, speed, and accuracy [3]. Accordingly, there is a critical need for improved image recognition algorithms for mobile phones that are lightweight and efficient while delivering strong performance at an acceptable cost in accuracy.
Image recognition has evolved significantly with advancements in deep learning, computer vision, and mobile computing. Traditional methods relied on handcrafted features such as SIFT and HOG [4], but deep learning, particularly convolutional neural networks (CNNs), has demonstrated superior performance in tasks like object detection and classification. However, deploying these models on mobile devices poses challenges due to limited computational resources, memory constraints, and tight power budgets.
Optimization techniques have been developed to address these limitations. Model compression methods, including quantization and pruning, reduce the size and complexity of deep learning models while maintaining accuracy [5]. Knowledge distillation, where a smaller model learns from a larger, pre-trained model, is another strategy to optimize inference on mobile platforms [6]. Hardware-aware neural architecture search (NAS) has also been explored to design efficient models tailored to specific mobile processors [7].
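Knowledge distillation is commonly implemented as a combined loss: the student matches the teacher's temperature-softened output distribution while also fitting the ground-truth label. The plain-Python sketch below follows this common soft-target formulation for a single sample; the temperature, the weight `alpha`, and all function names are illustrative assumptions rather than settings from this study.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; a higher temperature softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      label: int, temperature: float = 4.0, alpha: float = 0.7) -> float:
    """Weighted sum of the soft-target (teacher) term and the hard-label cross-entropy.

    The soft term is scaled by T^2, a common convention that keeps its gradient
    magnitude comparable to the hard term when the temperature changes.
    """
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[label])
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted cross-entropy on the true label remains; any divergence from the teacher increases the loss.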
Edge computing and federated learning offer alternative solutions by shifting computations to edge devices or distributing model training across multiple users without centralizing data [8]. These approaches improve privacy and reduce latency, making them viable for mobile-based image recognition applications.
Despite these advancements, challenges remain in balancing model accuracy, speed, and energy consumption. Recent studies have explored hybrid approaches that combine cloud and edge processing to optimize real-time recognition on mobile devices [9]. As other studies [10], [11] note, addressing these issues is crucial for the seamless integration of image recognition into applications such as augmented reality, autonomous navigation, and mobile healthcare. This research aims to develop and optimize lightweight, high-performance image recognition algorithms tailored for mobile environments.
Despite significant progress in image recognition, existing algorithms often struggle with efficiency on mobile devices because of high computational demands and energy consumption. Many state-of-the-art models are optimized for powerful hardware but fail to deliver real-time performance in resource-constrained environments. Compression techniques such as model pruning and quantization can reduce model size but may degrade accuracy, limiting their effectiveness for complex image recognition tasks.
Furthermore, current approaches often lack adaptability to diverse mobile hardware architectures, leading to inconsistent performance across different devices. Most solutions rely on cloud-based processing to offload computational load, which introduces latency, privacy concerns, and dependency on network connectivity. Addressing these gaps requires novel optimization strategies that enhance efficiency without sacrificing accuracy, ensuring robust performance directly on mobile devices.
In this research, we aim to develop an optimized image recognition algorithm for mobile devices by integrating model pruning, quantization, and knowledge distillation techniques. We will use a modified MobileNetV2 architecture, fine-tuned for low-power consumption and real-time performance. Our hypothesis is that by applying structured pruning and 8-bit quantization, we can reduce the model size by at least 50% while maintaining over 80% accuracy on the ImageNet dataset.
To evaluate the effectiveness of our approach, we will compare the optimized model against baseline architectures (such as EfficientNet and MobileNet) using key performance metrics: accuracy, inference speed (measured in FPS), and energy consumption (measured in joules per inference) on ARM-based mobile processors. The expected outcome is a model that achieves at least 30 FPS on mid-range mobile hardware while consuming 40% less energy than standard deep learning models.
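A 30 FPS target translates directly into a per-frame latency budget of 1000/30 ≈ 33.3 ms. The small helpers below make that conversion explicit. They assume one sequential inference per frame and treat inference as the only per-frame cost; camera capture, preprocessing, and rendering overheads (not modeled here) would tighten the budget further. The function names are illustrative.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Upper bound on frames per second if each frame needs one inference of latency_ms."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def latency_budget_ms(target_fps: float) -> float:
    """Maximum per-frame inference time (ms) that still sustains target_fps."""
    if target_fps <= 0:
        raise ValueError("target FPS must be positive")
    return 1000.0 / target_fps


def meets_target(latency_ms: float, target_fps: float = 30.0) -> bool:
    """True if one inference per frame fits within the frame budget (no other costs assumed)."""
    return latency_ms <= latency_budget_ms(target_fps)
```

For example, a model with 25 ms average latency fits the 30 FPS budget, whereas one with 45 ms caps out at about 22 FPS under the same assumptions.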
Materials and methods
In this paper, we aim to evaluate lightweight image recognition models optimized for mobile deployment. The methodology includes four main stages: dataset preparation, model selection and optimization, mobile implementation, and performance evaluation. The general workflow of the methodology is illustrated in Figure 1.
Figure 1. General workflow of the methodology
Dataset
We used a reduced version of the ImageNet ILSVRC 2012 dataset, consisting of 100 selected classes to accommodate mobile constraints and ensure consistent testing. Each class contains 500 images resized to 224×224 resolution. The summary of the image recognition dataset is presented in Table 1.
Table 1.
Summary of the Image Recognition Dataset
| Dataset | Classes | Images/Class | Total Images | Image Size |
|---|---|---|---|---|
| Custom ImageNet-100 | 100 | 500 | 50,000 | 224×224 px |
To support efficient testing of image recognition models under mobile resource constraints, we prepared a tailored subset of ImageNet ILSVRC 2012. The full ImageNet database contains over 14 million labeled images, and the ILSVRC 2012 benchmark alone covers 1,000 object classes with roughly 1.28 million training images. While such scale is convenient for training large models, it is difficult to handle under mobile constraints on memory, processing, and power. To address this, we used a balanced subset of 100 semantically distinct classes from the ILSVRC 2012 set, chosen to represent objects common in everyday mobile usage (e.g., pets, kitchen appliances, automobiles). We then randomly selected 500 images per class, for a total of 50,000 images. All images were resized to a uniform resolution of 224×224 pixels, which is also the standard input resolution of efficient CNN models such as MobileNet and EfficientNet. Resizing provides a uniform input format for all models and satisfies the input requirements of the pre-trained models.
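The balanced sampling step can be sketched as follows. The sketch assumes the images are already indexed as a mapping from class identifier to a list of file paths (a hypothetical layout) and that the 100 classes are chosen by hand beforehand; only the images within each class are drawn at random, with a fixed seed so the subset is reproducible. Resizing to 224×224 would be applied afterwards with an image library and is omitted here.

```python
import random
from typing import Dict, List


def build_balanced_subset(
    index: Dict[str, List[str]],
    selected_classes: List[str],
    per_class: int = 500,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Randomly sample `per_class` image paths from each pre-selected class.

    `index` maps a class identifier to all available image paths for that class
    (hypothetical layout). The caller chooses the classes; only images are sampled.
    """
    rng = random.Random(seed)  # fixed seed -> reproducible subset
    subset: Dict[str, List[str]] = {}
    for cls in selected_classes:
        paths = index[cls]
        if len(paths) < per_class:
            raise ValueError(f"class {cls!r} has only {len(paths)} images")
        # Sorting first makes the draw independent of the input ordering.
        subset[cls] = rng.sample(sorted(paths), per_class)
    return subset
```

With 100 classes and `per_class=500`, the result holds exactly 50,000 paths, matching Table 1.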
Algorithm selection
Because mobile devices are constrained in processing power, memory, and power consumption, it is essential to select and optimize appropriate deep learning models to enable real-time image classification. The first step was to choose recent CNN architectures that are widely recognized for their efficiency in parameter count, computational cost, and inference time while retaining good classification accuracy:
1) MobileNetV2: This architecture employs depthwise separable convolutions and linear bottlenecks within inverted residual blocks to significantly reduce the number of parameters and computations, making it well suited to mobile applications [12].
2) ShuffleNetV2: Known for its favorable speed–accuracy trade-off, this model uses channel split and channel shuffle operations (replacing the pointwise group convolutions of the original ShuffleNet) to improve memory access efficiency and computational speed.
3) EfficientNet-Lite0: The smallest model of EfficientNet-Lite, a mobile-optimized variant of the EfficientNet family, whose models are obtained by compound scaling of depth, width, and resolution; it achieves strong accuracy while maintaining low inference latency.
We selected three CNN architectures commonly used in mobile environments (Table 2):
Table 2.
Selected Lightweight CNN Models
| Model | Base Parameters (M) | Size (MB) | Pretrained Source |
|---|---|---|---|
| MobileNetV2 | 3.4 | 14 | TensorFlow Hub |
| ShuffleNetV2 | 5.0 | 13 | PyTorch Hub |
| EfficientNet-Lite0 | 5.4 | 17 | TensorFlow Lite |
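The savings behind the MobileNet and ShuffleNet designs can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for a standard k×k convolution versus a depthwise separable one, and computes the channel order produced by a channel shuffle, the permutation the ShuffleNet family uses to let information flow between channel groups. The layer sizes in the example are arbitrary and are not taken from the evaluated models.

```python
from typing import List


def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Depthwise k x k conv (one filter per input channel) followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


def channel_shuffle_order(num_channels: int, groups: int) -> List[int]:
    """Source channel index for each output position after a channel shuffle.

    Equivalent to reshaping the channels to (groups, C/groups), transposing,
    and flattening, so every new group receives channels from all old groups.
    """
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    return [g * per_group + i for i in range(per_group) for g in range(groups)]
```

For a 3×3 layer mapping 32 to 64 channels, this gives 18,432 weights for the standard convolution versus 2,336 for the depthwise separable version, a ratio of 1/c_out + 1/k² ≈ 0.127. With 6 channels in 2 groups, the shuffle yields the order [0, 3, 1, 4, 2, 5].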
Optimization techniques
To further improve the performance of these models on mobile devices, we applied a series of optimization techniques.
Post-Training Quantization:
All models were quantized from 32-bit floating point (FP32) to 8-bit integer (INT8) precision through post-training quantization. This reduces model size and accelerates inference, usually with only a small loss of accuracy. Algorithm 1 converts a previously trained neural network into a smaller, faster, and more energy-efficient variant by applying 8-bit integer quantization.
Algorithm 1 Post-Training Quantization
1: Input: Trained TensorFlow model
2: Output: Quantized TFLite model
3: Load the pre-trained model
4: Define a representative dataset for calibration
5: Initialize the TFLiteConverter
6: Set optimization parameters:
7: converter.optimizations = [tf.lite.Optimize.DEFAULT]
8: converter.representative_dataset = representative_data
9: converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
10: Convert the model using converter.convert()
11: Save the quantized model to disk
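To make the calibration step in Algorithm 1 concrete, the sketch below shows the asymmetric (affine) INT8 mapping that post-training quantization relies on: representative values fix a float range [min, max], from which a scale and zero point are derived. It is a simplified per-tensor illustration in plain Python, not the TensorFlow Lite converter's internal code, and the helper names are hypothetical. The last helper shows why FP32-to-INT8 conversion shrinks weight storage by roughly 4× (4 bytes versus 1 byte per parameter, ignoring metadata).

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed 8-bit integer range


def calibrate(samples: List[float]) -> Tuple[float, int]:
    """Derive (scale, zero_point) from representative float values."""
    lo = min(min(samples), 0.0)  # the range must contain 0 so that 0.0 maps exactly
    hi = max(max(samples), 0.0)
    if hi == lo:  # degenerate all-zero input
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = int(round(QMIN - lo / scale))
    return scale, max(QMIN, min(QMAX, zero_point))


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Float -> int8 with rounding and saturation."""
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """int8 -> approximate float."""
    return (q - zero_point) * scale


def model_size_mb(num_params_millions: float, bytes_per_param: int) -> float:
    """Approximate weight storage in MB (1 MB = 10^6 bytes), ignoring file metadata."""
    return num_params_millions * bytes_per_param
```

For calibration data spanning [-1.0, 3.0], the scale is 4/255 and the zero point is -64, so 0.0 is represented exactly and in-range values round-trip with an error of at most half a quantization step. For MobileNetV2's 3.4 M parameters, the estimate is about 13.6 MB in FP32 (close to the 14 MB in Table 2) versus about 3.4 MB in INT8.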
Pruning: Algorithm 2 removes entire neurons, filters, or channels of low importance from the network, reducing model size and computation. This saves memory and compute, which is especially useful in convolution-heavy models such as MobileNetV2.
Algorithm 2 Structured Pruning
1: Input: Pre-trained model, training dataset
2: Output: Pruned model
3: Load pre-trained model
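4: Apply pruning wrapper:
5: model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)
6: Compile the pruned model
7: Train the model on the training dataset with the tfmot.sparsity.keras.UpdatePruningStep() callback
8: Strip pruning wrappers: model = tfmot.sparsity.keras.strip_pruning(model)
9: Save the pruned model for inference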
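By default, `prune_low_magnitude` from the TensorFlow Model Optimization Toolkit zeroes individual low-magnitude weights (unstructured sparsity) rather than deleting whole filters, so the filter-level removal described above needs a structured criterion. One common choice is to rank convolution filters by their L1 norm and keep the strongest ones, sketched below in plain Python. The function name and keep ratio are illustrative assumptions; in a real network, the matching input channels of the following layer must be removed as well.

```python
from typing import List, Sequence


def l1_filter_prune(filters: Sequence[Sequence[float]], keep_ratio: float) -> List[int]:
    """Return the indices of the filters to keep, ranked by L1 norm.

    `filters[i]` holds the flattened weights of output filter i.
    Filters with the smallest sum of absolute weights are treated as least important.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    norms = [sum(abs(w) for w in f) for f in filters]
    n_keep = max(1, round(len(filters) * keep_ratio))
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])  # preserve the original channel order
```

Given four filters with L1 norms 0.2, 3.0, 0.05, and 2.0, a keep ratio of 0.5 retains filters 1 and 3; the pruned layer is then `[filters[i] for i in keep]`, followed by fine-tuning as in step 7.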
Graph Optimization and Conversion:
The optimized models were ported to mobile-compatible formats: TensorFlow Lite (TFLite) for MobileNetV2 and EfficientNet-Lite0, and TorchScript for ShuffleNetV2.
Algorithm 3 outlines the pipeline for converting a trained deep learning model into a mobile-ready format using graph-level optimizations such as constant folding and operator fusion. The resulting models (TFLite or TorchScript) are more efficient and suitable for inference on mobile CPUs and NPUs.
Algorithm 3 Graph Optimization and Conversion for Mobile Deployment
1: Input: Trained deep learning model
2: Output: Optimized model in mobile-compatible format
3: Load the trained model (e.g., TensorFlow or PyTorch)
4: Apply graph optimizations:
- Constant folding
- Operator fusion
- Removal of unused nodes
5: Simplify the computational graph
6: if using TensorFlow then
7: Convert to TFLite format using TFLiteConverter
8: Apply post-training quantization if enabled
9: else if using PyTorch then
10: Trace or script the model using torch.jit
11: Save the model as TorchScript format
12: end if
13: Validate optimized model output for consistency
14: Deploy the model to the target mobile runtime
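Operator fusion in Algorithm 3 is easiest to see on the common convolution + batch normalization pair: at inference time the batch-norm statistics are constants, so they can be folded into the preceding layer's weights and bias, removing one operator from the graph. The sketch below does this for one output channel in plain Python, with a dot product standing in for the convolution; it illustrates the technique and is not the code path of any particular converter. The final check mirrors the output-consistency validation step.

```python
import math
from typing import List, Tuple


def fold_batchnorm(weights: List[float], bias: float, gamma: float, beta: float,
                   mean: float, var: float, eps: float = 1e-5) -> Tuple[List[float], float]:
    """Fold one channel's BatchNorm into the preceding conv/linear weights and bias."""
    s = gamma / math.sqrt(var + eps)
    fused_w = [w * s for w in weights]
    fused_b = (bias - mean) * s + beta
    return fused_w, fused_b


def linear(weights: List[float], bias: float, x: List[float]) -> float:
    """Stand-in for one output channel of a convolution: a dot product plus bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def batchnorm(y: float, gamma: float, beta: float, mean: float, var: float,
              eps: float = 1e-5) -> float:
    """Inference-mode BatchNorm with fixed running statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


if __name__ == "__main__":
    w, b = [0.5, -1.2, 2.0], 0.3
    bn = dict(gamma=1.5, beta=-0.2, mean=0.4, var=2.0)
    x = [1.0, 0.5, -0.7]
    fw, fb = fold_batchnorm(w, b, **bn)
    # Consistency check: the fused operator must reproduce conv followed by BN.
    assert math.isclose(linear(fw, fb, x), batchnorm(linear(w, b, x), **bn))
```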
These optimization and training steps were carried out on a high-end workstation with a discrete NVIDIA RTX 3080 GPU and 64 GB of RAM. The models were then run on three test devices with distinct hardware specifications. Android Studio was used for deployment, and the PyTorch Mobile runtime and the TensorFlow Lite Interpreter were used for on-device inference. The resulting models each have distinct trade-offs in size, accuracy, and performance, which are essential to consider in real-world mobile deployment. These differences form the basis of the comparative analysis in the following sections.
Mobile implementation
Deploying image recognition models to mobile hardware poses challenges that differ from those encountered in a traditional desktop or server context, including hardware diversity, power constraints, limited computational resources, and the requirement for real-time inference. In this research, the optimized models therefore had to be converted into formats executable on mobile devices and deployed to several devices for empirical evaluation.
Mobile Hardware Specifications: Experiments were carried out on three Android smartphones, with specifications summarized in Table 3:
Table 3.
Deployment Environment Details for Mobile Devices
| Device | SoC | RAM | OS Version | Toolkits Used |
|---|---|---|---|---|
| Device A | Snapdragon 730G | 6 GB | Android 12 | TensorFlow Lite, JNI |
| Device B | Exynos 9611 | 4 GB | Android 11 | PyTorch Mobile |
| Device C | MediaTek Helio G95 | 8 GB | Android 12 | TensorFlow Lite |
The models were integrated into a custom Android application that performs real-time classification on frames from the rear camera feed. A simplified user interface was also designed for the application.
Results and discussion
Models were evaluated based on Top-1 accuracy, average on-device inference time per image, and energy consumption over 100 predictions, measured with Android Battery Historian. The results are shown in Table 4.
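The accuracy and latency metrics can be computed as sketched below. The timing helper assumes a callable `predict` that runs one inference (a hypothetical stand-in for a TensorFlow Lite Interpreter or PyTorch Mobile call) and discards a few warm-up runs, since first invocations typically include one-off costs such as memory allocation; the warm-up count is an assumption, not a setting reported in this study. Top-1 accuracy counts a prediction as correct only when the highest-scoring class equals the label.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def avg_latency_ms(predict: Callable[[object], object],
                   inputs: Sequence[object], warmup: int = 5) -> float:
    """Mean wall-clock time per inference in milliseconds, excluding warm-up runs."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up: results and timings are discarded
    times = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        times.append((time.perf_counter() - start) * 1000.0)
    return mean(times)


def top1_accuracy(scores: List[Sequence[float]], labels: List[int]) -> float:
    """Percentage of samples whose highest-scoring class matches the label."""
    correct = sum(
        1 for s, y in zip(scores, labels)
        if max(range(len(s)), key=s.__getitem__) == y
    )
    return 100.0 * correct / len(labels)
```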
Table 4.
Model Performance Comparison on Mobile Devices
| Model | Top-1 Accuracy (%) | Avg. Inference Time (ms) | Energy Usage (mWh per 100 predictions) |
|---|---|---|---|
| MobileNetV2 | 74.6 | 45 | 1.2 |
| ShuffleNetV2 | 72.8 | 38 | 1.0 |
| EfficientNet-Lite0 | 76.4 | 58 | 1.5 |
Figure 2. Accuracy vs. Inference time
Figure 2 compares the three models in terms of accuracy (%) and inference time (ms).
The evaluation results show a clear trade-off among accuracy, inference time, and energy efficiency for the selected lightweight models. As Table 4 shows, EfficientNet-Lite0 achieved the highest accuracy (76.4%) but also had the longest inference time (58 ms) and the highest energy usage (1.5 mWh). ShuffleNetV2, in contrast, recorded the shortest inference time (38 ms) and the lowest energy usage (1.0 mWh), with the lowest accuracy of the three models (72.8%). MobileNetV2 offered a balanced profile, with 74.6% accuracy, a 45 ms inference time, and 1.2 mWh energy usage.
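Because Table 4 reports energy in mWh accumulated over 100 predictions, converting it to joules per inference (the unit named in the introduction) is a matter of unit arithmetic, since 1 mWh = 3.6 J. The snippet below applies this to the reported values; it only rescales the table and assumes the measured total divides evenly across the 100 predictions.

```python
MWH_TO_J = 3.6  # 1 mWh = 1e-3 W * 3600 s = 3.6 J

# Energy over 100 predictions as reported in Table 4 (mWh).
table4_energy_mwh = {
    "MobileNetV2": 1.2,
    "ShuffleNetV2": 1.0,
    "EfficientNet-Lite0": 1.5,
}


def mj_per_inference(total_mwh: float, n_predictions: int = 100) -> float:
    """Average energy per prediction in millijoules, assuming an even split."""
    return total_mwh * MWH_TO_J * 1000.0 / n_predictions


for model, mwh in table4_energy_mwh.items():
    print(f"{model}: {mj_per_inference(mwh):.1f} mJ per inference")
```

This gives about 43.2 mJ, 36.0 mJ, and 54.0 mJ per inference for MobileNetV2, ShuffleNetV2, and EfficientNet-Lite0, respectively.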
Conclusion
This research explored the design and optimization of image recognition algorithms for mobile devices, with an emphasis on the trade-off between accuracy, computation, and power. By comparing lightweight convolutional neural network architectures (MobileNetV2, ShuffleNetV2, and EfficientNet-Lite0), we showed that each of the three models offers a distinct trade-off suited to different mobile application scenarios. Experiments on real Android devices demonstrated that EfficientNet-Lite0 achieves the highest accuracy, whereas ShuffleNetV2 is more efficient in terms of latency and energy consumption, making it better suited to real-time or energy-constrained applications.
By combining post-training quantization, structured pruning, and knowledge distillation with other methods, we achieved additional model efficiency with little or no loss of accuracy. The findings highlight the need for algorithm-level optimization in responsive, power-conscious mobile AI applications.
References:
[1] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2018, pp. 6848–6856.
[2] Y. Tian, “Artificial intelligence image recognition method based on convolutional neural network algorithm,” IEEE Access, vol. 8, pp. 125731–125744, 2020.
[3] B. Ríos-Sánchez, D. C. D. Silva, N. Martin-Yuste, and C. Sánchez-Ávila, “Deep learning for face recognition on mobile devices,” IET Biometrics, vol. 9, pp. 109–117, May 2020.
[4] Trusov and E. Limonova, “The analysis of projective transformation algorithms for image recognition on mobile devices,” Dec. 2019. [Online]. Available
[5] H. Soliman, A. Saleh, and E. Fathi, “Face recognition in mobile devices,” International Journal of Computer Applications, vol. 73, pp. 13–20, Jul. 2013.
[6] R. Thabet, R. Mahmoudi, M. H. Bedoui, and M. H. BEDOUI, “Image processing on mobile devices: An overview,” 2014. [Online]. Available: https://hal.science/hal-03112423v1
[7] C. Morikawa, M. Kobayashi, M. Satoh, Y. Kuroda, T. Inomata, H. Matsuo, T. Miura, and M. Hilaga, “Image and video processing on mobile devices: a survey,” The Visual Computer, vol. 37, pp. 2931–2949, Dec. 2021.
[8] J. J. Hull, X. Liu, B. Erol, J. Graham, and J. Moraleda, “Mobile image recognition,” Association for Computing Machinery (ACM), 2010, p. 84.
[9] T. Shen, C. Gao, and D. Xu, “The analysis of intelligent real-time image recognition technology based on mobile edge computing and deep learning,” Journal of Real-Time Image Processing, vol. 18, pp. 1157–1166, Aug. 2021.
[10] J. C. Piao, H. S. Jung, C. P. Hong, and S. D. Kim, “Improving performance on object recognition for real-time on mobile devices,” Multimedia Tools and Applications, vol. 75, pp. 9623–9640, Aug. 2016.
[11] Martinez-Alpiste, G. Golcarenarenji, Q. Wang, and J. M. Alcaraz-Calero, “Smartphone-based real-time object recognition architecture for portable and constrained systems,” Journal of Real-Time Image Processing, vol. 19, pp. 103–115, Feb. 2022.
[12] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Jun. 2018, pp. 4510–4520.