Research Article

Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters

by Dimitrios Papakyriakou, Ioannis S. Barbounakis
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 47
Published: October 2025
DOI: 10.5120/ijca2025925785

Dimitrios Papakyriakou, Ioannis S. Barbounakis. Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters. International Journal of Computer Applications. 187, 47 (October 2025), 6–17. DOI=10.5120/ijca2025925785

@article{10.5120/ijca2025925785,
  author    = { Dimitrios Papakyriakou and Ioannis S. Barbounakis },
  title     = { Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters },
  journal   = { International Journal of Computer Applications },
  year      = { 2025 },
  volume    = { 187 },
  number    = { 47 },
  pages     = { 6-17 },
  doi       = { 10.5120/ijca2025925785 },
  publisher = { Foundation of Computer Science (FCS), NY, USA }
}

%0 Journal Article
%D 2025
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters
%J International Journal of Computer Applications
%V 187
%N 47
%P 6-17
%R 10.5120/ijca2025925785
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The increasing demand for lightweight and energy-efficient deep learning models at the edge has fueled interest in training convolutional neural networks (CNNs) directly on ARM-based CPU clusters. This study examines the feasibility and performance constraints of distributed training for the compact SqueezeNet v1.1 architecture, implemented using an MPI-based parallel framework on a Beowulf cluster composed of Raspberry Pi devices. Experimental evaluation across up to 24 Raspberry Pi nodes (48 MPI processes) reveals a sharp trade-off between training acceleration and model generalization. While wall-clock training time improves by more than 11× under increased parallelism, test accuracy deteriorates significantly, collapsing to chance-level performance (≈10%) as data partitions per process become excessively small. This behavior highlights a statistical scaling limit beyond which computational gains are offset by learning inefficiency. The findings are consistent with the statistical bottlenecks identified by Shallue et al. (2019) [11], extending their observations from large-scale GPU/CPU systems to energy-constrained ARM-based edge clusters. The results underscore the importance of balanced task decomposition in CPU-bound environments and contribute new insights into the complex interplay between model compactness, data sparsity, and parallel training efficiency in edge-AI systems. The framework also provides a viable low-power platform for real-time CNN research on edge devices.
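The paper's training code is not reproduced on this page, so the following is a minimal sketch of the kind of synchronous, MPI-based data-parallel loop the abstract describes: every rank holds a replica of SqueezeNet v1.1, trains on its own data shard, and averages gradients with an MPI Allreduce before each optimizer step. It assumes mpi4py, PyTorch, and torchvision are installed on every node; the synthetic per-rank shard, the hyperparameters, and the launch command are illustrative assumptions, not the authors' configuration.

# Sketch: synchronous data-parallel SqueezeNet training over MPI (mpi4py + PyTorch).
# Hypothetical launch: mpirun -np 48 --hostfile hosts python train_squeezenet_mpi.py
from mpi4py import MPI
import torch
import torch.nn.functional as F
from torchvision.models import squeezenet1_1

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

torch.manual_seed(0)                     # same seed -> identical initial weights on every rank
model = squeezenet1_1(num_classes=10)    # compact CNN, ~1.2M parameters
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Stand-in for this rank's shard of a real dataset: with N samples and P
# processes, each rank sees only ~N/P examples, which is the statistical
# bottleneck the abstract reports at high process counts.
torch.manual_seed(1 + rank)              # different synthetic shard per rank
shard = torch.utils.data.TensorDataset(
    torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = torch.utils.data.DataLoader(shard, batch_size=32, shuffle=True)

def allreduce_gradients():
    # Sum gradients across all ranks in place, then divide to get the mean;
    # p.grad.numpy() shares memory with the tensor, so p.grad is updated directly.
    for p in model.parameters():
        if p.grad is not None:
            g = p.grad.numpy()
            comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)
            g /= world

for epoch in range(2):                   # a couple of epochs, for illustration only
    for x, y in loader:
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        allreduce_gradients()            # every rank then applies the same averaged step
        opt.step()
    if rank == 0:
        print(f"epoch {epoch}: last minibatch loss {loss.item():.4f}")

The shard-size arithmetic behind the reported accuracy collapse is direct: a 50,000-image training set (a hypothetical, CIFAR-10-sized figure; the paper's dataset size is not stated on this page) split across 48 MPI processes leaves roughly 1,040 examples per rank, the regime in which per-partition statistics stop supporting generalization.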

References
  • Shi, W., Cao, J., Zhang, Q., Li, Y., & Xu, L. (2016). Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5), 637–646. https://doi.org/10.1109/JIOT.2016.2579198
  • Li, S., Xu, L. D., & Zhao, S. (2018). 5G Internet of Things: A survey. Journal of Industrial Information Integration, 10, 1–9. https://doi.org/10.1016/j.jii.2018.01.005
  • Sze, V., Chen, Y. H., Yang, T. J., & Emer, J. S. (2017). Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE, 105(12), 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. https://arxiv.org/abs/1704.04861
  • Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv Preprint. https://doi.org/10.48550/arXiv.1602.07360
  • Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2017). Pruning filters for efficient convnets. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1608.08710
  • Ramesh, S., & Chakrabarty, K. (2021). Challenges and opportunities in training deep neural networks on edge devices. ACM Transactions on Embedded Computing Systems (TECS), 20(5s), 1–26. https://doi.org/10.1145/3477084
  • Raspberry Pi 4 Model B. [Online]. Available: https://www.raspberrypi.com/products/raspberry-pi-4-model-b/
  • Raspberry Pi 4 Model B specifications. [Online]. Available: https://magpi.raspberrypi.com/articles/raspberry-pi-4-specs-benchmarks
  • Masters, D., & Luschi, C. (2018). Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612. https://arxiv.org/abs/1804.07612
  • Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., & Dahl, G. E. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20(112), 1–49. http://jmlr.org/papers/v20/18-789.html
  • Ben-Nun, T., & Hoefler, T. (2019). Demystifying parallel and distributed deep learning: An in-depth concurrency analysis. ACM Computing Surveys, 52(4), 1–43. https://doi.org/10.1145/3320060
Index Terms
Computer Science
Information Sciences
Keywords

SqueezeNet, Distributed Deep Learning, Edge Computing, Raspberry Pi Cluster, Beowulf Cluster, ARM Architecture, MPI (Message Passing Interface), Low-Power AI, Strong Scaling, Model Generalization, Statistical Scaling Limit
