International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 47
Published: October 2025
Authors: Dimitrios Papakyriakou, Ioannis S. Barbounakis
DOI: 10.5120/ijca2025925785
Dimitrios Papakyriakou, Ioannis S. Barbounakis. Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters. International Journal of Computer Applications. 187, 47 (October 2025), 6-17. DOI=10.5120/ijca2025925785
@article{ 10.5120/ijca2025925785,
author = { Dimitrios Papakyriakou and Ioannis S. Barbounakis },
title = { Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters },
journal = { International Journal of Computer Applications },
year = { 2025 },
volume = { 187 },
number = { 47 },
pages = { 6-17 },
doi = { 10.5120/ijca2025925785 },
publisher = { Foundation of Computer Science (FCS), NY, USA }
}
%0 Journal Article
%D 2025
%A Dimitrios Papakyriakou
%A Ioannis S. Barbounakis
%T Deep Learning for Edge AI: SqueezeNet CNN Training on Distributed ARM-Based Clusters
%J International Journal of Computer Applications
%V 187
%N 47
%P 6-17
%R 10.5120/ijca2025925785
%I Foundation of Computer Science (FCS), NY, USA
The increasing demand for lightweight and energy-efficient deep learning models at the edge has fueled interest in training convolutional neural networks (CNNs) directly on ARM-based CPU clusters. This study examines the feasibility and performance constraints of distributed training for the compact SqueezeNet v1.1 architecture, implemented using an MPI-based parallel framework on a Beowulf cluster composed of Raspberry Pi devices. Experimental evaluation across up to 24 Raspberry Pi nodes (48 MPI processes) reveals a sharp trade-off between training acceleration and model generalization. While wall-clock training time improves by over 11× under increased parallelism, test accuracy deteriorates significantly, collapsing to chance-level performance (≈10%) as the data partition per process becomes excessively small. This behavior highlights a statistical scaling limit, beyond which computational gains are offset by learning inefficiency. The findings are consistent with the statistical bottlenecks identified by Shallue et al. (2019) [11], extending their observations from large-scale GPU/CPU systems to energy-constrained ARM-based edge clusters. These results underscore the importance of balanced task decomposition in CPU-bound environments and contribute new insights into the complex interplay between model compactness, data sparsity, and parallel training efficiency in edge-AI systems. This framework also provides a viable low-power platform for real-time SNN research on edge devices.
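To make the training scheme described in the abstract concrete, the following sketch shows one way synchronous, MPI-based data-parallel training of SqueezeNet v1.1 could be expressed in Python with mpi4py and PyTorch. It is an illustrative assumption, not the authors' actual implementation: the dataset sharding, class count, hyperparameters, and launch command are placeholders. Each MPI rank trains on its own shard of the data and gradients are averaged with an all-reduce after every backward pass, which is why the per-rank partition shrinks as the number of processes grows.

    # Minimal sketch of synchronous MPI data-parallel CNN training on CPU.
    # Assumes mpi4py, PyTorch and torchvision are installed; dataset, shard
    # sizes and hyperparameters are placeholders, not the paper's exact setup.
    import torch
    import torch.nn.functional as F
    from torchvision import models
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, world = comm.Get_rank(), comm.Get_size()

    # SqueezeNet v1.1: a compact CNN (~1.2M parameters), trained here on CPU.
    model = models.squeezenet1_1(num_classes=10)
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def average_gradients(model):
        """All-reduce every gradient so all ranks apply the identical update."""
        for p in model.parameters():
            if p.grad is None:
                continue
            buf = p.grad.numpy()                   # CPU tensor, shares memory
            comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
            p.grad /= world                        # sum -> mean across ranks

    def train_epoch(loader):
        """One pass over this rank's shard; `loader` yields (images, labels)."""
        model.train()
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            average_gradients(model)               # synchronisation point
            opt.step()

    # Launched e.g. as: mpirun -np 48 python train_squeezenet.py
    # With N training samples, each rank sees only ~N / world examples.

In such a synchronous scheme the effective global batch grows with the number of ranks while each rank's local dataset shrinks, which is one way to read the statistical scaling limit the abstract reports: past a certain process count, the per-process partitions become too small for the model to generalize, even though wall-clock time keeps improving.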