Research Article

Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation

by Dimitrios Papakyriakou, Ioannis S. Barbounakis
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 32
Published: August 2025
DOI: 10.5120/ijca2025925585

Dimitrios Papakyriakou, Ioannis S. Barbounakis. Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation. International Journal of Computer Applications. 187, 32 (August 2025), 43-55. DOI=10.5120/ijca2025925585

                        @article{10.5120/ijca2025925585,
                          author    = { Dimitrios Papakyriakou and Ioannis S. Barbounakis },
                          title     = { Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation },
                          journal   = { International Journal of Computer Applications },
                          year      = { 2025 },
                          volume    = { 187 },
                          number    = { 32 },
                          pages     = { 43-55 },
                          doi       = { 10.5120/ijca2025925585 },
                          publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }
                        %0 Journal Article
                        %D 2025
                        %A Dimitrios Papakyriakou
                        %A Ioannis S. Barbounakis
                        %T Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation
                        %J International Journal of Computer Applications
                        %V 187
                        %N 32
                        %P 43-55
                        %R 10.5120/ijca2025925585
                        %I Foundation of Computer Science (FCS), NY, USA
Abstract

This study presents an in-depth parallel benchmarking analysis of the k-Means clustering algorithm on a Beowulf cluster composed of Raspberry Pi 4B nodes, each equipped with 8 GB of RAM. Leveraging MPI for distributed computation, it systematically evaluates the algorithm's strong-scaling behaviour using synthetic datasets of fixed size (75 million two-dimensional points) while varying the number of MPI processes from 2 up to 48 (two processes per node). The performance evaluation focuses on a detailed execution-time decomposition across the key phases: data generation, parallel distance computation (Compute Phase), synchronization via MPI_Allreduce (Sync Phase), centroid updates (Update Phase), and the overall clustering loop (k-Means Phase), together with total runtime. Results confirm that the Compute Phase remains the dominant contributor to total runtime, consistently accounting for the majority of execution time across all configurations. Synchronization overhead increases moderately at intermediate process counts, a typical phenomenon in distributed systems, but remains manageable and does not offset the overall speedup achieved through parallelization. The Beowulf cluster demonstrates excellent scalability and high parallel efficiency throughout the strong-scaling experiments, with total runtime reduced by nearly 10× when scaling from 2 to 48 MPI processes. Memory usage remains within physical RAM limits owing to careful dataset partitioning, enabling large-scale processing on low-power ARM-based nodes. Overall, this work highlights the feasibility and efficiency of CPU-centric, memory-aware distributed machine learning on energy-efficient Raspberry Pi clusters. The proposed benchmarking framework provides a robust and reproducible foundation for analysing algorithmic performance, scalability, and resource utilization in lightweight distributed environments, aligning with contemporary trends in edge computing and resource-constrained high-performance computing.
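The phase decomposition described in the abstract can be illustrated with a small, self-contained sketch. The snippet below is not the authors' benchmark code: it is a single-process Python approximation in which each MPI rank's partial sums and counts are computed locally over its data shard and then aggregated element-wise, standing in for what MPI_Allreduce would do across ranks. All names (`generate_points`, `kmeans_phases`) and parameters are illustrative, and the dataset is tiny rather than the paper's 75 million points.

```python
import random
import time

def generate_points(n, seed=0):
    """Generation Phase: synthetic 2-D points (75M in the paper; tiny here)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n)]

def kmeans_phases(points, k, ranks=4, iters=10):
    """One k-Means run, timed per phase, with the data split across
    'ranks' shards to mimic MPI domain decomposition."""
    chunk = (len(points) + ranks - 1) // ranks
    shards = [points[i * chunk:(i + 1) * chunk] for i in range(ranks)]
    centroids = points[:k]  # simple deterministic initialization
    timings = {"compute": 0.0, "sync": 0.0, "update": 0.0}
    for _ in range(iters):
        # Compute Phase: each "rank" assigns its points to the nearest centroid
        # and accumulates local (sum_x, sum_y, count) per cluster.
        t0 = time.perf_counter()
        partials = []
        for shard in shards:
            sums = [[0.0, 0.0, 0] for _ in range(k)]
            for (x, y) in shard:
                j = min(range(k), key=lambda c: (x - centroids[c][0]) ** 2
                                              + (y - centroids[c][1]) ** 2)
                sums[j][0] += x; sums[j][1] += y; sums[j][2] += 1
            partials.append(sums)
        timings["compute"] += time.perf_counter() - t0
        # Sync Phase: element-wise sum across ranks (stand-in for MPI_Allreduce).
        t0 = time.perf_counter()
        total = [[sum(p[j][i] for p in partials) for i in range(3)]
                 for j in range(k)]
        timings["sync"] += time.perf_counter() - t0
        # Update Phase: new centroids from the globally reduced sums.
        t0 = time.perf_counter()
        centroids = [(t[0] / t[2], t[1] / t[2]) if t[2] else centroids[j]
                     for j, t in enumerate(total)]
        timings["update"] += time.perf_counter() - t0
    return centroids, timings

pts = generate_points(2000)
centroids, timings = kmeans_phases(pts, k=3)
print({phase: round(sec, 4) for phase, sec in timings.items()})
```

Even at this toy scale the same qualitative pattern the paper reports tends to appear: the distance computation dominates the per-iteration cost, while the reduction and centroid update are comparatively cheap.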

References
  • Dimitrios Papakyriakou, Ioannis S. Barbounakis. Data Mining Methods: A Review. International Journal of Computer Applications. 183, 48 (Jan 2022), 5-19. DOI=10.5120/ijca2022921884
  • Raspberry Pi 4 Model B. [Online]. Available: https://www.raspberrypi.com/products/raspberry-pi-4-model-b/
  • Raspberry Pi 4 Model B specifications. [Online]. Available: https://magpi.raspberrypi.com/articles/raspberry-pi-4-specs-benchmarks
  • Aurelien, M. (2022). PEP 668 – Marking Python base environments as externally managed. Python Software Foundation. https://peps.python.org/pep-0668/
  • J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008
  • M. Zaharia et al., "Apache Spark: A unified engine for big data processing," Commun. ACM, vol. 59, no. 11, pp. 56–65, Nov. 2016
  • A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," arXiv preprint arXiv:1802.05799, 2018
  • Google, "Multi Worker Mirrored Strategy Guide," TensorFlow Docs, 2023
  • W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, 3rd ed., MIT Press, 2014
  • J. Dongarra et al., "High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems," Int. J. High Perform. Comput. Appl., vol. 30, no. 1, pp. 3–10, Feb. 2016
  • M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," Proc. 14th Python in Science Conference, 2015
  • Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association
  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16) (pp. 265–283). USENIX Association
  • Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS) Workshop
  • Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH. In Proceedings of the International Conference on Computational Science (ICCS 2005) (pp. 49–57). Springer
  • Dongarra, J., Beckman, P., Moore, T., et al. (2021). The International Exascale Software Project Roadmap. International Journal of High-Performance Computing Applications, 35(1), 3–60
  • Kogias, E., Christou, I. T., & Triantafyllidis, G. (2020). Distributed Machine Learning on Edge Devices: A Survey. IEEE Access, 8, 211309–211328
  • Mariani, L., Bartolini, A., Borghi, G., & Benini, L. (2022). Scalable Edge Machine Learning on Raspberry Pi Clusters. Future Generation Computer Systems, 128, 190–203
Index Terms
Computer Science
Information Sciences
Keywords

Raspberry Pi 4B, Beowulf Cluster, ARM Architecture, Parallel Computing, CPU-Bound Workload, k-Means Clustering, Message Passing Interface (MPI), MPICH, Memory-Conscious Scaling, Low-Cost Clusters, Synthetic Data Benchmarking, Execution Time Analysis, Distributed Systems, HPC Performance Evaluation
