Research Article

Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation

by Dimitrios Papakyriakou, Ioannis S. Barbounakis
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Issue 32
Published: August 2025
DOI: 10.5120/ijca2025925585

Dimitrios Papakyriakou, Ioannis S. Barbounakis. Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation. International Journal of Computer Applications. 187, 32 (August 2025), 43-55. DOI=10.5120/ijca2025925585

                        @article{10.5120/ijca2025925585,
                          author    = { Dimitrios Papakyriakou and Ioannis S. Barbounakis },
                          title     = { Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation },
                          journal   = { International Journal of Computer Applications },
                          year      = { 2025 },
                          volume    = { 187 },
                          number    = { 32 },
                          pages     = { 43-55 },
                          doi       = { 10.5120/ijca2025925585 },
                          publisher = { Foundation of Computer Science (FCS), NY, USA }
                        }
                        %0 Journal Article
                        %D 2025
                        %A Dimitrios Papakyriakou
                        %A Ioannis S. Barbounakis
                        %T Parallel k-Means Benchmarking on a CPU-Bound Beowulf Cluster of Raspberry Pi Nodes: An MPI-Based Scaling Analysis with CPU-Centric Performance Evaluation
                        %J International Journal of Computer Applications
                        %V 187
                        %N 32
                        %P 43-55
                        %R 10.5120/ijca2025925585
                        %I Foundation of Computer Science (FCS), NY, USA
Abstract

This study presents an in-depth parallel benchmarking analysis of the k-Means clustering algorithm on a Beowulf cluster composed of Raspberry Pi 4B nodes, each equipped with 8 GB of RAM. Leveraging MPI for distributed computation, it systematically evaluates the algorithm's strong-scaling behaviour using synthetic datasets of fixed size (75 million two-dimensional points) while varying the number of MPI processes from 2 up to 48 (two processes per node). The performance evaluation focuses on a detailed execution-time decomposition across the key phases: data generation, parallel distance computation (Compute Phase), synchronization via MPI_Allreduce (Sync Phase), centroid updates (Update Phase), and the overall clustering loop (k-Means Phase), together with total runtime. Results confirm that the Compute Phase remains the dominant contributor to total runtime, consistently accounting for the majority of execution time across all configurations. Synchronization overhead increases moderately at intermediate process counts, a typical phenomenon in distributed systems, but remains manageable and does not offset the overall speedup achieved through parallelization. The Beowulf cluster demonstrates excellent scalability and high parallel efficiency throughout the strong-scaling experiments, with total runtime reduced by nearly 10× when scaling from 2 to 48 MPI processes. Memory usage remains within physical RAM limits owing to careful dataset partitioning, enabling large-scale processing on low-power ARM-based nodes. Overall, this work highlights the feasibility and efficiency of CPU-centric, memory-aware distributed machine learning on energy-efficient Raspberry Pi clusters. The proposed benchmarking framework provides a robust and reproducible foundation for analysing algorithmic performance, scalability, and resource utilization in lightweight distributed environments, aligning with contemporary trends in edge computing and resource-constrained high-performance computing.
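The phase decomposition described in the abstract can be illustrated with a small, self-contained sketch. The snippet below is not the authors' benchmark code: it is a single-process Python approximation in which each MPI rank's partial sums and counts are computed locally over its data shard and then aggregated element-wise, standing in for what MPI_Allreduce would do across ranks. All names (`generate_points`, `kmeans_phases`) and parameters are illustrative, and the dataset is tiny rather than the paper's 75 million points.

```python
import random
import time

def generate_points(n, seed=0):
    """Generation Phase: synthetic 2-D points (75M in the paper; tiny here)."""
    rng = random.Random(seed)
    return [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(n)]

def kmeans_phases(points, k, ranks=4, iters=10):
    """One k-Means run, timed per phase, with the data split across
    'ranks' shards to mimic MPI domain decomposition."""
    chunk = (len(points) + ranks - 1) // ranks
    shards = [points[i * chunk:(i + 1) * chunk] for i in range(ranks)]
    centroids = points[:k]  # simple deterministic initialization
    timings = {"compute": 0.0, "sync": 0.0, "update": 0.0}
    for _ in range(iters):
        # Compute Phase: each "rank" assigns its points to the nearest centroid
        # and accumulates local (sum_x, sum_y, count) per cluster.
        t0 = time.perf_counter()
        partials = []
        for shard in shards:
            sums = [[0.0, 0.0, 0] for _ in range(k)]
            for (x, y) in shard:
                j = min(range(k), key=lambda c: (x - centroids[c][0]) ** 2
                                              + (y - centroids[c][1]) ** 2)
                sums[j][0] += x; sums[j][1] += y; sums[j][2] += 1
            partials.append(sums)
        timings["compute"] += time.perf_counter() - t0
        # Sync Phase: element-wise sum across ranks (stand-in for MPI_Allreduce).
        t0 = time.perf_counter()
        total = [[sum(p[j][i] for p in partials) for i in range(3)]
                 for j in range(k)]
        timings["sync"] += time.perf_counter() - t0
        # Update Phase: new centroids from the globally reduced sums.
        t0 = time.perf_counter()
        centroids = [(t[0] / t[2], t[1] / t[2]) if t[2] else centroids[j]
                     for j, t in enumerate(total)]
        timings["update"] += time.perf_counter() - t0
    return centroids, timings

pts = generate_points(2000)
centroids, timings = kmeans_phases(pts, k=3)
print({phase: round(sec, 4) for phase, sec in timings.items()})
```

Even at this toy scale the same qualitative pattern the paper reports tends to appear: the distance computation dominates the per-iteration cost, while the reduction and centroid update are comparatively cheap.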

References
  • Dimitrios Papakyriakou, Ioannis S. Barbounakis. Data Mining Methods: A Review. International Journal of Computer Applications. 183, 48 (Jan 2022), 5-19. DOI=10.5120/ijca2022921884
  • Raspberry Pi 4 Model B. [Online]. Available: https://www.raspberrypi.com/products/raspberry-pi-4-model-b/
  • Raspberry Pi 4 Model B specifications. [Online]. Available: https://magpi.raspberrypi.com/articles/raspberry-pi-4-specs-benchmarks
  • Aurelien, M. (2022). PEP 668 – Marking Python base environments as externally managed. Python Software Foundation. https://peps.python.org/pep-0668/
  • J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008
  • M. Zaharia et al., "Apache Spark: A unified engine for big data processing," Commun. ACM, vol. 59, no. 11, pp. 56–65, Nov. 2016
  • A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," arXiv preprint arXiv:1802.05799, 2018
  • Google, "Multi Worker Mirrored Strategy Guide," TensorFlow Docs, 2023
  • W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, 3rd ed., MIT Press, 2014
  • J. Dongarra et al., "High-performance conjugate-gradient benchmark: A new metric for ranking high-performance computing systems," Int. J. High Perform. Comput. Appl., vol. 30, no. 1, pp. 3–10, Feb. 2016
  • M. Rocklin, "Dask: Parallel computation with blocked algorithms and task scheduling," Proc. 14th Python in Science Conference, 2015
  • Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud'10). USENIX Association
  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Zheng, X. (2016). TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16) (pp. 265–283). USENIX Association
  • Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS) Workshop
  • Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of collective communication operations in MPICH. In Proceedings of the International Conference on Computational Science (ICCS 2005) (pp. 49–57). Springer
  • Dongarra, J., Beckman, P., Moore, T., et al. (2021). The International Exascale Software Project Roadmap. International Journal of High-Performance Computing Applications, 35(1), 3–60
  • Kogias, E., Christou, I. T., & Triantafyllidis, G. (2020). Distributed Machine Learning on Edge Devices: A Survey. IEEE Access, 8, 211309–211328
  • Mariani, L., Bartolini, A., Borghi, G., & Benini, L. (2022). Scalable Edge Machine Learning on Raspberry Pi Clusters. Future Generation Computer Systems, 128, 190–203
Index Terms
Computer Science
Information Sciences
Keywords

Raspberry Pi 4B, Beowulf Cluster, ARM Architecture, Parallel Computing, CPU-Bound Workload, k-Means Clustering, Message Passing Interface (MPI), MPICH, Memory-Conscious Scaling, Low-Cost Clusters, Synthetic Data Benchmarking, Execution Time Analysis, Distributed Systems, HPC Performance Evaluation
