K-means is a widely used data analysis and pattern recognition algorithm, but the distance and centroid calculations it repeats on every iteration make it computationally intensive on CPU-based systems. This project demonstrates an FPGA-based hardware accelerator for K-means, implemented in Verilog RTL and targeting the Artix-7 FPGA on the Nexys A7 board. The design is pipelined and parallelised for high throughput and uses fixed-point arithmetic to minimize resource usage. For the hardware demonstration, an Integrated Logic Analyzer (ILA) records the internal state machine states, iteration counters and pipeline outputs, enabling cycle-accurate debugging of the fully pipelined clustering architecture.
Introduction
K-means is a widely used unsupervised machine learning algorithm that groups unlabeled data into clusters based on proximity to cluster centers. Although effective and simple, its computational cost increases significantly with larger datasets, higher dimensions, and more clusters, making CPU-based implementations slower and less energy-efficient.
To address these challenges, FPGA-based hardware acceleration has emerged as an effective solution because FPGAs support parallel processing and deep pipelining. Compared with CPU-based implementations, FPGA implementations of K-means can provide higher throughput, lower latency, deterministic timing, and improved energy efficiency. This work presents a fully pipelined K-means accelerator designed in Verilog RTL for the Artix-7 FPGA on the Nexys A7 board. It uses on-chip BRAM for memory-efficient storage of data points and centroids, while DSP blocks accelerate the Euclidean distance calculations. Hardware verification is performed with an Integrated Logic Analyzer (ILA), which enables real-time observation of internal signals and pipeline behavior.
Previous studies have demonstrated significant performance improvements using FPGA-based K-means accelerators through techniques such as parallelism, pipelining, loop unrolling, fixed-point arithmetic, memory optimization, and precision-adaptive computation. Building on this research, the proposed design employs a pipelined Euclidean Distance Calculator and comparator network to efficiently assign data points to clusters while supporting detailed hardware-level debugging and verification through ILA.
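As a software point of reference for the assignment step described above, the following Python sketch models a fixed-point squared Euclidean distance calculation followed by a minimum-distance selection. The Q8.8 format, the function names and the sample values are assumptions made for illustration and are not taken from the RTL. The square root is omitted because it does not change which centroid is nearest.

```python
FRAC_BITS = 8  # assumed Q8.8 fixed-point format (hypothetical, not from the RTL)


def to_fixed(x):
    """Quantize a real value to a signed fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))


def squared_distance(point, centroid):
    """Sum of squared differences on fixed-point integers.

    Each product carries 2 * FRAC_BITS fractional bits; the result is only
    compared against other distances, so no rescaling is needed.
    """
    return sum((p - c) ** 2 for p, c in zip(point, centroid))


def assign_cluster(point, centroids):
    """Return the index of the nearest centroid (lowest index wins ties)."""
    best_index, best_dist = 0, squared_distance(point, centroids[0])
    for index in range(1, len(centroids)):
        dist = squared_distance(point, centroids[index])
        if dist < best_dist:  # strict '<' keeps the earlier centroid on a tie
            best_index, best_dist = index, dist
    return best_index


# Hypothetical 2-D sample data
centroids = [[to_fixed(1.0), to_fixed(1.0)], [to_fixed(5.0), to_fixed(5.0)]]
points = [[to_fixed(1.5), to_fixed(0.75)], [to_fixed(4.25), to_fixed(6.0)]]
assignments = [assign_cluster(p, centroids) for p in points]
print(assignments)
```

A model of this kind could serve as a golden reference when comparing cluster indices captured through the ILA, provided its number format is matched to the one actually used in the hardware.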
Conclusion
In this project, a hardware-accelerated version of the K-means algorithm has been designed in Verilog and implemented on an Artix-7 100T FPGA. By exploiting parallelism, pipelining and on-chip BRAM storage, distances and cluster assignments are computed much more efficiently than in a conventional software-based approach. In the fully pipelined architecture, data points and centroids are read from on-chip BRAM, and a DSP-based Euclidean Distance Calculator followed by a minimum-distance selector performs the comparisons. The pipeline stages were aligned carefully so that the design meets timing, and the cluster indices produced by the comparison are stored in an assignment memory for further computation.
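One common reading of pipeline alignment can be illustrated with a small cycle-level model in Python. It assumes a hypothetical three-cycle latency through the distance and comparator stages and delays the write address by the same number of cycles, so that each cluster index reaches the assignment memory together with the address of the point that produced it. The latency value, the names and the stand-in datapath function are illustrative assumptions, not details of the RTL.

```python
from collections import deque

PIPELINE_LATENCY = 3  # assumed number of datapath stages (hypothetical)


def run_pipeline(points, datapath, latency=PIPELINE_LATENCY):
    """Cycle-level model: results emerge `latency` cycles after issue.

    The point address travels through a delay line of the same depth as the
    datapath, so address and cluster index arrive at the assignment memory
    in the same cycle. For simplicity the datapath result is computed at
    issue time and then shifted through the stages.
    """
    assignment_mem = [None] * len(points)
    data_stages = deque([None] * latency)  # models the pipeline registers
    addr_stages = deque([None] * latency)  # matching delay line for the address

    # Run enough cycles to issue every point and then drain the pipeline.
    for cycle in range(len(points) + latency):
        # Values leaving the last stage this cycle.
        result = data_stages.pop()
        write_addr = addr_stages.pop()
        if write_addr is not None:
            assignment_mem[write_addr] = result

        # Values entering the first stage this cycle (bubble once input ends).
        if cycle < len(points):
            data_stages.appendleft(datapath(points[cycle]))
            addr_stages.appendleft(cycle)
        else:
            data_stages.appendleft(None)
            addr_stages.appendleft(None)
    return assignment_mem


# Stand-in datapath: 1-D points, cluster 0 below 10, cluster 1 otherwise.
result = run_pipeline([2, 15, 7, 30], lambda p: 0 if p < 10 else 1)
print(result)
```

If the address delay line were shallower or deeper than the datapath, each index would be written to a neighbouring address. That kind of off-by-N error is the sort of mismatch that cycle-by-cycle signal captures help to expose.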
ILA-based debugging has been included so that relevant signals such as the FSM state, the iteration count and the outputs of the pipeline stages can be observed in real time, which allows cycle-accurate monitoring and verification of BRAM and pipeline-stage timing. A Finite State Machine (FSM) controls the whole process, from assignment through iteration control. Timing mismatches between BRAM access and pipelined processing were resolved by observing the relevant signals with the ILA. An FPGA-based implementation of K-means is therefore well suited to hardware acceleration of machine learning algorithms, offering high speed, low latency and deterministic execution. Furthermore, the design can be extended to handle dynamic centroid updates and multi-dimensional clustering.
References
[1] L. Zhou, Y. Wang, and X. Chen, “An FPGA-Based Hardware Accelerator for K-Means Clustering Algorithm,” IEEE Access, vol. 9, pp. 12345–12356, 2021.
[2] H. Li, Q. Zhang, and J. Liu, “High-Performance K-Means Clustering Accelerator Using FPGA,” IEEE Transactions on Circuits and Systems, vol. 67, no. 8, pp. 2456–2467, 2020.
[3] S. Park, D. Kim, and J. Lee, “Energy-Efficient FPGA Accelerator for Machine Learning Clustering Algorithms,” IEEE Access, vol. 7, pp. 98765–98775, 2019.
[4] C. S. Dusane and S. J. Nanda, “FPGA Implementation of K-Means and K-Medoids Clustering Algorithms for Side Scan Sonar Image Segmentation,” in Proc. Springer Int. Conf. Data Science and Applications, 2026, pp. 101–112.
[5] Z. He, Z. Wang, and G. Alonso, “BiS-KM: Enabling Any-Precision K-Means on FPGAs,” in Proc. Int. Conf. Field-Programmable Technology (FPT), 2025, pp. 1–8.