We study Clustor, a compact clustering toolkit implemented in Rust and exposed to Python via a thin extension module. Clustor targets a pragmatic design point: implement classical clustering algorithms with a Python-first surface that accepts and returns NumPy arrays, while remaining minimal-dependency and performance-conscious. We present a formal specification of the Clustor API contracts, map each implemented algorithm to primary literature, and provide proof sketches for key theoretical properties (objective descent, termination, statistical consistency where applicable, and EM monotonicity). We also propose a rigorous experimental evaluation plan spanning synthetic and real benchmarks, and compare Clustor’s scope and trade-offs to mainstream clustering libraries. The code is available at https://github.com/alphavelocity/clustor.
Introduction
Clustor is a Rust-accelerated clustering toolkit that integrates with Python using PyO3 and maturin, designed to provide efficient implementations of classical unsupervised learning algorithms. Instead of introducing new clustering methods, Clustor focuses on high-performance engineering, predictable memory usage, and a stable API for common clustering tasks. The toolkit includes clear mathematical definitions, algorithm explanations, pseudocode, evaluation plans, and practical examples to help users understand and apply clustering techniques.
Clustering is a core unsupervised learning task used in exploratory data analysis, representation learning, and preprocessing for machine learning models. Clustor aims to provide essential clustering tools with low dependency requirements and reliable performance, using Rust for the computational kernels while remaining easy to use from Python.
The toolkit supports several clustering algorithms, including K-Means with KMeans++ initialization, MiniBatchKMeans for large datasets, Bisecting KMeans for hierarchical splitting, DBSCAN and OPTICS for density-based clustering, Affinity Propagation, BIRCH for streaming data, Gaussian Mixture Models (GMM) using Expectation–Maximization, and hierarchical clustering using Ward linkage. It also includes internal cluster validation metrics such as Silhouette score, Calinski–Harabasz index, and Davies–Bouldin index.
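To make the seeding step concrete, the following is a minimal NumPy sketch of k-means++ initialization as described by Arthur and Vassilvitskii [8]: each new center is drawn with probability proportional to its squared distance to the nearest center chosen so far. This is an illustration, not Clustor's API; the function name and signature are ours.

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: D^2-weighted sampling of k initial centers."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]          # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        probs = d2 / d2.sum()               # D^2 weighting
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)
```

Because already-chosen points have zero distance to themselves, they receive zero sampling weight, so the k centers are distinct data points whenever the data contains at least k distinct rows.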
The framework assumes data is stored as numerical matrices and supports distance metrics such as Euclidean distance and cosine distance. Theoretical properties of algorithms are discussed, including convergence of K-Means, stochastic optimization in MiniBatchKMeans, and likelihood maximization in GMMs.
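The objective-descent argument behind K-Means convergence can be checked empirically: each Lloyd iteration (assignment under squared Euclidean distance, then mean update) can only decrease the inertia, so the objective sequence is non-increasing and the algorithm terminates once a partition repeats [18]. The sketch below is illustrative and independent of Clustor's implementation.

```python
import numpy as np

def lloyd(X, centers, iters=20):
    """Plain Lloyd iterations; returns centers, labels, and the
    inertia recorded after each assignment step (non-increasing)."""
    history = []
    for _ in range(iters):
        # assignment step: nearest center under squared Euclidean distance
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        history.append(d2[np.arange(len(X)), labels].sum())
        # update step: each center moves to the mean of its points,
        # which minimizes the within-cluster sum of squares
        for j in range(centers.shape[0]):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels, history
```

Both steps are exact minimizations over a finite set of partitions, which is the standard termination argument; monitoring `history` is also a cheap correctness check for any K-Means implementation.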
An experimental evaluation plan is proposed using datasets such as Iris, MNIST, Fashion-MNIST, and synthetic benchmarks, comparing Clustor against reference implementations such as scikit-learn. Agreement with ground-truth labels is measured with external metrics such as ARI and NMI, alongside internal validation scores.
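As a reference point for the external metrics, the Adjusted Rand Index of Hubert and Arabie [16] can be computed directly from the contingency table of two labelings. The sketch below is a NumPy implementation of the standard formula, not Clustor's code, and is useful for sanity-checking benchmark scripts.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI: Rand index corrected for chance; 1 for identical
    partitions (up to label permutation), ~0 for random ones."""
    a = np.unique(np.asarray(labels_a), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b), return_inverse=True)[1]
    # contingency table of the two labelings
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(table, (a, b), 1)
    sum_comb = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_a * sum_b / comb(len(a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:       # degenerate case: single cluster
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

Because ARI is invariant to label permutation, it is the right metric when cluster IDs carry no meaning, which is the usual situation when comparing two clustering implementations.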
Although Clustor offers efficient implementations of classical clustering methods, it has limitations, including quadratic time or memory scaling in some algorithms and the absence of advanced spatial indexing and multi-threaded execution. Future improvements may include approximate nearest neighbor search, expanded GMM covariance models, broader API compatibility, and parallel processing.
Conclusion
Clustor provides a compact Rust-based toolkit for classical clustering workflows in Python [1]. By mapping each operator to primary literature and stating explicit mathematical and API contracts, we aim to make the library easier to audit, benchmark, and extend.
References
[1] Clustor (Python package) — project description and feature list. https://pypi.org/project/clustor/, accessed 2026-03-01
[2] Fashion-MNIST: A MNIST-like fashion product database. https://research.zalando.com/project/fashion_mnist/fashion_mnist/, accessed 2026-03-01
[3] maturin (pyo3/mixed rust–python packaging tool) — repository and documentation. https://github.com/PyO3/maturin, accessed 2026-03-01
[4] maturin user guide: Bindings. https://www.maturin.rs/bindings, accessed 2026-03-01
[5] PyO3: Rust bindings for the Python interpreter. https://pyo3.rs/main/, accessed 2026-03-01
[6] Iris dataset — UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/53/iris (1936; donated 1988). https://doi.org/10.24432/C56C76, accessed 2026-03-01
[7] Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD ’99). pp. 49–60. ACM (1999). https://doi.org/10.1145/304181.304187
[8] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07). pp. 1027–1035. SIAM (2007), https://research.google/pubs/k-means-the-advantages-of-careful-seeding/
[9] Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Communications in Statistics 3(1), 1–27 (1974). https://doi.org/10.1080/03610927408827101
[10] Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1(2), 224–227 (1979). https://doi.org/10.1109/TPAMI.1979.4766909
[11] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22 (1977). https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
[12] Elkan, C.: Using the triangle inequality to accelerate k-means. In: Proceedings of the 20th International Conference on Machine Learning (ICML 2003) (2003), https://www.aaai.org/Library/ICML/2003/icml03-022.php
[13] Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96). pp. 226–231 (1996), https://www.aaai.org/Library/KDD/1996/kdd96-037.php
[14] Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007). https://doi.org/10.1126/science.1136800
[15] Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. Journal of Intelligent Information Systems 17(2–3), 107–145 (2001). https://doi.org/10.1023/A:1012801612483
[16] Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2(1), 193–218 (1985). https://doi.org/10.1007/BF01908075
[17] LeCun, Y., Cortes, C., Burges, C.J.C.: The MNIST database of handwritten digits. https://yann.lecun.org/exdb/mnist/index.html, accessed 2026-03-01
[18] Lloyd, S.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
[19] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(85), 2825–2830 (2011), https://www.jmlr.org/papers/v12/pedregosa11a.html
[20] Pollard, D.: Strong consistency of k-means clustering. The Annals of Statistics 9(1), 135–140 (1981). https://doi.org/10.1214/aos/1176345339
[21] Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987). https://doi.org/10.1016/0377-0427(87)90125-7
[22] Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web (WWW 2010). pp. 1177–1178. ACM (2010). https://doi.org/10.1145/1772690.1772862
[23] Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11(95), 2837–2854 (2010), https://jmlr.org/beta/papers/v11/vinh10a.html
[24] Ward, J.H.: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58(301), 236–244 (1963). https://doi.org/10.1080/01621459.1963.10500845
[25] Wu, C.F.J.: On the convergence properties of the EM algorithm. The Annals of Statistics 11(1), 95–103 (1983)
[26] Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD ’96). pp. 103–114. ACM (1996). https://doi.org/10.1145/233269.233324
[27] Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery (1997), https://research.ibm.com/publications/birch-a-new-data-clustering-algorithm-and-its-applications