Clustering is a fundamental unsupervised learning problem, essential for understanding the intrinsic structure of high-dimensional data. Traditional clustering methods such as K-means assume that clusters are roughly globular and well separated under Euclidean distance. However, many real-world high-dimensional datasets lie on or near low-dimensional manifolds, and clustering them requires methods that respect the underlying manifold structure. This paper introduces Linear Manifold Clustering (LMC), an approach that assumes the data points lie on or near a linear submanifold of the higher-dimensional space. By combining techniques from manifold learning and linear algebra, LMC incorporates the geometric properties of the data into the clustering step. In experiments on synthetic and real-world datasets, LMC outperforms traditional clustering algorithms in both clustering accuracy and computational efficiency.
Introduction
Clustering is a key machine learning technique for finding hidden patterns in data, but traditional methods such as K-means and DBSCAN struggle with high-dimensional data that lie on low-dimensional manifolds rather than forming compact, well-separated clusters. This paper introduces Linear Manifold Clustering (LMC), which assumes that the data lie on or near a linear submanifold of a high-dimensional space. LMC first uses Principal Component Analysis (PCA) to project the data onto a lower-dimensional linear manifold, then applies a clustering algorithm (e.g., K-means) to this reduced representation.
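The two-step pipeline described above (PCA projection followed by K-means) can be illustrated with a minimal NumPy sketch. This is not the implementation used in the paper: the function names `pca_project`, `kmeans`, and `lmc_sketch`, the random centroid initialization, and the choice of `n_components` are all assumptions made for illustration. In practice, a library implementation of PCA and K-means (for example, from scikit-learn) would likely be used instead.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal directions (via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; initial centroids are random data points."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


def lmc_sketch(X, n_components, k, seed=0):
    """Hypothetical LMC-style pipeline: PCA projection, then K-means."""
    Z = pca_project(X, n_components)
    labels, _ = kmeans(Z, k, seed=seed)
    return labels


# Synthetic example (assumed setup): two clusters that live in a
# 2-dimensional linear subspace embedded in a 50-dimensional space.
rng = np.random.default_rng(42)
basis, _ = np.linalg.qr(rng.normal(size=(50, 2)))  # orthonormal 50x2 basis
latent = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])
X = latent @ basis.T + 0.01 * rng.normal(size=(200, 50))

labels = lmc_sketch(X, n_components=2, k=2)
print(np.bincount(labels))  # number of points assigned to each cluster
```

Running K-means on the projected coordinates rather than on the original 50-dimensional points is what reduces the per-iteration distance cost, which is consistent with the efficiency argument made for LMC.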
Unlike prior manifold-based methods that focus mainly on dimensionality reduction, LMC integrates manifold learning with clustering, improving both cluster quality and computational efficiency.
Experiments on synthetic and real-world datasets (MNIST, CIFAR-10) show that LMC outperforms traditional clustering methods in clustering quality (measured by Adjusted Rand Index and Silhouette Score) and runs faster than pipelines that combine manifold learning with clustering, such as Spectral Clustering and Isomap followed by K-means.
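For reference, the Adjusted Rand Index mentioned above compares a predicted labeling with a ground-truth labeling by counting pairs of points, corrected for chance agreement. The following is a minimal pure-Python sketch of the standard pair-counting formula; the function name `adjusted_rand_index` is chosen here for illustration and is not taken from the paper's evaluation code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same points."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: treat as perfect agreement

    # Contingency counts: how many points share a (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# The score is invariant to how the cluster labels are named.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

A score of 1.0 indicates identical partitions, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance. Unlike the Silhouette Score, this measure requires ground-truth labels.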
Conclusion
We propose Linear Manifold Clustering (LMC), a new clustering approach that utilizes manifold learning to improve clustering in high-dimensional datasets. Our experiments demonstrate that LMC outperforms traditional methods in terms of both clustering quality and computational efficiency. Future work will focus on extending LMC to non-linear manifolds and applying it to larger, more complex datasets.
References
[1] Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319-2323.
[2] Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
[3] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.
[4] Rokach, L., & Maimon, O. (2005). Clustering methods. In Data Mining and Knowledge Discovery Handbook.
[5] Sarmah, S. J., & Bhattacharyya, D. K. (2010). An effective technique for clustering incremental gene expression data. IJCSI International Journal of Computer Science Issues, 7(3), No. 3.
[6] Ezugwu, A. E., et al. (2022). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence. Online ISSN: 1873-6769.
[7] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD).