The increasing volume of digital content in multiple languages has created a strong need for intelligent systems that can organize and retrieve multilingual documents efficiently. This project introduces a comprehensive pipeline for clustering and semantic search of multilingual text documents, supporting English, Hindi, and Telugu. The system begins by accepting PDF documents and identifying their language using the langdetect library. This is followed by language-specific preprocessing, including Unicode normalization, sentence tokenization, punctuation removal, stopword elimination, and lemmatization (for English). After preprocessing, the cleaned texts are transformed into semantic embeddings using the paraphrase-multilingual-MiniLM-L12-v2 model from Sentence Transformers. These embeddings are then grouped with Agglomerative Clustering based on cosine distance. The clustered results are projected onto a two-dimensional space using UMAP for visualization and further analyzed using cosine similarity heatmaps. Complementing clustering, the system provides a semantic search module that retrieves the top-ranked documents across languages using cosine similarity between query and document embeddings. The system’s effectiveness is demonstrated through metrics evaluating both language detection accuracy and clustering performance, supported by visualization techniques.
Introduction
In a multilingual digital environment like India’s, analyzing documents in multiple languages (English, Hindi, Telugu) is challenging due to the limitations of traditional monolingual systems. This project introduces an unsupervised system for clustering and semantic search of multilingual documents without relying on labeled data.
System Design
1. Language Detection
Uses the langdetect library to identify the document’s language (EN, HI, TE).
Achieved 100% accuracy across 15 test documents.
2. Preprocessing
Performed language-specific text cleaning:
Sentence segmentation
Tokenization
Stopword removal
Lowercasing
Lemmatization (for English)
Uses NLTK (English) and the Indic NLP Library (Hindi, Telugu).
3. Semantic Embedding
Utilizes multilingual Sentence-BERT (paraphrase-multilingual-MiniLM-L12-v2) to generate 384-dimensional embeddings for documents.
No reliance on translations or annotated datasets.
Adaptable for education, governance, and content management in multilingual settings.
Conclusion
This project successfully developed a multilingual system for clustering and semantic search of text documents in English, Hindi, and Telugu. The primary goal was to enable meaningful grouping and retrieval of documents across languages using a robust, language-aware pipeline. The first phase of the pipeline focused on language detection, which achieved 100% accuracy, correctly identifying all 15 test documents (5 each in EN, HI, and TE). This step ensured that subsequent processing was language-specific and tailored to the document content. In the second phase, the documents within each language were semantically embedded and grouped using Agglomerative Clustering. Clustering quality was evaluated with the Silhouette Score, which ranged from 0.250 to 0.409:
1) English documents: 0.336
2) Hindi documents: 0.250
3) Telugu documents: 0.409
These values indicate moderately meaningful clustering, with the Telugu documents showing the strongest intra-cluster cohesion. Visualizations using UMAP confirmed the spatial separation of clusters, while cosine similarity heatmaps corroborated the content similarity between documents within each cluster. Finally, a semantic search engine was integrated into the system. For a sample query such as "operating system concepts", the top 3 retrieved documents spanned English and Telugu, with cosine similarity scores of 0.5933, 0.5628, and 0.4968, respectively. This demonstrated the model's ability to find semantically relevant documents across different languages, leveraging Sentence-BERT embeddings for meaningful comparison.
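The retrieval step described above reduces to ranking documents by cosine similarity between the query embedding and each document embedding. A minimal sketch with plain NumPy is shown below; in the pipeline both sides would be encoded with the same multilingual Sentence-BERT model, and the toy 4-dimensional vectors stand in for 384-dimensional embeddings:

```python
# Minimal sketch of top-k retrieval by cosine similarity.
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(top_k(query, docs))  # most similar document first
```

Because cosine similarity is computed on normalized vectors, the ranking is insensitive to document length, which is what allows short and long documents in different languages to be compared on equal footing.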
References
[1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
[2] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
[3] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
[4] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[5] langdetect – Language Detection Library in Python. https://pypi.org/project/langdetect/
[6] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media Inc. https://www.nltk.org
[7] Agglomerative Clustering — scikit-learn Documentation.
[8] NPTEL Online Courses – Video Transcript Dataset Source. https://nptel.ac.in
[9] Kunchukuttan, A. Indic NLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library