The increasing volume of digital content in multiple languages has created a strong need for intelligent systems that can organize and retrieve multilingual documents efficiently. This project introduces a comprehensive pipeline for clustering and semantic search of multilingual text documents, supporting English, Hindi, and Telugu. The system begins by accepting PDF documents and identifying their language using the langdetect library. This is followed by language-specific preprocessing, including Unicode normalization, sentence tokenization, punctuation removal, stopword elimination, and lemmatization (for English). After preprocessing, the cleaned texts are transformed into semantic embeddings using the paraphrase-multilingual-MiniLM-L12-v2 model from Sentence Transformers. These embeddings are then grouped with Agglomerative Clustering based on cosine distance. The clustered results are projected onto a two-dimensional space using UMAP for visualization and further analyzed using cosine similarity heatmaps. Complementing clustering, the system provides a semantic search module that retrieves the top-ranked documents across languages using cosine similarity between query and document embeddings. The system’s effectiveness is demonstrated through metrics evaluating both language detection accuracy and clustering performance, supported by visualization techniques.
Introduction
In a multilingual digital environment like India’s, analyzing documents in multiple languages (English, Hindi, Telugu) is challenging due to the limitations of traditional monolingual systems. This project introduces an unsupervised system for clustering and semantic search of multilingual documents without relying on labeled data.
System Design
1. Language Detection
Uses the langdetect library to identify the document’s language (EN, HI, TE).
Achieved 100% accuracy across 15 test documents.
2. Preprocessing
Performed language-specific text cleaning:
Sentence segmentation
Tokenization
Stopword removal
Lowercasing
Lemmatization (for English)
Uses NLTK (English) and the Indic NLP Library (Hindi, Telugu).
3. Semantic Embedding
Utilizes multilingual Sentence-BERT (paraphrase-multilingual-MiniLM-L12-v2) to generate 384-dimensional embeddings for documents.
No reliance on translations or annotated datasets.
Adaptable for education, governance, and content management in multilingual settings.
Conclusion
This project successfully developed a multilingual system for clustering and semantic search of text documents in English, Hindi, and Telugu. The primary goal was to enable meaningful grouping and retrieval of documents across languages using a robust, language-aware pipeline. The first phase of the pipeline focused on language detection, which achieved 100% accuracy, correctly identifying all 15 test documents (5 each in EN, HI, and TE). This step ensured that subsequent processing was language-specific and tailored to the document content. In the second phase, the documents within each language were semantically embedded and grouped using Agglomerative Clustering. Clustering quality was evaluated with the Silhouette Score, which ranged from 0.250 to 0.409:
1) English documents: 0.336
2) Hindi documents: 0.250
3) Telugu documents: 0.409
These values indicate moderately meaningful clustering, with the Telugu documents showing the strongest intra-cluster cohesion. Visualizations using UMAP confirmed the spatial separation of clusters, while cosine similarity heatmaps corroborated the content similarity between documents within each cluster. Finally, a semantic search engine was integrated into the system. For a sample query such as "operating system concepts", the top 3 retrieved documents spanned English and Telugu, with cosine similarity scores of 0.5933, 0.5628, and 0.4968, respectively. This demonstrated the model's ability to find semantically relevant documents across different languages, leveraging Sentence-BERT embeddings for meaningful comparison.
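The retrieval step described above reduces to ranking documents by cosine similarity between the query embedding and each document embedding. A minimal sketch with plain NumPy is shown below; in the pipeline both sides would be encoded with the same multilingual Sentence-BERT model, and the toy 4-dimensional vectors stand in for 384-dimensional embeddings:

```python
# Minimal sketch of top-k retrieval by cosine similarity.
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
print(top_k(query, docs))  # most similar document first
```

Because cosine similarity is computed on normalized vectors, the ranking is insensitive to document length, which is what allows short and long documents in different languages to be compared on equal footing.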
References
[1] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv preprint arXiv:1908.10084.
[2] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of NAACL-HLT.
[3] McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
[4] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[5] langdetect – Language Detection Library in Python. https://pypi.org/project/langdetect/
[6] Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media Inc. https://www.nltk.org
[7] Agglomerative Clustering — scikit-learn Documentation.
[8] NPTEL Online Courses – Video Transcript Dataset Source. https://nptel.ac.in
[9] Kunchukuttan, A. Indic NLP Library. https://github.com/anoopkunchukuttan/indic_nlp_library