Embedding techniques are a major advance in Natural Language Processing (NLP): they convert plain text into numerical representations. This paper reviews word and sentence embedding techniques. It covers traditional approaches such as one-hot encoding, count-based models, and TF-IDF, as well as more recent embedding models such as Word2Vec, GloVe, BERT, and other transformer-based models. Word embeddings are categorized as frequency-based or prediction-based, and sentence embeddings are treated in a separate section. The paper also presents a comparative analysis of the architecture, performance, type, and semantic capabilities of the models.
Embeddings are of little use on their own, so semantic similarity measures are applied to compare them. This paper explains several such measures, including cosine similarity, Jaccard similarity, and distance-based metrics. It also discusses the two main evaluation methods for embeddings, intrinsic and extrinsic. The evolution from static to contextual to large language model-based embeddings is presented. A case study using Punjabi sentences shows how embeddings (numeric representations of sentences) capture the semantic similarity between texts; the representations are presented in a table and images for better understanding. The paper further discusses recent embedding techniques, especially multilingual and cross-lingual models such as mBERT, LaBSE, and MiniLM, and tabulates their type, architecture, key features, strengths, and limitations. The semantic similarity between sentences under the various models is also reported, with each entry listing the method or model name, the dataset used, the evaluation metric, and the similarity score. A further analysis covers sentence-level embedding approaches available for low-resource languages, especially Punjabi. It addresses the challenge of limited datasets, the effectiveness of pretrained models, and fine-tuning strategies for such models. The paper concludes by presenting the main findings related to word and sentence embeddings.
Introduction
This paper introduces Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL), explaining their relationship: AI is the broad field, ML is a subset of AI, and DL is a subset of ML. AI enables computers to perform tasks that typically require human intelligence and is widely used in healthcare, finance, transportation, agriculture, education, and industry.
The paper focuses on how Deep Learning processes text data. Since deep learning models operate only on numerical input, text must first be converted into numerical representations called vectors or embeddings. These embeddings let computers represent the meaning of words and sentences and measure the semantic similarity between texts.
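As a minimal sketch of how similarity between such representations can be measured, the following Python example implements the three kinds of measure this paper names: cosine similarity, a distance-based metric (Euclidean distance is assumed here as a representative), and Jaccard similarity over token sets. The toy vectors and sentences are invented for illustration and do not come from any model or dataset discussed in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the shared token set divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(set_a | set_b)


# Toy 3-dimensional "embeddings", invented for illustration only.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a, twice the length
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
print(euclidean_distance(emb_a, emb_b))
print(jaccard_similarity("the cat sat".split(), "the cat ran".split()))
```

Cosine similarity ignores vector length, so `emb_a` and `emb_b` count as maximally similar even though their Euclidean distance is non-zero. Jaccard similarity works on surface tokens rather than embeddings, so it cannot see that two different words may mean the same thing.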
The paper discusses both traditional and modern methods for generating embeddings:
Traditional approaches: One-Hot Encoding, Count Vectors, TF-IDF, and Co-occurrence Matrices.
Prediction-based approaches: Word2Vec, GloVe, and modern multilingual models.
It explains the limitations of One-Hot Encoding, such as high dimensionality, sparse vectors, and inability to capture semantic relationships between words. To overcome these issues, word embeddings represent words in a continuous semantic space where similar words have similar vector representations.
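The limitations described above can be seen in a small Python sketch. The sentence and the helper names (`build_vocab`, `one_hot`, `dot`) are hypothetical and chosen only for illustration; the point is that each vector is as long as the vocabulary, contains a single 1, and has a dot product of zero with every other word's vector.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a vector of vocabulary size with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy corpus, invented for illustration only.
tokens = "the king and the queen greet the farmer".split()
vocab = build_vocab(tokens)

king = one_hot("king", vocab)
queen = one_hot("queen", vocab)
farmer = one_hot("farmer", vocab)

print(len(vocab), king)
print(dot(king, queen), dot(king, farmer))
```

Because the dot product between any two distinct one-hot vectors is zero, "king" is exactly as unrelated to "queen" as it is to "farmer". The vector length also grows with the vocabulary, which is the dimensionality and sparsity problem that dense embeddings are meant to address.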
The paper further explores:
Frequency-based methods such as Count Vectors and TF-IDF.
Prediction-based methods like Word2Vec, including its two architectures:
Continuous Bag of Words (CBOW): Predicts a target word from its context.
Skip-Gram: Predicts surrounding context words from a target word.
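The difference between the two Word2Vec architectures is easiest to see in the training examples each one is built from. The Python sketch below only generates those examples from a toy sentence; it does not train a model. The function names and the `window` parameter are assumptions made for illustration and are not taken from the original Word2Vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram view: one (target, context_word) pair per neighbor in the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW view: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip positions with no neighbors (single-word input)
            examples.append((context, target))
    return examples


# Toy sentence, invented for illustration only.
sentence = "the cat sat on the mat".split()

for target, context_word in skipgram_pairs(sentence, window=1):
    print("skip-gram: input", target, "-> predict", context_word)

for context, target in cbow_examples(sentence, window=1):
    print("CBOW: input", context, "-> predict", target)
```

Skip-Gram turns each position into several single-word prediction tasks, whereas CBOW turns it into one task whose input is the whole context window. In an actual model, the input side would be looked up in an embedding matrix and the prediction would be scored against the vocabulary.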
A case study using Punjabi sentences demonstrates how embeddings capture semantic similarity between texts. The paper also examines modern multilingual and cross-lingual embedding models such as:
mBERT
LaBSE
MiniLM
These models are evaluated based on architecture, features, strengths, limitations, datasets, and similarity scores. Special attention is given to sentence embedding techniques for low-resource languages such as Punjabi.
The paper concludes that text embeddings are essential for modern Natural Language Processing (NLP) applications, including machine translation, summarization, sentiment analysis, paraphrase detection, and question answering. Although multilingual and cross-lingual models have improved text representation significantly, challenges remain, particularly for low-resource languages, where datasets are limited and the effectiveness of pretrained models is less certain.
Conclusion
Embeddings play an important role in semantic similarity tasks such as sentiment analysis, question answering, text retrieval, and paraphrase detection. This paper presented a review of word and sentence embedding approaches, from traditional techniques to modern deep learning-based ones. The traditional approaches, namely one-hot encoding, count vectors, and TF-IDF, are the baseline methods for converting text into vectors, but they suffer from high dimensionality and a lack of semantic understanding. The paper then presented prediction-based models such as Word2Vec and GloVe, which were a major advance in the field of NLP. Because these models produce dense vectors, they capture semantic relationships in the text. The study further discussed sentence embedding approaches, explaining models such as RNN-based Seq2Seq, Universal Sentence Encoder, Sentence-BERT, and SimCSE, which can capture contextual and semantic relationships at the sentence level. The comparative analysis and evaluation metrics indicated that transformer-based and contrastive learning approaches significantly outperform traditional methods in semantic similarity tasks. The paper also included a practical example and a case study using Punjabi sentences, which focused on the process of transforming raw text into embeddings. The analysis of models showed that traditional models are less powerful at obtaining semantic representations and face issues such as high dimensionality, whereas the multilingual pretrained models mBERT, MiniLM, and LaBSE preserve semantic representations better. The section on recent trends showed that a variety of models are available for sentence embeddings. The latest models perform well on various NLP tasks, and fine-tuning can further improve their results.
As the study showed, traditional embedding approaches are useful as baseline techniques, whereas transformer-based and multilingual models provide an effective way to capture semantic relationships. Such models are also well suited to low-resource languages such as Punjabi. Future directions include the development of scalable, efficient, and context-aware embedding models for low-resource and multilingual applications.