AI Based Real Time Transcription

Authors: I. Eswari, M. Riyasudeen, G. Thirisan, T. Deepak, H. Salim Mohamed Mahadeer

DOI Link: https://doi.org/10.22214/ijraset.2026.79873

Abstract

Human beings rely on communication to express their feelings and ideas and to resolve disputes. A major component of effective communication is language, which can take different forms, including written symbols, gestures, and vocalizations. Communication usually requires all parties to be fully conversant in a common language. To date, however, this has not been the case between speech-impaired people who use sign language and people who use spoken languages. A number of studies have pointed out significant gaps between these two groups, which limit the ease of communication. This study therefore aims to develop an efficient deep learning model that can predict British Sign Language, in an attempt to narrow the communication gap between speech-impaired and non-speech-impaired people in the community. Two models were developed in this research, a CNN and an LSTM, and their performance was evaluated using a multi-class confusion matrix. The CNN model achieved the highest performance, attaining training and testing accuracies of 98.8% and 97.4%, respectively. In addition, the model achieved average weighted precision and recall of 97% and 96%, respectively. The LSTM model, on the other hand, performed poorly, with maximum training and testing accuracies of 49.4% and 48.7%, respectively. Our research concluded that the CNN model was the better of the two for recognizing British Sign Language.

Introduction

Sign language is a fully developed natural language used by deaf and hard-of-hearing people, combining hand gestures, facial expressions, and body movements. Despite its rich linguistic structure, communication barriers persist between signers and hearing individuals, creating challenges in education, healthcare, and daily interactions. The paper focuses on developing a deep learning-based system to translate British Sign Language (BSL) into text using CNN and LSTM models, trained through both pre-collected datasets and real-time webcam input using computer vision.

Existing related work shows progress in sign-to-speech systems, speech-to-text optimization, and emotional communication integration, but also highlights limitations such as misinterpretation of facial expressions, ASR errors, and computational inefficiency in advanced learning methods like deep reinforcement learning.

The proposed system processes sign language through stages including character prediction, word formation, suggestion generation, sentence building, and final text output. It compares CNN and LSTM models to identify the most effective approach for real-time translation. The core problem addressed is the lack of scalable, accurate, and real-time sign language translation systems that can replace human interpreters and reduce communication gaps.
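The post-prediction stages listed above (word formation, suggestion generation, sentence building) can be illustrated with a minimal sketch. It assumes the classifier already emits one predicted character per recognized sign and that a space token marks the end of a word. The vocabulary, the function names, and the space-token convention are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List

# Hypothetical vocabulary, used only for this illustration.
VOCABULARY = ["HELLO", "HELP", "HOW", "ARE", "YOU"]


def suggest(prefix: str, vocabulary: Iterable[str] = VOCABULARY,
            limit: int = 3) -> List[str]:
    """Return up to `limit` vocabulary words that start with `prefix`."""
    if not prefix:
        return []
    return [word for word in vocabulary if word.startswith(prefix)][:limit]


def build_sentence(predicted_chars: Iterable[str],
                   space_token: str = " ") -> str:
    """Group a stream of predicted characters into words, then a sentence.

    `space_token` is an assumed marker that the signer finished a word.
    Repeated space tokens are collapsed so no empty words are produced.
    """
    words: List[str] = []
    current = ""
    for ch in predicted_chars:
        if ch == space_token:
            if current:
                words.append(current)
                current = ""
        else:
            current += ch
    if current:  # flush the last, unterminated word
        words.append(current)
    return " ".join(words)
```

In use, `suggest` would be called on the partially formed word after each new character, while `build_sentence` produces the final text output once the character stream ends.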

Conclusion

The primary approach of this research was to develop two deep learning models, a long short-term memory (LSTM) model and a convolutional neural network (CNN) model, and compare their performance. The experiment involved collecting the required datasets, using them to develop the models, training and testing the two models, and applying a multi-class confusion matrix to evaluate their performance. The parameters used for the comparison included training and testing accuracy and the systems' respective precision and reliability/consistency in predicting sign language. The approach was divided into two categories: the first used pre-processed data to predict hand gestures for British numerical sign language, and the second used a key points dataset to indicate simple common messages (facial expressions combined with pose signs).

In the first approach, both the CNN and LSTM models were developed. The CNN model showed the best performance in all aspects, including accuracy, precision, and reliability, as stated in the research hypothesis. Furthermore, this model showed a positive correlation between training/testing accuracy and the length of the training period, as determined by the number of iterations and images per dataset, which allowed it to attain high accuracy. A larger number of signs and more iterations could be applied to increase the training and testing accuracy; the model could then accommodate more than one type of sign language, making it more broadly useful.

The LSTM model, on the other hand, showed very poor performance in both categories of the experimental approach. It attained very low accuracy, precision, and consistency in predicting the correct sign, based on the multi-class confusion matrix. The most reasonable explanation for this poor performance lies in certain limitations pointed out in the literature review. For instance, an LSTM can be difficult to train because its computation is memory-bandwidth-bound, which runs into hardware limitations. LSTM models also depend on more complex frameworks than CNN models to achieve good performance. This research found that while LSTM models are better at classifying text data, they may need more input parameters for image datasets. The performance of the LSTM model could be improved by integrating it with other models to offset its limitations.

It was not possible to develop a CNN model in the second approach, as the dataset used was incompatible with the requirements of a CNN. However, based on the results of the first approach as well as on the literature review, it can be assumed that the CNN model would have performed well in the second approach. Therefore, this research concludes that convolutional neural networks perform better than LSTM models in recognizing and predicting British Sign Language. It further concludes that the CNN model could be extended to recognize more than one type of sign language. These findings answer the question of which deep learning model performs better in attempting to narrow the gap between speech-impaired people and the general public.
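As an illustration of the evaluation described in this section, the following sketch derives accuracy and support-weighted precision and recall from a multi-class confusion matrix. It assumes the convention that rows are true classes and columns are predicted classes. The function names and the example matrix are hypothetical and do not reproduce the paper's data.

```python
from typing import List, Tuple


def accuracy(cm: List[List[int]]) -> float:
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


def weighted_precision_recall(cm: List[List[int]]) -> Tuple[float, float]:
    """Support-weighted precision and recall.

    Assumes cm[i][j] counts samples of true class i predicted as class j.
    Each class's score is weighted by its number of true samples (support).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precision_sum = 0.0
    recall_sum = 0.0
    for k in range(n):
        true_positive = cm[k][k]
        support = sum(cm[k])                          # true samples of class k
        predicted = sum(cm[i][k] for i in range(n))   # samples predicted as k
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        precision_sum += support * precision
        recall_sum += support * recall
    return precision_sum / total, recall_sum / total


# Hypothetical two-class example, not data from the paper.
example = [[8, 2],
           [1, 9]]
precision, recall = weighted_precision_recall(example)
```

Support-weighted recall is mathematically equal to overall accuracy when both are computed on the same evaluation set, so precision is the quantity that adds information beyond accuracy here.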

Copyright

Copyright © 2026 I. Eswari, M. Riyasudeen, G. Thirisan, T. Deepak, H. Salim Mohamed Mahadeer. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET79873

Publish Date : 2026-04-10

ISSN : 2321-9653

Publisher Name : IJRASET
