We present a rule-based framework for parsing Sanskrit sentences by directly leveraging the grammatical rules codified in Pāṇini’s Aṣṭādhyāyī. Our approach generates a parser table – a formal grammar or state-machine representation – derived from Pāṇini’s nearly 4,000 succinct rules (sūtras). This table forms the backbone of a Sanskrit parser that handles sandhi (euphonic combination), samāsa (compounding), vibhakti (case and verb inflection), and dhātu (root verb) processing in an integrated manner. We begin by underscoring the importance of Sanskrit parsing and the suitability of Pāṇini’s grammar for a rule-based system. A review of existing tools and approaches (e.g., the Sanskrit Heritage Engine, Vidyut-Prakriyā, and Computational Paninian Grammar frameworks) situates our work in the context of prior scholarship. We then detail our methodology for translating Pāṇini’s sūtras into a parser table, explaining how the parser applies ordered grammatical rules to analyze and segment Sanskrit input. The system’s capabilities are illustrated through examples, including a step-by-step parse of “rāmo’pi gṛham gacchati” (“Rāma also goes home”). In the discussion, we examine the parser’s performance and potential applications – from computational linguistics to digital humanities – and compare its behavior with existing models. We conclude by summarizing the contributions of this approach and outlining future directions, such as addressing ambiguity resolution and extending the framework with statistical methods.
Introduction
Sanskrit parsing is challenging due to its free word order, complex inflectional morphology, and sandhi phenomena that merge word boundaries. A robust parser must segment compounds and sandhi, identify roots and suffixes, and establish syntactic relations despite ambiguity.
Pāṇini’s Aṣṭādhyāyī, an ancient, comprehensive rule-based grammar of about 4,000 sūtras, provides an ideal foundation for building such a parser. His grammar uses a formal system of general rules and exceptions, enabling a deterministic rule-based approach that could theoretically cover all grammatical forms with high fidelity.
However, applying Pāṇini’s generative rules in reverse (parsing) introduces ambiguity, since multiple rule sequences can produce the same surface form. Managing this ambiguity requires strategies such as rule ordering and filtering.
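The source of this ambiguity can be seen in a minimal sketch. The forward table below is a tiny, hand-picked subset of vowel sandhi outcomes (not the paper's rule set), and the names `FORWARD` and `invert` are hypothetical. Inverting the table shows that a single surface sound maps back to several underlying pairs.

```python
from collections import defaultdict

# Illustrative subset of vowel sandhi in the generative (forward) direction:
# (final vowel of left word, initial vowel of right word) -> surface result.
FORWARD = {
    ("a", "a"): "ā", ("a", "ā"): "ā", ("ā", "a"): "ā", ("ā", "ā"): "ā",
    ("a", "i"): "e", ("a", "ī"): "e", ("ā", "i"): "e", ("ā", "ī"): "e",
}


def invert(forward):
    """Group every underlying pair by the surface form it produces."""
    inverse = defaultdict(list)
    for pair, surface in forward.items():
        inverse[surface].append(pair)
    return dict(inverse)


INVERSE = invert(FORWARD)

# A parser that meets surface "e" at a junction must consider every source pair.
for surface, sources in INVERSE.items():
    print(surface, "<-", sources)
```

Each surface vowel in this subset has four possible sources, which is why a reverse application of the rules needs ordering and filtering to stay tractable.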
Prior computational work includes the Sanskrit Heritage Engine (finite-state, lexicon-driven), Vidyut-Prakriyā (a rule-based word generator and analyzer faithful to Pāṇini), and frameworks based on Pāṇini’s kāraka theory for syntactic parsing. These approaches vary in how closely they follow Pāṇini’s rules and in whether they combine rule-based with statistical methods.
The proposed methodology involves automatically generating a parser table from Pāṇini’s sūtras. Each rule is encoded as a formal, machine-readable operation, and the rules are arranged hierarchically so that exceptions take precedence over the general rules they restrict. The parser table models valid grammatical transitions from roots/stems to fully inflected forms, handling compounds and sandhi internally.
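The hierarchical arrangement of general rules and exceptions might be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's actual table format: the `Rule` class, `select`, `apply_table`, and the two toy vowel rules are all hypothetical. The two rules are loosely modeled on the familiar case where the merging of similar vowels into a long vowel pre-empts the general change of i/u to a semivowel before a vowel.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (final sound of left item, initial sound of right item)


@dataclass
class Rule:
    label: str
    applies: Callable[[Pair], bool]
    rewrite: Callable[[Pair], str]
    exceptions: List["Rule"] = field(default_factory=list)  # more specific rules


VOWELS = {"a", "ā", "i", "ī", "u", "ū", "e", "o"}
LONG = {"a": "ā", "ā": "ā", "i": "ī", "ī": "ī", "u": "ū", "ū": "ū"}
SEMIVOWEL = {"i": "y", "ī": "y", "u": "v", "ū": "v"}

# Exception (toy version): two vowels of the same class merge into the long vowel.
long_merge = Rule(
    "similar vowels -> long vowel",
    lambda p: p[0] in LONG and p[1] in LONG and LONG[p[0]] == LONG[p[1]],
    lambda p: LONG[p[0]],
)

# General rule (toy version): i/u before any vowel becomes y/v.
semivowel = Rule(
    "i/u before vowel -> y/v",
    lambda p: p[0] in SEMIVOWEL and p[1] in VOWELS,
    lambda p: SEMIVOWEL[p[0]] + p[1],
    exceptions=[long_merge],
)

TABLE = [semivowel, long_merge]


def select(rule: Rule, pair: Pair) -> Optional[Rule]:
    """Return the most specific applicable rule; exceptions block the general rule."""
    if not rule.applies(pair):
        return None
    for exc in rule.exceptions:
        chosen = select(exc, pair)
        if chosen is not None:
            return chosen
    return rule


def apply_table(pair: Pair) -> Optional[Tuple[str, str]]:
    """Scan the table in order and apply the first rule that fires."""
    for rule in TABLE:
        chosen = select(rule, pair)
        if chosen is not None:
            return chosen.label, chosen.rewrite(pair)
    return None


print(apply_table(("i", "a")))  # general rule fires
print(apply_table(("i", "i")))  # exception pre-empts the general rule
```

Storing exceptions inside the rule they restrict keeps the precedence explicit in the data, so the table lookup does not depend on any cleverness in the control flow.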
A lexicon of verb roots (dhātu) and nominal stems (prātipadika) is integrated as entry points. Sandhi is addressed both internally (via transitions modeling phonological changes) and externally (via a sandhi splitter that reverses sandhi at word boundaries to generate candidate segmentations). Multiple candidate parses are generated and then pruned or ranked.
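The external sandhi step (generate candidate splits, then prune) could look roughly like the sketch below. Everything here is an assumption made for illustration: the reverse-rule table, the toy lexicon of inflected forms, and the function names are hypothetical, and an ASCII apostrophe stands in for the avagraha. A real splitter would also recurse over multi-word chunks and consult the morphological analysis instead of a flat word list.

```python
# Hypothetical reverse-sandhi table:
# (surface junction, restored end of left word, restored start of right word)
REVERSE_SANDHI = [
    ("o'", "aḥ", "a"),  # -aḥ + a-  surfaces as -o'-  (rāmaḥ + api -> rāmo'pi)
    ("ā", "a", "a"),    # a + a surfaces as ā
    ("ā", "ā", "a"),    # ā + a surfaces as ā
    ("e", "a", "i"),    # a + i surfaces as e
]

# Toy lexicon of already-inflected forms, used only to prune candidates.
LEXICON = {"rāmaḥ", "api", "gṛham", "gacchati"}


def split_candidates(chunk):
    """Undo one sandhi rule at every position where its surface form occurs."""
    candidates = []
    for surface, left_end, right_start in REVERSE_SANDHI:
        start = chunk.find(surface)
        while start != -1:
            left = chunk[:start] + left_end
            right = right_start + chunk[start + len(surface):]
            candidates.append((left, right))
            start = chunk.find(surface, start + 1)
    return candidates


def prune(candidates):
    """Keep only splits whose two halves are both known word forms."""
    return [(l, r) for l, r in candidates if l in LEXICON and r in LEXICON]


all_splits = split_candidates("rāmo'pi")
print(all_splits)         # includes spurious splits at the internal ā
print(prune(all_splits))  # only the lexicon-backed split survives
```

Over-generating first and filtering afterwards mirrors the generate-then-prune design described above: the splitter stays purely phonological, and lexical knowledge enters only at the pruning stage.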
This rule-based system aims to produce a comprehensive morphological and syntactic analysis of Sanskrit text, grounded in Pāṇini’s grammar, filling a gap left by previous tools, each of which handles only part of the problem. Future directions include integrating statistical methods to resolve residual ambiguities.
Conclusion
In this paper, we have presented a comprehensive rule-based Sanskrit parser that draws its authority from Pāṇini’s Aṣṭādhyāyī. We described how each aspect of Sanskrit grammar – from sandhi to compounding to inflection – can be captured through formal rules and implemented as a parser table for analyzing text. The parser effectively “reverses” Pāṇini’s generative grammar, using it to deconstruct words into their constituents. Our system demonstrates that Pāṇini’s grammar is not only a historical artifact but also a practical blueprint for modern computational linguistics, capable of guiding the development of precise language technology for Sanskrit.
The results from our implementation underscore both the strengths and challenges of a rule-based approach. On one hand, the parser achieves high coverage and returns analyses that are linguistically sound and exact, making it a valuable tool for scholars and advanced applications. On the other hand, it produces all theoretically possible interpretations, necessitating further work on disambiguation for real-world usage. Nonetheless, we argue that starting with a rule-based core is worthwhile, especially for Sanskrit, given the availability of a fully specified grammar and the relatively limited size of annotated corpora for statistical methods.
References
[1] Bloomfield, L. (1927). On Some Rules of Pāṇini. Journal of the American Oriental Society, 47, 61–70. https://doi.org/10.2307/593241
[2] Asher, R. E. (1994). The encyclopedia of language and linguistics (Vol. 2). J. M. Simpson (Ed.). Oxford: Pergamon.
[3] A. Bharati, V. Chaitanya, and R. Sangal, Natural Language Processing: A Paninian Perspective. New Delhi, India: Prentice-Hall of India, 1995. (Introduction of Computational Paninian Grammar for Indian languages)
[4] G. Huet, “Lexicon-Directed Segmentation and Tagging of Sanskrit,” in Sanskrit Computational Linguistics (Lecture Notes in Computer Science 5402), Edited by G. Huet, A. Kulkarni, P. Scharf. Berlin: Springer, 2009, pp. 75–95.
[5] P. Goyal, A. Kulkarni, and L. Behera, “Computer Simulation of Pāṇini’s Aṣṭādhyāyī: Some Insights,” in Sanskrit Computational Linguistics, LNCS 5402, 2009, pp. 139–161.
[6] O. Hellwig, “SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit,” in Sanskrit Computational Linguistics, LNCS 5402, 2009, pp. 266–277.
[7] Kulkarni, A., Pokar, S., & Shukl, D. (2010). Designing a constraint based parser for Sanskrit. In Sanskrit Computational Linguistics: 4th International Symposium, New Delhi, India, December 10-12, 2010. Proceedings (pp. 70-90). Springer Berlin Heidelberg.
[8] K. Raja et al., “Computational Algorithms Based on the Paninian System to Process Euphonic Conjunctions (Sandhi) for Sanskrit Text Search,” IJCSIS, vol. 12, no. 8, pp. 65–74, 2014.
[9] Prasad, A. K. (2024, February). A fast prakriyā generator. In Proceedings of the 7th International Sanskrit Computational Linguistics Symposium (pp. 84-101).
[10] Goyal, P., & Huet, G. (2016). Design and analysis of a lean interface for Sanskrit corpus annotation. Journal of Language Modelling, 4(2), 145-182.
[11] Scharf, P. M. (2015). An XML formalization of the Aṣṭādhyāyī. In Sanskrit and computational linguistics: Select papers presented at the 16th World Sanskrit Conference in the ‘Sanskrit and the IT world’ section, 28 June–2 July 2015, Sanskrit Studies Center, Silpakorn University, Bangkok (pp. 77-102).
[12] Patel, D., & Katuri, S. (2015). Prakriyāpradarśinī - an open source subanta generator. In Sanskrit and Computational Linguistics-16th World Sanskrit Conference, Bangkok, Thailand.
[13] Sanskrit Heritage Site, INRIA Paris. “Sanskrit Heritage Engine and Reader – Documentation and FAQ,”
[15] P. Goyal et al., “Transliteration and Sanskrit Tokenization (TransLIST),” in Proceedings of ICON 2022, pp. 290–299. (Example of a modern transformer-based Sanskrit tokenizer for comparison with rule-based methods)
[16] Krishna, A., Satuluri, P., & Goyal, P. (2017, August). A dataset for Sanskrit word segmentation. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (pp. 105-114).
[17] Goyal, P., Arora, V., & Behera, L. (2007, October). Analysis of Sanskrit text: Parsing and semantic relations. In International Sanskrit Computational Linguistics Symposium (pp. 200-218). Berlin, Heidelberg: Springer Berlin Heidelberg.
[18] Mishra, A. (2007, October). Simulating the Pāṇinian system of Sanskrit grammar. In International Sanskrit Computational Linguistics Symposium (pp. 127-138). Berlin, Heidelberg: Springer Berlin Heidelberg.
[19] Hellwig, O. (2010, December). Performance of a Lexical and POS Tagger for Sanskrit. In International Sanskrit Computational Linguistics Symposium (pp. 162-172). Berlin, Heidelberg: Springer Berlin Heidelberg.
[20] Sandhan, J., Singha, R., Rao, N., Samanta, S., Behera, L., & Goyal, P. (2022). TransLIST: A transformer-based linguistically informed Sanskrit tokenizer. arXiv preprint arXiv:2210.11753.