POS Tagging of Sindhi Language Using Subword Representations and Neural Models

Authors

  • Jagroop Kaur, Himani Gupta, Gurpreet Singh Josan

Keywords:

Part of speech, tagging, word representations, subword representations, Byte Pair Encoding, BiLSTM, neural network

Abstract

Part-of-Speech (POS) tagging is a fundamental Natural Language Processing (NLP) task that supports a wide range of downstream applications. While extensive research exists for resource-rich languages such as English and Chinese, morphologically rich, low-resource languages such as Sindhi remain underexplored. This paper presents POS tagging models for Sindhi using word-level, character-level, joint word-character, and subword-level representations. To address ambiguity, out-of-vocabulary (OOV) words, and the need to preserve semantic information, we combine Byte Pair Encoding (BPE) based subword representations with Bidirectional Long Short-Term Memory (BiLSTM) networks. Two classifier settings are evaluated: a Dense layer and a Conditional Random Field (CRF) layer. Experiments are conducted on publicly available Sindhi datasets (SiPOS and Dootio-Wagan), with Dataset-1 used for training and Dataset-2 for evaluation. The joint word-character BiLSTM-CRF model achieves the highest accuracy (90%), while the proposed BPE-based subword BiLSTM-Dense model achieves 88%, outperforming the subword BiLSTM-CRF model (86%). These findings indicate that subword representations handle OOV words and morphological complexity effectively while retaining semantic information. The proposed models enrich Sindhi computational resources and point to directions for future work, including training Sindhi-specific BPE embeddings and exploring transformer-based architectures such as RoBERTa and GPT-2 to improve accuracy.
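
As an illustration of how BPE-based subword segmentation can cover out-of-vocabulary words, the following is a minimal sketch of BPE merge learning and segmentation in plain Python. It is not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy Latin-script corpus are all assumptions made for this example, and a real system would learn merges from a large Sindhi corpus or use pre-trained subword embeddings.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy illustration)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; a real setup would use Sindhi text.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, 10)
print(segment("lowest", merges))  # "lowest" never appears in the corpus
```

Because segmentation only ever concatenates adjacent symbols, any word, seen or unseen, is still fully covered by its subword pieces, which is the property that makes subword units attractive for OOV handling in a tagger.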

References

Ali, W., Xu, Z., and Kumar, J. 2021. SiPOS: A benchmark dataset for Sindhi part-of-speech tagging. In Proceedings of the Student Research Workshop Associated with RANLP 2021, 22-30.

Brill, E. 1992. A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Computational Linguistics (ACL), Trento.

Chiche, A., and Yitagesu, B. 2022. Part of speech tagging: A systematic review of machine learning and deep learning approaches. Journal of Big Data, 9 (1).

Nawaz, D., Awan, S. A., Bhotto, Z. A., Memon, M., and Hameed, M. 2017. Handling ambiguities in Sindhi named entity recognition (NER). Sindh University Research Journal (Science Series), 49 (3), 513-516.

Dootio, M. A., and Wagan, A. I. 2018. Unicode-8 based linguistics data set of annotated Sindhi text. Data in Brief, 19, 1504-1514.

Gage, P. 1994. A new algorithm for data compression. C Users Journal, 12 (2), 23-28.

Heinzerling, B., and Strube, M. 2018. BPEmb: Tokenization-free pre-trained subword embeddings in 275 languages. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Association (ELRA).

Huang, Z., Xu, W., and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2261-2270.

Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. 3rd International Conference for Learning Representations. San Diego: arXiv.

Lafferty, J., McCallum, A., and Pereira, F. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning, 282-289.

Mahar, J. A., and Memon, G. Q. 2010. Sindhi part of speech tagging system using WordNet. International Journal of Computer Theory and Engineering, 2 (4), 1793-8201.

Motlani, R., Lalwani, H., Sharma, D. M., and Shrivastava, M. 2015. Developing part-of-speech tagger for a resource poor language: Sindhi. In Proceedings of 7th Conference on Language and Technology, Poznan, Poland.

Patel, C., and Gali, K. 2008. Part-of-speech tagging for Gujarati using conditional random fields. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages.

Santos, C. D., and Zadrozny, B. 2014. Learning character-level representations for part-of-speech tagging. Proceedings of the 31st International Conference on Machine Learning, 32 (2), 1818-1826.

Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45, 2673-2681.

Sha, F., and Pereira, F. 2003. Shallow parsing with conditional random fields. Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 213-220.

Sodhar, I., Jalbani, A., Channa, M., and Hakro, D. 2021. Romanized Sindhi rules for text communication. Mehran University Research Journal of Engineering and Technology, 40 (2), 298-304. doi:10.22581/muet1982.2102.04

Warjri, S., Pakray, P., Lyngdoh, S. A., and Maji, A. K. 2021. Part-of-speech (POS) tagging using deep learning-based approaches on the designed Khasi POS corpus. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 21 (3), 2375-4699.

Published

20.02.2023

How to Cite

Jagroop Kaur. (2023). POS Tagging of Sindhi Language Using Subword Representations and Neural Models. International Journal of Intelligent Systems and Applications in Engineering, 11(2), 1057–1062. Retrieved from https://www.ijisae.org/index.php/IJISAE/article/view/7893

Section

Research Article