Researchers roll out multilingual tool for Indic languages

Created On: 25 Sept 2023 8:47 AM IST

Hyderabad: Although India is linguistically diverse, its native languages lack annotated resources and automated tools. Open Information Extraction (OIE) involves the extraction of valuable facts from natural language text in open domains. One such multilingual OIE tool, known as IndIE, has been developed by five researchers: Ritwik Mishra and Rajiv Ratn Shah (IIIT Delhi), Simranjeet Singh (NSUT Delhi), Ponnurangam Kumaraguru (IIIT Hyderabad), and Pushpak Bhattacharyya (IIT Bombay).

To evaluate IndIE's effectiveness on Hindi, a benchmark called Hindi-BenchIE has been established for the automated assessment of Hindi triples. IndIE has been systematically evaluated against the gold-standard triples extracted from 112 Hindi sentences.
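As a rough illustration of what automated assessment against gold-standard triples can look like, the sketch below scores extracted triples by exact match against a reference set. This is a simplified assumption for illustration only, not the actual Hindi-BenchIE scoring procedure; the function name `score_triples` and the transliterated sample triples are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def score_triples(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 of predicted triples against gold triples."""
    pred_set, gold_set = set(predicted), set(gold)
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if matched else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical triples (transliterated Hindi), not taken from the benchmark.
gold = [("ram", "gaya", "dilli"), ("sita", "padhti hai", "kitab")]
predicted = [("ram", "gaya", "dilli"), ("sita", "hai", "kitab")]
scores = score_triples(predicted, gold)
```

Exact matching is the strictest possible criterion; a real benchmark may accept several equivalent formulations of the same fact.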

IndIE's chunker has been shown to generalise across natural languages, and its triple generation rules are rooted in dependency relations that Hindi shares with other Indic languages such as Urdu, Tamil, and Telugu. It is therefore plausible that IndIE can generate meaningful triples for sentences in these languages as well.

Speaking to The Hans India, Professor Ponnurangam Kumaraguru, International Institute of Information Technology, Hyderabad, says, “The number of languages supported by IndIE is limited by the intersection of (languages supported by stanza library) with (languages with significant dependency overlap with Hindi dependency relations), that gives us the following Indic languages: Tamil, Telugu, and Urdu. We don’t recommend IndIE for Marathi because stanza dependency parser tool for Marathi is trained on such data where Marathi words/tokens were further tokenised/broken. The primary reason we argue that IndIE will work for Tamil, Telugu, and Urdu is because of high (~96 per cent) overlap of dependency relations between them and Hindi.”

The most commonly used dependency relations are the subject and object relations. Here, “subject” covers all of its subtypes, such as nsubj and nsubj:pass, and the same applies to objects. Both of these dependency relations help identify the head and tail of a triple.
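The role of subject and object relations described above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not IndIE's actual rule set: the `Token` type, the `extract_triples` function, and the hand-written parse of a transliterated toy sentence are all hypothetical, and only the relation labels (nsubj, nsubj:pass, obj, iobj) come from the Universal Dependencies scheme.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    text: str
    head: int    # index of the governing token (0 = root)
    deprel: str  # Universal Dependencies relation label


def base_relation(deprel: str) -> str:
    """Strip the subtype so that nsubj:pass is treated like nsubj."""
    return deprel.split(":")[0]


def extract_triples(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Pair each predicate's subject (head of the triple) with its object (tail)."""
    triples = []
    for pred in tokens:
        children = [t for t in tokens if t.head == pred.idx]
        subjects = [t.text for t in children if base_relation(t.deprel) == "nsubj"]
        objects = [t.text for t in children if base_relation(t.deprel) in {"obj", "iobj"}]
        for subj in subjects:
            for obj in objects:
                triples.append((subj, pred.text, obj))
    return triples


# Hand-written parse of the toy sentence "Ram ne kitab padhi" (Ram read a book).
sentence = [
    Token(1, "Ram", 4, "nsubj"),
    Token(2, "ne", 1, "case"),
    Token(3, "kitab", 4, "obj"),
    Token(4, "padhi", 0, "root"),
]
triples = extract_triples(sentence)
```

In practice the parse would come from a dependency parser rather than being written by hand, and a real extractor would work on chunks rather than single tokens.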

The potential applications of IndIE are similar to those of any other OIE or triple extraction tool: triple extraction is considered the first step in building a Knowledge Graph (KG) out of unstructured text. Professor Kumaraguru adds, “There are many OIE methods for English due to which we have a variety of KGs for English.

Since Indic languages don’t have such tools publicly available, our work fills in the gap. A major implication of our work is to be one stop for all the multilingual OIE methods.

Moreover, we are positive that IndIE will initiate many other works in the direction of multilingual OIE.”