I offer a Natural Language Processing (NLP) tool set for Romani, aimed at language professionals, researchers, and students. The tool set comprises automatic morphological analysers, disambiguators, spellers, and online dictionaries. The disambiguators and spellers depend on the automatic morphological analysers. My products are open source and licensed under the GNU Lesser General Public License v3.0. Source code will always be provided together with the products. Please let me know which tools you are interested in and what you would like to have. TBA: Software demos will be added to this page soon.
Automatic morphological analysers
I implement automatic morphological analysers for Romani dialects using the two-level model (TWOL), which was developed by Kimmo Koskenniemi at the University of Helsinki. The analysers are built with the Helsinki Finite-State Technology (HFST) software, a programming library and set of utilities for natural language processing with finite-state automata and finite-state transducers. HFST has been used for writing various linguistic tools, such as spell-checkers, hyphenators, and morphologies. The HFST software is licensed under a GNU Lesser General Public License v3.0.
Automatic morphological analysers are built upon existing language documentation (dictionaries, grammars). TWOL-based analysers comprise a lexicon component and a rule component. The lexicon component represents the morphotactic description of the language. It contains the deep (underlying) forms of morphemes and information on how they can be concatenated. The rule component corresponds to the morphophonemic description of the language: it specifies how each deep form of a morpheme is realised on the surface, depending on its phonological environment.
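The division of labour between the two components can be illustrated with a minimal Python sketch. It is not HFST code and it contains no Romani data: the stems, tags, and the e-insertion rule are a hypothetical English toy example, and ordered regular-expression rewrites stand in for genuine two-level rules, which apply in parallel and are compiled into finite-state transducers.

```python
import re

# Lexicon component (morphotactics): which morphemes exist and how they combine.
# "^" marks a morpheme boundary in the deep form. All entries are toy examples.
STEMS = {"fox": "N", "cat": "N", "bus": "N"}
SUFFIXES = {"N": [("", "+Sg"), ("^s", "+Pl")]}

# Rule component (morphophonemics): how deep forms are realised on the surface.
# Simplification: ordered rewrites instead of parallel two-level rules.
RULES = [
    (re.compile(r"(?<=[sxz])\^(?=s)"), "e"),  # e-insertion: fox^s -> foxes
    (re.compile(r"\^"), ""),                  # drop any remaining boundary
]


def deep_forms():
    """Yield (deep form, analysis) pairs licensed by the lexicon."""
    for stem, pos in STEMS.items():
        for suffix, tag in SUFFIXES[pos]:
            yield stem + suffix, f"{stem}+{pos}{tag}"


def to_surface(deep):
    """Apply the rule component to a deep form."""
    for pattern, replacement in RULES:
        deep = pattern.sub(replacement, deep)
    return deep


def analyse(word):
    """Return every analysis whose surface realisation equals the word."""
    return sorted(a for deep, a in deep_forms() if to_surface(deep) == word)


print(analyse("foxes"))
print(analyse("cats"))
```

Here `analyse` works by generating every form the lexicon allows and comparing surfaces, which is only practical for a toy lexicon. A compiled transducer performs the same mapping in both directions without enumeration.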
I produce disambiguators for Romani to improve the results of automatic morphological analysis. A morphological analyser often delivers several analyses, or readings, for a single word; the goal of a disambiguator is to choose between them based on a set of grammatical rules that refer to the word's context. I use Eckhard Bick's and Tino Didriksen's VISL CG-3 (vislcg3) for building disambiguators. VISL CG-3 is a compiler and parser for Constraint Grammar (CG), a rule-based formalism for disambiguation and parsing. CG was originally developed by Fred Karlsson at the University of Helsinki. VISL CG-3 is open source and licensed under GPL.
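The idea of rule-based disambiguation can be sketched in a few lines of Python. This is an illustration of the principle only, not VISL CG-3 syntax or its API: the `Cohort` class, the tags, and the single rule are hypothetical, and the example sentence is English rather than Romani.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """A word form together with all readings the analyser delivered for it."""
    word: str
    readings: list  # each reading is a set of tags


def remove_verb_after_det(cohorts, i):
    """Toy constraint: discard verb readings right after an unambiguous determiner."""
    if i == 0:
        return
    previous = cohorts[i - 1].readings
    if all("Det" in reading for reading in previous):
        kept = [r for r in cohorts[i].readings if "V" not in r]
        if kept:  # as in Constraint Grammar, the last reading is never removed
            cohorts[i].readings = kept


def disambiguate(cohorts, rules):
    """Apply every rule at every position, left to right."""
    for i in range(len(cohorts)):
        for rule in rules:
            rule(cohorts, i)
    return cohorts


sentence = [
    Cohort("the", [{"Det"}]),
    Cohort("run", [{"N", "Sg"}, {"V", "Inf"}]),
]
disambiguate(sentence, [remove_verb_after_det])
print(sentence[1].readings)
```

After the rule has applied, only the noun reading of "run" remains. A real grammar contains many such constraints, and each one narrows the set of readings without ever leaving a word with no analysis at all.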
I create responsive online dictionaries from existing lexical sources of Romani dialects. The dictionaries can be compiled so that they interact with the lexicon components of the automatic morphological analysers. The layouts and user interfaces for browsing, complementing, and maintaining the online dictionaries are always designed in collaboration with the customer. Publishing dictionaries on a wiki platform will facilitate crowdsourcing.