Somali Corpus: state of the art, and tools for linguistic analysis

Jama Musse Jama (aka Jaamac Muuse Jaamac) is an ethno-mathematician and author. He holds a PhD in African Studies, specializing in Computational Linguistics of African Languages. He created and currently directs the Somali Corpus, an online platform for managing a corpus database of the Somali language (see www.somalicorpus.com). Dr. Jama Musse is also known for his research on traditional Somali board games and other African indigenous knowledge as a means of improving basic education and development in Africa. He is the founder of the Redsea Cultural Foundation, the organisation behind the Hargeysa International Book Fair, which has become one of the most important literary gatherings and book celebrations in East Africa. He is now leading the establishment of the Hargeysa Cultural Centre in Somaliland.
Dr. Jama Musse Jama
Abstract
The development of IT resources for languages focuses mainly on well-described languages with long-standing written traditions and large numbers of speakers. One of the main challenges for languages with more recent written traditions is the lack of sufficient data for successful statistical approaches. This descriptive paper presents the state of the art in the construction of the Redsea Cultural Foundation’s Somali Corpus (RCF-SC) and the development of a series of computer programs for analyzing the corpus data for various purposes. With almost 3 million words tagged (out of the 10 million words collected so far), 1,100 published works parsed and indexed, over 52,000 head words collected and defined, almost 6 million inflected forms of nouns and verbs generated, and over 10,000 translations in 4 languages (English, Italian, French and Swedish), the Somali Corpus is one of the world’s largest, and still growing, online collections of Somali corpora for linguistic research. The core of RCF-SC is unique in Somali-speaking countries and aims to be, for Somali, a resource equivalent in quality to the British National Corpus. The first edition of the corpus is online at www.somalicorpus.com.
