Somali Corpus: state of the art, and tools for linguistic analysis
Dr. Jama Musse Jama
The development of IT resources for languages focuses mainly on well-described languages with long-standing written traditions and large numbers of speakers. One of the main challenges for languages with more recent written traditions is the lack of sufficient data for successful statistical approaches. This descriptive paper presents the state of the art in the construction of the Redsea Cultural Foundation’s Somali Corpus (RCF-SC) and the development of a series of computer programs for analyzing the corpus data for various purposes. With almost 3 million words tagged (out of the 10 million words collected so far), 1,100 published works parsed and indexed, over 52,000 headwords collected and defined, almost 6 million inflected forms of nouns and verbs generated, and over 10,000 translations in 4 languages (English, Italian, French and Swedish), the Somali Corpus is one of the world’s largest, and still growing, online Somali corpora for linguistic research. The core of RCF-SC is unique in Somali-speaking countries and aims to be, for Somali, a resource equivalent in quality to the British National Corpus. The first edition of the corpus is online at www.somalicorpus.com.