Text Preprocessing for Amharic
When working in NLP, text preprocessing is one of the essential steps for producing clean, well-formatted data before passing it to a model. High-resource languages such as English and other European languages have tools such as NLTK that make text preprocessing straightforward, but the same is not true for Amharic. Amharic is the official working language of the Ethiopian government, spoken by more than 100M people across Ethiopia and around the world. The Amharic script is not Latin; it uses the Ge'ez script, which makes these preprocessing steps somewhat more challenging.
The aim of this notebook is to support researchers working on NLP tasks for Amharic. The following preprocessing steps are included:
- Short form expansion
- Multi-word detection
- Character-level mismatch normalization
- Number mismatch normalization
Short Form Expansion and Character Level Normalization
To deal with multi-word short-form representations, a list of Amharic short forms is consulted to expand each short-form expression into its long form. For example, ትምህርት ቤት (school) can also be written as ት/ቤት in Amharic text.
import re

class normalize(object):
    expansion_file_dir = ''  # path to a gazetteer file listing short forms with their expansions
    short_form_dict = {}

    # Constructor: load the gazetteer once
    def __init__(self):
        self.short_form_dict = self.get_short_forms()

    # Read the gazetteer, where each line maps a short form to its expansion, e.g. "ት/ቤት - ትምህርት ቤት"
    def get_short_forms(self):
        exp = {}
        with open(self.expansion_file_dir, encoding='utf8') as text:
            for line in text:
                line = line.strip()
                if not line:  # skip blank lines
                    continue
                expanded = line.split("-")
                # join multi-word expansions with '_' so they remain a single token
                exp[expanded[0].strip()] = expanded[1].replace(" ", '_').strip()
        return exp

    # Expand a short form to its long form; unknown tokens are returned unchanged
    def expand_short_form(self, input_short_word):
        if input_short_word in self.short_form_dict:
            return self.short_form_dict[input_short_word]
        return input_short_word
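A minimal usage sketch follows. It assumes a hypothetical gazetteer file short_forms.txt with one mapping per line, such as ት/ቤት - ትምህርት ቤት; the file name is illustrative, not part of the original notebook.

# 'short_forms.txt' is an assumed example path
normalize.expansion_file_dir = 'short_forms.txt'
norm = normalize()
print(norm.expand_short_form('ት/ቤት'))  # -> ትምህርት_ቤት (the expansion is joined with '_')
print(norm.expand_short_form('ሰላም'))   # tokens not in the gazetteer are returned unchanged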
The following function performs character-level mismatch normalization. Amharic has different characters that are used interchangeably in writing and reading, such as (ሀ, ኀ, ሐ, and ኸ), (ሰ and ሠ), (ጸ and ፀ), (ው and ዉ), and (አ and ዓ). For example, ጸሀይ, meaning sun, can also be written as ፀሐይ. In addition, Amharic words with a suffix such as ቷል are also written as ቱዋል. So I normalize every character in such a class to a common canonical representation.
def normalize_char_level_missmatch(input_token):
    # collapse the ሀ-, ሰ-, አ-, and ጸ-families to one canonical character per order
    rep1=re.sub('[ሃኅኃሐሓኻ]','ሀ',input_token)
    rep2=re.sub('[ሑኁዅ]','ሁ',rep1)
    rep3=re.sub('[ኂሒኺ]','ሂ',rep2)
    rep4=re.sub('[ኌሔዄ]','ሄ',rep3)
    rep5=re.sub('[ሕኅ]','ህ',rep4)
    rep6=re.sub('[ኆሖኾ]','ሆ',rep5)
    rep7=re.sub('[ሠ]','ሰ',rep6)
    rep8=re.sub('[ሡ]','ሱ',rep7)
    rep9=re.sub('[ሢ]','ሲ',rep8)
    rep10=re.sub('[ሣ]','ሳ',rep9)
    rep11=re.sub('[ሤ]','ሴ',rep10)
    rep12=re.sub('[ሥ]','ስ',rep11)
    rep13=re.sub('[ሦ]','ሶ',rep12)
    rep14=re.sub('[ዓኣዐ]','አ',rep13)
    rep15=re.sub('[ዑ]','ኡ',rep14)
    rep16=re.sub('[ዒ]','ኢ',rep15)
    rep17=re.sub('[ዔ]','ኤ',rep16)
    rep18=re.sub('[ዕ]','እ',rep17)
    rep19=re.sub('[ዖ]','ኦ',rep18)
    rep20=re.sub('[ጸ]','ፀ',rep19)
    rep21=re.sub('[ጹ]','ፁ',rep20)
    rep22=re.sub('[ጺ]','ፂ',rep21)
    rep23=re.sub('[ጻ]','ፃ',rep22)
    rep24=re.sub('[ጼ]','ፄ',rep23)
    rep25=re.sub('[ጽ]','ፅ',rep24)
    rep26=re.sub('[ጾ]','ፆ',rep25)
    # Normalize words with labialized Amharic characters, such as በልቱዋል or በልቱአል to በልቷል
    rep27=re.sub('(ሉ[ዋአ])','ሏ',rep26)
    rep28=re.sub('(ሙ[ዋአ])','ሟ',rep27)
    rep29=re.sub('(ቱ[ዋአ])','ቷ',rep28)
    rep30=re.sub('(ሩ[ዋአ])','ሯ',rep29)
    rep31=re.sub('(ሱ[ዋአ])','ሷ',rep30)
    rep32=re.sub('(ሹ[ዋአ])','ሿ',rep31)
    rep33=re.sub('(ቁ[ዋአ])','ቋ',rep32)
    rep34=re.sub('(ቡ[ዋአ])','ቧ',rep33)
    rep35=re.sub('(ቹ[ዋአ])','ቿ',rep34)
    rep36=re.sub('(ሁ[ዋአ])','ኋ',rep35)
    rep37=re.sub('(ኑ[ዋአ])','ኗ',rep36)
    rep38=re.sub('(ኙ[ዋአ])','ኟ',rep37)
    rep39=re.sub('(ኩ[ዋአ])','ኳ',rep38)
    rep40=re.sub('(ዙ[ዋአ])','ዟ',rep39)
    rep41=re.sub('(ጉ[ዋአ])','ጓ',rep40)
    rep42=re.sub('(ዱ[ዋአ])','ዷ',rep41)
    rep43=re.sub('(ጡ[ዋአ])','ጧ',rep42)
    rep44=re.sub('(ጩ[ዋአ])','ጯ',rep43)
    rep45=re.sub('(ጹ[ዋአ])','ጿ',rep44)
    rep46=re.sub('(ፉ[ዋአ])','ፏ',rep45)
    rep47=re.sub('[ቊ]','ቁ',rep46)  # ቁ can also be written as ቊ
    rep48=re.sub('[ኵ]','ኩ',rep47)  # ኩ can also be written as ኵ
    return rep48
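For example, applying the function to the two spellings discussed above:

print(normalize_char_level_missmatch('ጸሐይ'))    # -> ፀሀይ (ጸ -> ፀ, ሐ -> ሀ)
print(normalize_char_level_missmatch('በልቱዋል'))  # -> በልቷል (ቱዋ -> ቷ)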
The following function replaces any occurrence of a special character or punctuation mark with the empty string. The Amharic punctuation marks handled include: ፡ ። ፤ ፦ ፧ ፨ ፠ ፣.
def remove_punc_and_special_chars(text):
    # strip both Latin/general punctuation and the Amharic punctuation marks listed above
    normalized_text = re.sub('[\!\@\#\$\%\^\«\»\&\*\(\)\…\[\]\{\}\;\“\”\›\’\‘\"\'\:\,\.\‹\/\<\>\?\\\\|\`\´\~\-\=\+\፡\።\፤\;\፦\፥\፧\፨\፠\፣]', '', text)
    return normalized_text

# remove ASCII letters, Arabic digits, and Geez numerals
def remove_ascii_and_numbers(text_input):
    rm_num_and_ascii = re.sub('[A-Za-z0-9]', '', text_input)
    return re.sub('[\u1369-\u137C]+', '', rm_num_and_ascii)  # U+1369-U+137C are the Ethiopic numerals ፩-፼
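A quick demonstration of the two cleaners chained together; the sample string is made up for illustration:

sample = 'ሰላም! hello 2013 ፪ ነው።'
cleaned = remove_ascii_and_numbers(remove_punc_and_special_chars(sample))
print(cleaned)  # Latin letters, digits, Geez numerals, and punctuation are stripped; extra spaces remain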
Multi-word Detection Using a Collocation Finder
In natural language, a token can be formed from a single word or from multiple words. To account for tokens formed from multiple words, a component dedicated to detecting them is required in the preprocessing stage. First, each sentence is tokenized into a list of tokens.
In Amharic, the individual words in a sentence are separated by two dots (፡ ሁለት ነጥብ). The end of a sentence is marked by the Amharic full stop (። አራት ነጥብ). The symbol (፣ ነጠላ ሰረዝ) represents a comma, while (፤ ድርብ ሰረዝ) corresponds to a semicolon. The '!' and '?' marks end exclamatory and interrogative sentences, respectively.
Then multi-words are detected using an n-gram approach:
- The first step in this component is forming all possible bigrams from the tokenized input text.
- Next, a chi-square score is computed for each candidate bigram; bigrams whose chi-square value exceeds an experimentally chosen threshold are treated as multi-words. (The code below approximates this by keeping the top-ranked bigrams with nbest.)
from nltk import BigramCollocationFinder
import nltk.collocations
import io
import re
import os

# additional methods of the normalize class, shown here as a separate snippet
class normalize(object):
    # Split the corpus into sentences, then whitespace-tokenize each sentence
    def tokenize(self, corpus):
        print('Tokenization ...')
        all_tokens = []
        # sentence enders: '!', '?', '።', or the two-word-separator form '፡፡'
        sentences = re.split('(?:[!?።]|፡፡)+', corpus)
        for sentence in sentences:
            tokens = sentence.split()  # assumes non-sentence punctuation has already been removed
            all_tokens.extend(tokens)
        return all_tokens

    def collocation_finder(self, tokens, bigram_dir):
        bigram_measures = nltk.collocations.BigramAssocMeasures()
        # search for bigrams within the corpus
        finder = BigramCollocationFinder.from_words(tokens)
        # keep only bigrams that appear 3 or more times
        finder.apply_freq_filter(3)
        # rank the remaining bigrams by chi-square score and keep the top 5
        frequent_bigrams = finder.nbest(bigram_measures.chi_sq, 5)
        print(frequent_bigrams)
        with io.open(bigram_dir, "w", encoding="utf8") as phrase_writer:
            for bigram in frequent_bigrams:
                phrase_writer.write(bigram[0] + ' ' + bigram[1] + "\n")

    def normalize_multi_words(self, tokenized_sentence, bigram_dir, corpus):
        # bigram_dir: the file in which detected multi-words are stored
        bigram = set()
        sent_with_bigrams = []
        index = 0
        if not os.path.exists(bigram_dir):
            self.collocation_finder(self.tokenize(corpus), bigram_dir)
            # call itself now that the bigram file exists
            return self.normalize_multi_words(tokenized_sentence, bigram_dir, corpus)
        with open(bigram_dir, encoding='utf8') as text:
            for line in text:
                line = line.strip()
                if not line:  # skip blank lines
                    continue
                bigram.add(line)
        if len(tokenized_sentence) == 1:
            sent_with_bigrams = tokenized_sentence
        else:
            while index <= len(tokenized_sentence) - 2:
                mword = tokenized_sentence[index] + ' ' + tokenized_sentence[index + 1]
                if mword in bigram:
                    # join the pair with '_' so it stays a single token, as with short-form expansion
                    sent_with_bigrams.append(tokenized_sentence[index] + '_' + tokenized_sentence[index + 1])
                    index += 2  # skip the token consumed by the bigram
                else:
                    sent_with_bigrams.append(tokenized_sentence[index])
                    index += 1
            if index == len(tokenized_sentence) - 1:
                sent_with_bigrams.append(tokenized_sentence[index])
        return sent_with_bigrams
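A minimal usage sketch, assuming an illustrative corpus string and a hypothetical bigram file path bigrams.txt (created on the first call; none of these names come from the original notebook):

norm = normalize()
corpus = 'አዲስ አበባ ትልቅ ከተማ ናት። ...'  # assumed raw corpus; each bigram must occur 3+ times to be detected
tokens = ['አዲስ', 'አበባ', 'ትልቅ', 'ከተማ', 'ናት']
print(norm.normalize_multi_words(tokens, 'bigrams.txt', corpus))
# if 'አዲስ አበባ' is in the bigram file, the output starts with 'አዲስ_አበባ'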
Normalize Ge'ez and Arabic Number Mismatch
This code snippet automatically normalizes Arabic numerals to their Ge'ez form, for example 1 = ፩, 2 = ፪, and so on. It also expands decimal numbers into a text representation.
def arabic2geez(arabicNumber):
    ETHIOPIC_ONE = 0x1369           # ፩
    ETHIOPIC_TEN = 0x1372           # ፲
    ETHIOPIC_HUNDRED = 0x137B       # ፻
    ETHIOPIC_TEN_THOUSAND = 0x137C  # ፼
    arabicNumber = str(arabicNumber)
    n = len(arabicNumber) - 1  # index of the last digit
    if n % 2 == 0:
        arabicNumber = "0" + arabicNumber  # pad to an even number of digits
        n += 1
    arabicBigrams = [arabicNumber[i:i+2] for i in range(0, n, 2)]  # split into two-digit pairs
    reversedArabic = arabicBigrams[::-1]  # reverse so the least significant pair comes first
    geez = []
    for index, pair in enumerate(reversedArabic):
        curr_geez = ''
        artens = pair[0]  # Arabic tens digit
        arones = pair[1]  # Arabic ones digit
        amtens = ''
        amones = ''
        if artens != '0':
            amtens = str(chr((int(artens) + (ETHIOPIC_TEN - 1))))  # replace with Geez tens [፲,፳,፴, ...]
        else:
            if arones == '0':  # the pair is 00: nothing to emit
                continue
        if arones != '0':
            amones = str(chr((int(arones) + (ETHIOPIC_ONE - 1))))  # replace with Geez ones [፩,፪,፫, ...]
        if index > 0:
            if index % 2 != 0:  # odd index: hundreds position
                curr_geez = amtens + amones + str(chr(ETHIOPIC_HUNDRED))  # append ፻
            else:  # even index: ten-thousands position
                curr_geez = amtens + amones + str(chr(ETHIOPIC_TEN_THOUSAND))  # append ፼
        else:  # last bigram (rightmost part)
            curr_geez = amtens + amones
        geez.append(curr_geez)
    geez = ''.join(geez[::-1])
    # ፩፻ (one hundred) and ፩፼ (one ten-thousand) are written without the leading ፩
    if geez.startswith('፩፻') or geez.startswith('፩፼'):
        geez = geez[1:]
    # large round numbers: append one ፼ per skipped group of trailing zeros
    if len(arabicNumber) >= 7:
        end_zeros = ''.join(re.findall('([0]+)$', arabicNumber)[0:])
        i = int(len(end_zeros) / 3)
        if len(end_zeros) >= (3 * i):
            if i >= 3:
                i -= 1
            for thousand in range(i - 1):
                geez += '፼'
    return geez
def getExpandedNumber(number):
    # whole numbers are converted directly
    if '.' not in str(number):
        return arabic2geez(number)
    # decimal numbers: convert the integer and fractional parts separately,
    # joined by 'ነጥብ' (point)
    num, decimal = str(number).split('.')
    if decimal.startswith('0'):
        decimal = decimal[1:]
        dot = ' ነጥብ ዜሮ '  # "point zero"
    else:
        dot = ' ነጥብ '  # "point"
    return arabic2geez(num) + dot + arabic2geez(decimal)
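Two worked examples, with values chosen purely for illustration:

print(arabic2geez(123))         # -> ፻፳፫ (the leading ፩ before ፻ is dropped)
print(getExpandedNumber(3.14))  # -> ፫ ነጥብ ፲፬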
Your comments are my teachers, so please drop any comments below.