NLP and Hebrew support: converting and deciphering voice and text
1 January 2022
Natural Language Processing (NLP) is a subfield of AI and linguistics that deals with the comprehension, interpretation and processing of language, with the goal of making computers “understand” content spoken or written in human language.
An outline of the problem
When language technology and machine learning were adapted to Hebrew, converting Hebrew speech to text turned out to work reasonably well; the harder problem is deciphering the meaning of the resulting texts:
Because Hebrew syntax and grammar differ greatly from those of the languages that most NLP modules were built for, existing modules cannot be used directly with satisfactory results. For example, Hebrew is usually written without vowel marks, so a three- or four-letter word can often be read in several different ways. Furthermore, word order in Hebrew is considerably freer than in English, so the position of a word in a sentence is a much weaker cue to its role.
Machines also struggle to understand the meaning of sentences and to place words in their correct context.
Since there are relatively few Hebrew speakers, the companies that develop these modules have little economic justification for investing in this issue, even though some of them have development centers in Israel.
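The ambiguity of spelling without vowels, described above, can be illustrated with a toy lookup table. This is only a sketch: the `READINGS` table and the `possible_readings` function are hypothetical names made up for this example, the transliterations are informal, and the entries are far from exhaustive. A real system would need a full morphological lexicon plus context to choose between the readings.

```python
# Toy lexicon: one unvocalized spelling mapped to several possible readings.
# The entries are illustrative only and far from exhaustive.
READINGS = {
    "ספר": [
        ("sefer", "book"),
        ("sapar", "barber"),
        ("safar", "he counted"),
        ("siper", "he told"),
    ],
    "דבר": [
        ("davar", "thing"),
        ("diber", "he spoke"),
        ("dever", "plague"),
    ],
}


def possible_readings(word: str) -> list[tuple[str, str]]:
    """Return every known (transliteration, gloss) pair for a spelling."""
    return READINGS.get(word, [])


for word in READINGS:
    options = ", ".join(f"{t} ({g})" for t, g in possible_readings(word))
    print(f"{word}: {options}")
```

The same three letters can be a noun or a verb, which is why a machine cannot settle on a meaning without looking at the rest of the sentence.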
Attempts at a solution
One solution is being pursued by the ICT Authority, which in 2020 began actively building a database for training machines to understand the Hebrew language. The database will be available to governmental services, start-ups, and large corporations.
The database, “a manually tagged corpus of Contemporary Hebrew”, is a collection of Hebrew sentences in which each word is annotated with its dictionary entry, its part of speech (noun, verb, etc.), and its syntactic role (subject of the sentence, etc.).
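As a rough picture of what one entry in such a tagged corpus might look like, here is a minimal sketch. The actual format of the ICT Authority's corpus is not described here, so the `Token` class, its field names, and the tag labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    surface: str  # the word as written in the sentence
    lemma: str    # dictionary entry (hypothetical field name)
    pos: str      # part of speech
    role: str     # syntactic role in the sentence


# "The boy read a book", annotated by hand for this example.
sentence = [
    Token("הילד", "ילד", "NOUN", "subject"),
    Token("קרא", "קרא", "VERB", "predicate"),
    Token("ספר", "ספר", "NOUN", "object"),
]


def words_with_role(tokens: list[Token], role: str) -> list[str]:
    """Return the surface forms that fill a given syntactic role."""
    return [t.surface for t in tokens if t.role == role]


print(words_with_role(sentence, "subject"))
```

Note that the word ספר is tagged here as a noun ("book"); in another sentence the same spelling could be tagged as a verb, and that is exactly the kind of information the manual tagging supplies.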
Once a machine learning program has been trained on many such sentences, it will be able to begin handling sentences it has not seen before. In 2021, the project received a special certificate of appreciation from the panel of judges at the Bureau of Information Technology in Israel conference.
Another approach, used by researchers working toward Hebrew-speaking voice assistants (like the English-speaking Alexa and Siri), is to develop language models based on Big Data. These models are trained on large amounts of text and process each word in context, in relation to the words that precede and follow it in the sentence. One example is Google’s BERT (Bidirectional Encoder Representations from Transformers), introduced in 2018 and deployed in Google Search in 2019. BERT is built on an artificial neural network, and it gives Google a better understanding of the context of each word in a query, especially where a word has a double meaning. One of BERT’s innovations is its ability to capture links between words that are near each other but not necessarily adjacent. As a result, it can even guess missing words.
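The idea of guessing a missing word from the words on both sides of it can be shown with a very small sketch. This is not BERT and not a neural network: it is a simple count-based stand-in that looks only at the immediate left and right neighbours, whereas BERT takes the whole sentence into account. The corpus, the `[MASK]` placeholder handling, and the function names are assumptions made up for this example, and English sentences are used to keep it easy to read.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus; a real model is trained on vastly more text.
CORPUS = [
    "the boy read a book",
    "the girl read a book",
    "the boy ate a sandwich",
    "the girl read a letter",
]


def build_context_counts(sentences):
    """Count which words appear between each (left, right) pair of neighbours."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(words) - 1):
            counts[(words[i - 1], words[i + 1])][words[i]] += 1
    return counts


def guess_missing(sentence, counts, mask="[MASK]"):
    """Guess the masked word from the words on both sides of it."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    i = words.index(mask)
    candidates = counts.get((words[i - 1], words[i + 1]))
    return candidates.most_common(1)[0][0] if candidates else None


counts = build_context_counts(CORPUS)
print(guess_missing("the girl [MASK] a book", counts))
```

Because the guess depends on both the preceding and the following word, the sketch is bidirectional in the loosest sense; what it cannot do, and what models like BERT are designed for, is use words further away in the sentence.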
In conclusion, despite the objective difficulties that stem from the properties of Hebrew and other Semitic languages compared to other languages, technology is advancing at an accelerated pace. Considering also that besides approximately ten million Hebrew speakers there are about three hundred million Arabic speakers, we can hope to soon enjoy voice and text chatbots that fully support Hebrew as well.