Malay Proverb Detection; Implementation on Mobile Environment.

This is one of the papers presented at MobilCase 2012, the 1st International Conference on Mobile Learning, Applications, and Services (mobilcase2012), organised by UTEM, Malaysia.


Download the PDF version at –> http://www.scribd.com/doc/115258852/Malay-Proverb-Detection-Implementation-on-Mobile-Environment-MobilCase2012

Khirulnizam Abd Rahman, Norita Md Norwawi
Faculty of Science and Technology
Universiti Sains Islam Malaysia
khirulnizam@gmail.com

Abstract: This paper describes the design and implementation of automated detection of Malay proverbs (peribahasa) on a mobile platform. Proverbs are a distinctive feature of the Malay language, in which a message or piece of advice is communicated indirectly through metaphoric phrases. However, proverbs can cause confusion or misinterpretation for readers who are not familiar with the phrases, since they cannot be translated literally and must instead be translated logically. This application was therefore developed to identify proverbs that appear in Malay text and display their actual meanings from a dictionary. Recognition uses a look-ahead and regular-expression matching approach.

Keywords: proverb automated detection, proverb detection, pattern matching, and information retrieval.

I. INTRODUCTION

Proverbs (peribahasa) in the Malay language are an elegant means of delivering advice, Malay teachings, moral values and comparisons through metaphoric phrases [9]. A proverb is normally a short, generally known sentence of the folk which contains wisdom, truth, morals and traditional views in a metaphorical, fixed and memorable form, and which is handed down from generation to generation [5]. Although proverbs beautify Malay literature, they pose a challenge to machine translation, since a proverb cannot be translated literally, only logically [4].

There are four categories of Malay proverbs, as described by Abdullah & Ainon [1]: simpulan bahasa, perumpamaan, bidalan and pepatah.

Simpulan bahasa – normally consists of two words. The literal meaning of the word combination differs from the actual meaning of the simpulan bahasa. Example: langkah kanan; literally means right footstep, yet the actual meaning is lucky.

Perumpamaan – phrases that start with seolah-olah, ibarat, bak, seperti, macam, bagai or laksana. Example: bagaikan pinang dibelah dua; literally means like a betel nut split evenly in two, yet the actual meaning is a well-matched, equally good-looking newly married bride and bridegroom.

Pepatah – proverbs that contain advice or teachings. Example: Adat berperang, yang kalah jadi abu, menang jadi arang; literally means in war, the loser becomes ashes and the winner becomes coal, yet the actual meaning is in war, the defeated and the winner are both losers.

Bidalan – phrases (pepatah) that start with jangan, biar or ingat. Example: Kalau kail panjang sejengkal, lautan dalam jangan diduga; literally means if you have a short hook, do not attempt to fish in the deep sea, yet the actual meaning is if you have little knowledge, do not dare to dream big.

Technically, a simpulan bahasa consists of two combined words, while the other categories have more than two words. This word-count criterion is important in the proverb identification process.
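The category descriptions above can be sketched as a rule-of-thumb classifier. This is only an illustration under stated assumptions: the function name `classify_proverb` is hypothetical, a "-kan" suffix on the first word is stripped so that "bagaikan" counts as "bagai", and the bidalan markers are looked for anywhere in the phrase (not only at the start), because the bidalan example above has "jangan" mid-phrase.

```python
# Hypothetical rule-of-thumb classifier; names and rules are illustrative only.
PERUMPAMAAN_MARKERS = {"seolah-olah", "ibarat", "bak", "seperti",
                       "macam", "bagai", "laksana"}
BIDALAN_MARKERS = {"jangan", "biar", "ingat"}


def classify_proverb(proverb: str) -> str:
    """Guess the category of a Malay proverb from its surface form."""
    words = proverb.lower().replace(",", " ").split()
    if not words:
        raise ValueError("empty proverb")
    # Word-count criterion: two words means simpulan bahasa.
    if len(words) == 2:
        return "simpulan bahasa"
    # Assumption: treat "bagaikan" as "bagai" by stripping a "-kan" suffix.
    first = words[0][:-3] if words[0].endswith("kan") else words[0]
    if first in PERUMPAMAAN_MARKERS:
        return "perumpamaan"
    # Assumption: a bidalan marker may appear anywhere in the phrase.
    if BIDALAN_MARKERS.intersection(words):
        return "bidalan"
    return "pepatah"


print(classify_proverb("bagaikan pinang dibelah dua"))
```

Only the two-word rule is used by the detection algorithms in this paper; the other rules are shown just to make the category definitions concrete.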

II. PREVIOUS WORKS

Proverbs are one category of multiword expression (MWE). Rais et al. [6] indexed Malay MWEs using a combination of a query translation approach and weighting schemes. The researchers stressed that a good dictionary is crucial in multiword detection. Aiti et al. [2], on the other hand, suggested several linguistic patterns for MWE detection; still, their system needs to rely on a Malay corpus of 412,228 words.

Among research focusing directly on proverbs, Brahmaleen et al. [3] suggested a method for identifying proverbs in Hindi text and translating them into Punjabi. The approach implemented is pattern matching. The system has a dictionary that maps each Hindi proverb to the equivalent proverb in Punjabi.

Another work, from ATMA, UKM, provides a searchable database of Malay proverbs and idioms [8]. The application receives a complete or partial proverb/idiom, uses SQL pattern matching to search the list of proverbs/idioms, and outputs all the proverbs/idioms that are similar to the user’s request.

Dmitra [4] studied the METIS II translation engine, which is capable of translating several languages such as Dutch, German, Greek and Spanish into English. The pre-translation process involves tokenization, lemmatization, tagging and chunking. Using the existing METIS II facilities, she managed to embed an idiom translation tool for German into English. However, its proverb detection uses a syntactic matching rules approach, which is not elaborated in this paper.
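To illustrate what SQL pattern matching over a proverb list might look like, here is a minimal sketch using Python's built-in `sqlite3` module and the `LIKE` operator. The table name, column names and sample rows are assumptions made for this example; the actual schema and database engine of the system in [8] are not described here.

```python
import sqlite3


def search_proverbs(conn, fragment):
    """Return (proverb, meaning) rows whose proverb contains the fragment."""
    # LIKE with % wildcards matches the fragment anywhere in the proverb.
    rows = conn.execute(
        "SELECT proverb, meaning FROM proverbs "
        "WHERE proverb LIKE ? ORDER BY proverb",
        (f"%{fragment}%",),
    )
    return rows.fetchall()


# Hypothetical in-memory table with a few entries taken from this paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proverbs (proverb TEXT, meaning TEXT)")
conn.executemany(
    "INSERT INTO proverbs VALUES (?, ?)",
    [
        ("langkah kanan", "lucky"),
        ("mata air", "lover; underground water resource"),
        ("air muka", "face; pride"),
    ],
)

print(search_proverbs(conn, "air"))
```

A partial query such as "air" returns every entry that contains it, which matches the "complete or partial proverb" behaviour described above.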

III. THE DESIGN

The researchers decided to experiment with a combination of the look-ahead and regular-expression matching approaches. Using these approaches, the Malay input text is tokenized and compared to the proverb database. The search is done by combining a token with the next one; this combination of tokens is then searched for in the proverb database. The output is the list of proverbs detected in the input text, with the meaning of each proverb. The processes are summarized in Figure 1.

The database consists of 3000 Malay proverbs and resides on the phone itself. This minimizes bandwidth consumption and allows results to be displayed faster. A simple approach is used to identify simpulan bahasa, since each is formed by two combined words (Algorithm 1). For the rest of the Malay proverbs, Algorithm 2 is implemented. These algorithms are implemented in the proverb detection module of the application, after tokenization (Figure 1).
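The tokenizing and token-combining steps might be sketched as follows. This is an assumption-laden illustration, not the application's actual tokenizer: the helper names `tokenize` and `combine` are hypothetical, text is lowercased, and only the letters a-z are treated as word characters, with hyphenated words such as "seolah-olah" kept whole.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Hyphenated words such as "seolah-olah" are kept as one token.
    """
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def combine(tokens, start, count):
    """Join `count` tokens beginning at index `start` into one search key."""
    return " ".join(tokens[start:start + count])


tokens = tokenize("Dia seolah-olah tahu, langkah kanan!")
print(tokens)
print(combine(tokens, 3, 2))
```

The combined string is what would be looked up in the proverb database.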

While not-end-of-sentence
    words-combined = w[i] w[i+1]
    Search words-combined in proverb-database
    If found in proverb-database
        Put words-combined in the proverb-list-output
        i = i + 2
    Else
        i++
Wend

Algorithm 1: To detect simpulan bahasa.
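Algorithm 1 can be written as a short runnable sketch. The function name `detect_simpulan_bahasa`, the use of a plain dictionary as the proverb database, and the sample sentence are assumptions for illustration; the application's real storage is not shown in this paper.

```python
def detect_simpulan_bahasa(tokens, proverb_db):
    """Scan adjacent word pairs and collect those found in proverb_db.

    tokens     -- list of lowercase words from one sentence
    proverb_db -- dict mapping a two-word proverb to its list of meanings
    """
    found = []
    i = 0
    while i < len(tokens) - 1:          # stop when no pair is left
        pair = f"{tokens[i]} {tokens[i + 1]}"
        if pair in proverb_db:
            found.append((pair, proverb_db[pair]))
            i += 2                      # skip past the matched pair
        else:
            i += 1                      # slide the window by one word
    return found


# Hypothetical mini database using entries quoted in this paper.
db = {
    "bawa diri": ["running away", "sulking", "being independent"],
    "air muka": ["face", "pride"],
}
sentence = "dia bawa diri selepas hilang air muka".split()
print(detect_simpulan_bahasa(sentence, db))
```

A dictionary lookup keeps each search step constant-time, which suits a database held on the phone.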

While not-end-of-sentence
    words-combined = w[i] w[i+1] w[i+2]
    Search words-combined in proverb-database
    If found in proverb-database
        Put words-combined in the proverb-list-output
        i = i + number-of-words-in-proverb-detected
    Else
        i++
Wend

Algorithm 2: To detect other proverbs (other than simpulan bahasa).
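A runnable sketch in the spirit of Algorithm 2 is given below. The pseudocode shows a three-word window; this sketch generalizes it by trying every window length from the longest database entry down to three words, which is an assumption about how the look-ahead could work and not a description of the application's code. The function name `detect_long_proverbs` and the sample data are likewise hypothetical.

```python
def detect_long_proverbs(tokens, proverb_db):
    """Find proverbs of three or more words using a look-ahead window.

    tokens     -- list of lowercase words from one sentence
    proverb_db -- dict mapping a proverb string to its meaning
    """
    # Longest entry in the database bounds how far we look ahead.
    max_len = max((len(p.split()) for p in proverb_db), default=0)
    found = []
    i = 0
    while i < len(tokens):
        matched = 0
        # Assumption: prefer the longest match, down to three words.
        for n in range(min(max_len, len(tokens) - i), 2, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in proverb_db:
                found.append((candidate, proverb_db[candidate]))
                matched = n
                break
        # Jump past the detected proverb, else move one word forward.
        i += matched if matched else 1
    return found


db = {"bagaikan pinang dibelah dua": "a well-matched couple"}
print(detect_long_proverbs("mereka bagaikan pinang dibelah dua".split(), db))
```

Advancing by the number of words in the detected proverb mirrors the `i = i + number-of-words-in-proverb-detected` step of the pseudocode.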

Figure 1. Implementation of Malay proverb identification using pattern matching approach.

IV. DISCUSSIONS

There are several challenges in identifying Malay proverbs in text using pattern matching:

a) Words with affixes – Example: “kembang sayap” = “mengembangkan sayap” (spread your wings).

b) Another word in between (stopword) – Example: “berpijak di bumi nyata”, or sometimes “berpijak di bumi yang nyata”, which means “do not daydream”.

The researchers found that simple pattern matching fails to detect proverbs that have the aforementioned problems.
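One possible way to tolerate an intervening stopword, offered only as a sketch and not as what the application does, is to compile each proverb into a regular expression that allows an optional stopword between consecutive words. The name `build_pattern` and the one-item stopword list are assumptions. Affixed forms are not handled here: prefixing can change the first letter of the root (as in "kembang" becoming "mengembangkan"), so a plain substring or regex match on the root does not find them.

```python
import re

# Hypothetical stopword list; only "yang" comes from the example above.
STOPWORDS = ("yang",)


def build_pattern(proverb):
    """Compile a regex that matches the proverb, allowing one optional
    stopword between any two consecutive words."""
    words = [re.escape(w) for w in proverb.lower().split()]
    gap = r"\s+(?:(?:" + "|".join(STOPWORDS) + r")\s+)?"
    return re.compile(r"\b" + gap.join(words) + r"\b")


pattern = build_pattern("berpijak di bumi nyata")
print(bool(pattern.search("kita mesti berpijak di bumi yang nyata")))

# The affix problem remains: the root "kembang" is not a substring here.
print(build_pattern("kembang sayap").search("dia mengembangkan sayap"))
```

The second call returns no match, which reproduces challenge (a); handling it would need some form of Malay stemming, which is outside this sketch.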

Another challenge is determining the right meaning for a proverb that is ambiguous, since the same proverb may have more than one meaning [8]. In this experiment, however, all meanings are listed for a proverb with ambiguous meaning. For example:

— “mata air” means 1) lover, or 2) underground water resource.

— “orang putih” means 1) pious man, or 2) European (white) people.

— “air muka” means 1) face, 2) pride.

— “bawa diri” means 1) running away, 2) sulking, 3) being independent.

— “bagai cicak makan kapur” means 1) ashamed because of his own offense, 2) pleased.

— “ada air adalah ikan” means 1) there must be people in a country, 2) fortune is everywhere.
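Listing every meaning of an ambiguous proverb can be sketched with a dictionary that maps each proverb to a list of meanings. The names `MEANINGS`, `lookup` and `format_entry` are hypothetical, and the entries are taken from the examples above; how the application stores its dictionary is not described in this paper.

```python
# Hypothetical in-memory dictionary: each proverb maps to all its meanings.
MEANINGS = {
    "mata air": ["lover", "underground water resource"],
    "orang putih": ["pious man", "European (white) people"],
    "air muka": ["face", "pride"],
    "bawa diri": ["running away", "sulking", "being independent"],
}


def lookup(proverb):
    """Return every recorded meaning, or an empty list if unknown."""
    return MEANINGS.get(proverb.lower().strip(), [])


def format_entry(proverb):
    """Render the meanings as a numbered list, e.g. '1) face; 2) pride'."""
    return "; ".join(f"{n}) {m}" for n, m in enumerate(lookup(proverb), 1))


print(format_entry("air muka"))
```

Showing all meanings leaves disambiguation to the reader, as in the experiment described above.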

V. CONCLUSIONS

Brief testing and observation have been carried out on the results of this application. The researchers concluded that there is much more to be improved, including introducing another approach. One of the approaches to be studied is the syntactic matching rules proposed by Dmitra [4] for detecting German idioms.

This Malay proverb identifier is a prototype for experimenting with the pattern matching approach on a mobile platform. Though it is still at the experimentation stage, the researchers hope that it can contribute to the public by helping new learners of the Malay language. The application is currently available for the Android platform at http://bit.ly/pbahasa .

VI. REFERENCES

[1] Abdullah Hassan and Ainon Mohd. 2011. Kamus Peribahasa Kontemporari, Edisi Ketiga. PTS Professional Publishing.

[2] Aiti Aw, Sharifah Mahani Aljunied and Haizhou Li. 2009. Malay Multi-word Expression Translation. Second Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST 2009), Suntec, Singapore.

[3] Brahmaleen K. Sidhu, Arjan Singh and Vishal Goyal. 2010. Identification of Proverbs in Hindi Text Corpus and their Translation into Punjabi. Journal of Computer Science and Engineering, Vol. 2, Issue 1, July 2010.

[4] Dmitra Anastasiou. 2010. PhD Thesis, Universität des Saarlandes. Unpublished.

[5] Mieder, Wolfgang. 2003. Proverbs are Never out of Season. Popular Wisdom in the Modern Age. New York: Oxford University Press.

[6] N.H. Rais, M.T. Abdullah and R.A. Kadir. 2011. Multiword Phrases Indexing for Malay-English Cross Language Information Retrieval. Information Technology Journal, 2011.

[7] S.A. Noah and F. Ismail. 2008. Automatic Classifications of Malay Proverbs Using Naive Bayesian Algorithm. Information Technology Journal, 2008.

[8] Supyan Hussin, Ding Choo Ming, Afendi Hamat & Arba’eyah Abdul Rahman. 2004. Kamus Peribahasa Melayu Digital yang Pertama. Sari 22 (2004), 49-61.

[9] Susana Widyastuti. 2010. Peribahasa: Cerminan Kepribadian Budaya Lokal Dan Penerapannya Di masa Kini. Proceeding of Seminar Nasional UTY 3 Juli 2010.
