Corpora Database

Want to know which corpora have been developed for corpus research in Malaysia? Check out the basic information on the available corpora, provided by their compilers. This corpora database is an open-access online resource that you can use as a reference. Requests for these corpora should be made directly to the compilers. While the copyright of each corpus belongs to its respective compilers, the information on this page is maintained by the corpora managers of the Malaysian Corpus Research Network (MCRN) (Dr Syamimi Turiman, Ms. Christina Ong & Mr Alex Chang Li Xin).

This database is maintained as a record of the existing corpora developed in Malaysia. Do you have a corpus (Malaysian-based language data) and want us to promote it for you here? Drop us a line at mcrn.web@gmail.com.

If you would like your corpus to be included in this database, please fill in the form below. Please note that the list of corpora is updated once every six months.

https://docs.google.com/forms/d/1hmS6iKLGJuGZtibZgWBfarMAwYfiUawPMoh7Dm14-xA/viewform?edit_requested=true

To request a corpus, please contact the compilers or the person-in-charge directly (see the first and second columns of the tables below; the person-in-charge is marked with an *). Please submit this form along with your email to the compilers to avoid any problems and legal repercussions:
(Note that it is the responsibility of the requester and the compiler/s to keep this form, to avoid any legal issues in the future.)

CORPORA

MALAY LANGUAGE CORPORA

| Corpus (+link) | Compiler/s | #Words | Written and/or spoken | Time period | Genre |
| --- | --- | --- | --- | --- | --- |
| DBP | Dewan Bahasa dan Pustaka | 137 million words | Written texts | | Six subcorpora: books, magazines, newspapers, ephemera, traditional texts and working papers |
| KoBK Melayu (Bahasa Kemurungan di FB) | Dr. Rozaimah Rashidin*, Cik Nor Umiumairah Mohamad, Puan Umaimah Kamarulzaman (rozai451@uitm.edu.my) | 129,146 | Written texts | January 2020 to September 2020 | Social media (Facebook) |
| COVID-19_Astro Awani | Dr. Rozaimah Rashidin*, Cik Nor Umiumairah Mohamad, Puan Umaimah Kamarulzaman (rozai451@uitm.edu.my) | 287,477 | Written texts | September 2020 to October 2020 | Online newspaper |
| Hansard | Dr. Imran Ho Abdullah, Ms. Anis Nadiah Che Abdul Rahman, Dr. Azhar Jaludin | 165,061,050 | Verbatim + Reporting | 2007 to 2020 | Political discourse/Parliamentary corpus |

ENGLISH LANGUAGE CORPORA

| Corpus (+link) | Compiler/s | #Words | Written and/or spoken | Time period | Genre |
| --- | --- | --- | --- | --- | --- |
| MOSNEC | Dr. Tan Kim Hua* (kimmy@ukm.edu.my) | Approximately 10 million words | Written | 2015 | Sports News |
| EMAS | Dr. Arshad Samad | 472,458 | Written Texts | 2000 to 2002 | Picture Essay Writings by Malaysian Students |
| ICE Malaysia | Dr. Hajar Abdul Rahim*, Dr. Ang Leng Hong (hajar@usm.my) | Over 1/4 of 1 million words | Written and Spoken | | Various genres |
| Malaysian Corpus of Financial English (MaCFE) | Dr. Roslina Abdul Aziz, Roslan Sadjirin, Norzie Diana Baharum, Noli Maishara Nordin & Mohd Rozaidi Ismail | 4,373,230 | Written | 2009 to 2013 | Genre specific |
| MUET Corpus | Dr. Noorli Khamis* (noorli@utem.edu.my) | 1,479,218 | Written | 1999 to 2020 | Genre specific, POS tagged |
| Learner Medical Oral Case Presentations (LMOCP)** | Dr. Afida Mohamad Ali | 310,688 | Spoken | 2018 to 2019 | Genre specific, Annotation |

* Denotes the person-in-charge/contact person
** Denotes a corpus that is not available for public use