Want to know which corpora have been developed for corpus research in Malaysia? Check out the basic information on available corpora in Malaysia, as provided by their compilers. This corpora database is an open-access online resource that you can use as a reference. Requests for these corpora should be made directly to the compilers. While the copyright of each corpus belongs to its respective compilers, the information on this page is maintained by the corpora managers of the Malaysian Corpus Research Network (MCRN): Dr Syamimi Turiman, Ms Christina Ong and Mr Alex Chang Li Xin.
This database is maintained as a record of the existing corpora developed in Malaysia. Do you have a corpus (Malaysian-based language data) and want us to promote it for you here? Drop us a line at mcrn.web@gmail.com.
If you would like your corpora to be included in this database, please fill in the form. Please note that the list of corpora will be updated once every 6 months.
To request a corpus, please contact the compilers or the person-in-charge directly (see the first and second columns of the table below; the person-in-charge is indicated by an *). Please submit this form along with your email to the compilers to avoid any problems and legal repercussions:
(Note that it is the responsibility of the requester and the compiler/s to keep this form to avoid any legal issues in the future.)
CORPORA
MALAY LANGUAGE CORPORA
| Corpus (+link) | Compiler/s | #Words | Written and/or spoken | Time period | Genre |
| --- | --- | --- | --- | --- | --- |
| DBP | Dewan Bahasa dan Pustaka | 137 million words | Written texts | – | “Six subcorpora: books, magazines, newspapers, ephemera, traditional texts and working papers” |
| KoBK Melayu (Bahasa Kemurungan di FB) | Dr. Rozaimah Rashidin*, Cik Nor Umiumairah Mohamad, Puan Umaimah Kamarulzaman (rozai451@uitm.edu.my) | 129,146 | Written texts | January 2020 to September 2020 | Social media (Facebook) |
| COVID-19_Astro Awani | Dr. Rozaimah Rashidin*, Cik Nor Umiumairah Mohamad, Puan Umaimah Kamarulzaman (rozai451@uitm.edu.my) | 287,477 | Written texts | September 2020 to October 2020 | Online newspaper |
| Hansard | Dr. Imran Ho Abdullah, Ms. Anis Nadiah Che Abdul Rahman, Dr. Azhar Jaludin | 165,061,050 | Verbatim + Reporting | 2007 to 2020 | Political Discourse/Parliamentary Corpus |
ENGLISH LANGUAGE CORPORA
| Corpus (+link) | Compiler/s | #Words | Written and/or spoken | Time period | Genre |
| --- | --- | --- | --- | --- | --- |
| MOSNEC | Dr. Tan Kim Hua* (kimmy@ukm.edu.my) | Approximately 10 million words | Written | 2015 | Sports News |
| EMAS | Dr. Arshad Samad | 472,458 | Written Texts | 2000 to 2002 | Picture Essay Writings by Malaysian Students |
| ICE Malaysia | Dr. Hajar Abdul Rahim*, Dr. Ang Leng Hong (hajar@usm.my) | Over a quarter of a million words | Written and Spoken | – | Various genres |
| Malaysian Corpus of Financial English (MaCFE) | Dr. Roslina Abdul Aziz, Roslan Sadjirin, Norzie Diana Baharum, Noli Maishara Nordin & Mohd Rozaidi Ismail | 4,373,230 | Written | 2009 to 2013 | Genre specific |
| MUET Corpus | Dr. Noorli Khamis* (noorli@utem.edu.my) | 1,479,218 | Written | 1999 to 2020 | Genre specific, POS tagged |
| Learner Medical Oral Case Presentations (LMOCP)** | Dr. Afida Mohamad Ali | 310,688 | Spoken | 2018 to 2019 | Genre specific, Annotation |
* Denotes Person-In-Charge/Contact Person
** Denotes corpora not available for public use