MOTEC: THE MALAY OFFENSIVE TEXT CLASSIFICATION USING EXTRA TREE AND DIALECTAL STANDARDIZATION
Abstract
Cyberbullying has increased globally, and offensive text is a significant contributor. Detecting offensive text in the Malay language is challenging because of non-standard Malay text, distinctive social media writing styles, a lack of standardization, and limited resources. This study proposes the Malay Offensive Text Classification (MOTEC) framework to address these challenges. The MOTEC framework incorporates a Malay standardization preprocessing task that uses three specialized dictionaries: (a) abbreviations, (b) noisy text, and (c) Malaysian dialects. This approach enhances data quality by converting non-standard text into standardized Malay sentences before classification. For feature extraction, the framework employs Term Frequency-Inverse Document Frequency (TF-IDF), coupled with an Extra Tree classifier for the classification step. Evaluated on a private dataset collected from Twitter, the MOTEC framework achieved a classification accuracy of 94%, significantly outperforming other studies, which reported an accuracy of 84%. The MOTEC framework substantially improves the classification of offensive Malay text by increasing accuracy, reducing execution time, and improving data quality through effective language standardization.
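The dictionary-based standardization step described above can be illustrated with a minimal Python sketch. The dictionary entries, the lookup order, the function names, and the collapsing of elongated characters are all assumptions made for illustration; the abstract does not publish the contents of the three dictionaries or the exact preprocessing rules.

```python
import re

# Hypothetical, illustrative entries only. The actual MOTEC dictionaries
# are not given in the abstract.
ABBREVIATIONS = {"yg": "yang", "x": "tidak", "sgt": "sangat"}
NOISY_TEXT = {"bodo": "bodoh", "giler": "gila"}
DIALECTS = {"gapo": "apa", "mung": "kamu"}


def collapse_elongation(token):
    """Reduce any run of three or more identical characters to one.

    This is an assumed noise-handling rule, e.g. "bodooooh" -> "bodoh".
    """
    return re.sub(r"(.)\1{2,}", r"\1", token)


def standardize(text):
    """Lowercase, tokenize, and map each token through the three dictionaries."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    standardized = []
    for token in tokens:
        token = collapse_elongation(token)
        # Assumed order: abbreviations, then noisy text, then dialects.
        for dictionary in (ABBREVIATIONS, NOISY_TEXT, DIALECTS):
            token = dictionary.get(token, token)
        standardized.append(token)
    return " ".join(standardized)


if __name__ == "__main__":
    print(standardize("Mung ni bodooooh sgt"))
    print(standardize("gapo yg x kena"))
```

In a full pipeline, the standardized strings would then feed the TF-IDF feature extraction and the Extra Tree classifier, for example through scikit-learn's `TfidfVectorizer` and `ExtraTreesClassifier`, although the abstract does not state which library the authors used.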

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.