EVALUATING SIMILARITY MEASURES FOR MALAY NOISY TEXT NORMALIZATION: PERFORMANCE AND THRESHOLD ANALYSIS
Abstract
Noisy text normalization is a critical preprocessing step in natural language processing (NLP), particularly for user-generated content (UGC), which is rife with slang, abbreviations, and typographical errors. This extended study investigates the performance of multiple similarity measures in normalizing Malay noisy text, addressing gaps in prior studies that relied predominantly on rule-based approaches and single similarity measures. By systematically evaluating token-based, edit-based, and sequence-based similarity measures across a range of thresholds, this study provides a comprehensive analysis of their effectiveness and computational efficiency. The methodology comprises a two-phase experiment: an initial phase that identifies optimal thresholds on a small dataset, and a second phase that generalizes the findings on a larger dataset. Key findings reveal that edit-based measures, such as Levenshtein and Damerau-Levenshtein distance, consistently outperform the other measures at lower thresholds, achieving normalization success rates exceeding 83%. Ratcliff/Obershelp emerged as the most effective sequence-based measure, while token-based measures such as Jaccard and Cosine similarity performed comparatively poorly. The study also highlights the critical role of threshold selection in balancing normalization accuracy against flexibility. In addition, a computational time analysis underscores the trade-offs between accuracy and efficiency across the similarity categories. These findings pave the way for more robust and adaptable text normalization strategies, particularly for Malay language research.
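To make the threshold-based normalization scheme described above concrete, here is a minimal sketch of how a noisy Malay token might be matched against a reference lexicon using a normalized edit-based similarity score. The function name `normalize_token`, the toy lexicon, and the 0.7 threshold are illustrative assumptions, not the paper's actual implementation; Python's `difflib.SequenceMatcher` is shown as the sequence-based alternative because it implements the Ratcliff/Obershelp algorithm.

```python
# Minimal sketch of threshold-based noisy-token normalization.
# The lexicon, threshold value, and function names are illustrative
# assumptions, not the implementation evaluated in the study.
from difflib import SequenceMatcher  # implements Ratcliff/Obershelp

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """Map edit distance into a 0..1 similarity score."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def normalize_token(noisy: str, lexicon: list[str],
                    threshold: float = 0.7) -> str:
    """Replace a noisy token with its best lexicon match
    only if the similarity score clears the threshold."""
    best, best_score = noisy, 0.0
    for word in lexicon:
        score = lev_similarity(noisy, word)
        if score > best_score:
            best, best_score = word, score
    return best if best_score >= threshold else noisy

lexicon = ["tidak", "sangat", "terima", "kasih"]       # toy Malay lexicon
print(normalize_token("trima", lexicon))               # -> "terima"
print(normalize_token("x", lexicon))                   # unchanged: no match clears 0.7

# Sequence-based alternative: Ratcliff/Obershelp ratio via difflib.
print(SequenceMatcher(None, "trima", "terima").ratio())
```

Raising the threshold makes the normalizer more conservative (fewer but safer replacements), while lowering it increases flexibility at the risk of false corrections; this is the accuracy/flexibility trade-off the threshold analysis in the study quantifies.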

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.