An English–Khmer transliteration system built on an Attention-Based Bidirectional GRU architecture. The model automatically converts romanized Khmer text written in the Latin alphabet into its corresponding Khmer script form. To enhance accuracy and ensure linguistic validity, the system incorporates a Khmer dictionary-based post-processing step for proof checking and correction.
The dataset used in this project was sourced from the Khmer Text Transliteration Dataset by Chhunneng (2023), which provides parallel pairs of English–Khmer transliterations for machine learning research.
There are 77 unique Khmer characters and 26 unique English characters in the dataset. The maximum sequence length for English (romanized Khmer) inputs is 25 characters, while the maximum Khmer output length is 24 characters.
Figure 1 | Word length distribution of romanized Khmer and Khmer script.

- Total Samples: 28,569
- Train Set: 22,855 (80%)
- Validation Set: 5,714 (20%)
- Format: Parallel text pairs (`brodae: ប្រដែ`)
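A minimal sketch of how such parallel pairs could be split 80/20 and turned into fixed-length index sequences. The helper names (`split_pairs`, `build_vocab`, `encode`), the shuffling seed, and the use of index 0 for padding are illustrative assumptions, not the repository's actual preprocessing code.

```python
import random

def split_pairs(pairs, val_fraction=0.2, seed=42):
    """Shuffle parallel pairs and split them into (train, validation)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def build_vocab(strings):
    """Map each unique character to an index; 0 is reserved for padding (assumption)."""
    chars = sorted({ch for s in strings for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab, max_len):
    """Convert a string to a list of indices, right-padded with 0 to max_len."""
    ids = [vocab[ch] for ch in text][:max_len]
    return ids + [0] * (max_len - len(ids))

# The one pair shown in this README; a real run would load the full dataset.
pairs = [("brodae", "ប្រដែ")]
src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)

# Maximum lengths reported above: 25 for romanized input, 24 for Khmer output.
encoded_src = encode("brodae", src_vocab, max_len=25)
encoded_tgt = encode("ប្រដែ", tgt_vocab, max_len=24)
```

With 28,569 samples and `val_fraction=0.2`, this rounding yields the 22,855 / 5,714 split listed above.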
The model is based on an Attention-Based Bidirectional GRU architecture designed for sequence-to-sequence transliteration. It follows an encoder–decoder structure, where the encoder processes the input Latin Script (English) sequence, and the decoder generates the corresponding Khmer script sequence character by character.
Figure 2 | The proposed Attention-based Bidirectional Gated Recurrent Unit architecture.

The encoder uses a Bidirectional GRU layer to process the input sequence in both the forward and backward directions. This allows the model to better capture dependencies across the entire input text, which is particularly useful for transliteration tasks where phonetic relationships depend on both preceding and succeeding characters.
Each input token is first mapped into a continuous vector space through an embedding layer, which converts discrete character indices into dense embeddings of dimension 32. The bidirectional GRU then encodes these embeddings into a hidden state representation that encapsulates forward and backward context information.
The decoder consists of a GRU layer that processes the output sequence one Khmer character at a time. At each decoding step, it receives the previously predicted token and the projected hidden state from the encoder. The attention mechanism then combines the encoder’s output representations with the decoder’s current hidden state to generate a context vector, which helps the model focus on the most relevant parts of the input sequence. This context vector is concatenated with the decoder’s GRU output and passed through a dense softmax layer to produce the final character prediction.
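The attention step described above can be sketched in plain Python. This is the general Bahdanau-style additive form, score = v · tanh(W_enc h + W_dec s); the projection matrices `w_enc` and `w_dec`, the vector `v`, and the toy values are assumptions for illustration, and the layer used in the repository may differ (for example, by omitting the projections).

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(encoder_outputs, decoder_state, w_enc, w_dec, v):
    """Return (context_vector, attention_weights) for one decoding step."""
    projected_state = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_outputs:
        projected_h = matvec(w_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_h, projected_state)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # Context vector: attention-weighted sum of the encoder outputs.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
               for d in range(dim)]
    return context, weights

# Toy example: three encoder time steps with 2-dimensional states.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_outputs = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.3]]
context, weights = additive_attention(
    encoder_outputs, decoder_state=[0.5, 0.5],
    w_enc=identity, w_dec=identity, v=[1.0, 1.0],
)
```

In the full decoder, this context vector would be concatenated with the GRU output before the softmax layer, as described above.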
- Embedding Dimension: 32
- GRU Units: 64
- Attention Mechanism: Additive Attention
- Batch Size: 64
- Epochs: 50
- Validation Split: 20%
- Optimizer: Adam
- Loss Function: Sparse Categorical Crossentropy
- Learning Rate Scheduler: ReduceLROnPlateau (factor=0.5, patience=3)
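To illustrate the scheduler setting (factor=0.5, patience=3), here is a simplified simulation of reduce-on-plateau behaviour. It is a sketch, not the framework's implementation: it ignores options such as a minimum improvement threshold or cooldown, and the initial learning rate of 0.001 is an assumption, since this README does not state one.

```python
def simulate_reduce_lr_on_plateau(val_losses, initial_lr=1e-3, factor=0.5, patience=3):
    """Return the learning rate used in each epoch, given per-epoch validation losses."""
    lr = initial_lr
    best = float("inf")
    wait = 0
    lr_per_epoch = []
    for loss in val_losses:
        lr_per_epoch.append(lr)  # rate in effect during this epoch
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # halve the rate after `patience` stagnant epochs
                wait = 0
    return lr_per_epoch

# Hypothetical loss curve: improvement stalls after the second epoch.
schedule = simulate_reduce_lr_on_plateau([1.0, 0.9, 0.9, 0.9, 0.9, 0.8])
```

In this toy curve the validation loss fails to improve for three consecutive epochs, so the rate is halved once, starting from the sixth epoch.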
To maximize accuracy without adding model bloat, our post-processing pipeline ensures all outputs are valid Khmer words. The model first generates three candidate transliterations using Beam Search (k=3). These are cross-referenced with a standard dictionary; if no exact match exists, a Levenshtein distance fallback corrects the prediction to the closest valid word (up to a 2-character edit limit).
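The dictionary check with a Levenshtein fallback can be sketched as follows. The function names, the tie-breaking rule (the earliest dictionary entry wins), and the Latin-alphabet toy words standing in for Khmer dictionary entries are assumptions for illustration; the repository's `transliterate_with_dict` may behave differently.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def validate_candidates(candidates, dictionary, max_distance=2):
    """Keep exact dictionary hits; otherwise substitute the closest entry within max_distance."""
    valid = []
    for cand in candidates:
        if cand in dictionary:
            choice = cand
        else:
            choice, best = None, max_distance + 1
            for word in dictionary:
                dist = levenshtein(cand, word)
                if dist < best:  # strict '<' keeps the earliest entry on ties
                    choice, best = word, dist
        if choice is not None and choice not in valid:
            valid.append(choice)
    return valid

# Toy stand-ins for dictionary entries and beam search candidates.
toy_dictionary = ["kitten", "sitting", "dog"]
corrected = validate_candidates(["dig", "kitten", "qqqqqq"], toy_dictionary)
```

Here `"dig"` is one edit away from `"dog"` and is corrected, `"kitten"` is an exact match, and `"qqqqqq"` is dropped because no entry lies within two edits.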
Figure 3 | End-to-end inference pipeline with post-processing. The input flows through the encoder-decoder model, which generates three predictions using beam search (k=3). A lexical validation layer then cross-references these outputs with a Khmer dictionary, and Levenshtein distance is used to recover the closest orthographically valid entries, so that every final word is a valid dictionary entry.

To evaluate the performance of the model, Character Error Rate (CER) was used. CER quantifies the number of errors at the character level, providing a direct measure of the model's accuracy in converting individual romanized graphemes to their corresponding Khmer script. It is calculated as:

$$\mathrm{CER} = \frac{S + D + I}{N} \times 100\%$$

where $S$, $D$, and $I$ are the numbers of character substitutions, deletions, and insertions needed to turn the prediction into the reference, and $N$ is the number of characters in the reference.
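A minimal sketch of a CER computation, assuming the standard definition (character-level edit distance divided by reference length) and corpus-level aggregation, where edits and reference characters are each summed over all samples before dividing. Whether the reported numbers were aggregated this way or averaged per word is not stated here. Characters are counted as Unicode code points.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, deletions, and insertions between two strings."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i
    for j in range(cols):
        table[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,         # deletion
                              table[i][j - 1] + 1,         # insertion
                              table[i - 1][j - 1] + cost)  # substitution
    return table[-1][-1]

def character_error_rate(references, hypotheses):
    """Corpus-level CER in percent."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars

perfect = character_error_rate(["ប្រដែ"], ["ប្រដែ"])
overlong = character_error_rate(["ab"], ["abcde"])
```

Because insertions count as errors, a prediction much longer than its reference can push CER above 100%, as in the `overlong` case.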
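Additionally, to evaluate the post-processing technique, we assess word-level accuracy by measuring Top-1 accuracy and the Top-k hit rate. We define Top-1 accuracy as the percentage of test samples where the highest-ranked candidate produced by the system exactly matches the ground truth:

$$\text{Top-1} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}\left[\hat{y}_i^{(1)} = y_i\right] \times 100\%$$

where $M$ is the number of test samples, $y_i$ is the ground-truth Khmer word for sample $i$, $\hat{y}_i^{(1)}$ is the highest-ranked candidate for that sample, and $\mathbf{1}[\cdot]$ is the indicator function.

Top-k hit accuracy measures the frequency with which the correct Khmer word appears within the list of $k$ candidates produced by the system:

$$\text{Top-}k = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}\left[y_i \in \{\hat{y}_i^{(1)}, \dots, \hat{y}_i^{(k)}\}\right] \times 100\%$$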
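The two word-level metrics can be sketched as below, assuming each test sample comes with a ranked list of candidates (best first). The function names and the toy candidate lists are illustrative only.

```python
def top1_accuracy(ranked_candidates, references):
    """Percentage of samples whose highest-ranked candidate equals the reference."""
    hits = sum(1 for cands, ref in zip(ranked_candidates, references)
               if cands and cands[0] == ref)
    return 100.0 * hits / len(references)

def topk_hit_rate(ranked_candidates, references, k):
    """Percentage of samples whose reference appears among the first k candidates."""
    hits = sum(1 for cands, ref in zip(ranked_candidates, references)
               if ref in cands[:k])
    return 100.0 * hits / len(references)

# Toy data: four samples, each with up to three ranked candidates.
toy_candidates = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"], ["p"]]
toy_references = ["a", "y", "q", "p"]

top1 = top1_accuracy(toy_candidates, toy_references)
top3 = topk_hit_rate(toy_candidates, toy_references, k=3)
```

With a single candidate per sample, as in greedy decoding, the two metrics coincide by construction.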
Table 1 | Model comparison on the validation set using Character Error Rate (CER, %). Lower CER indicates better performance. The lowest CER is in bold and the second lowest is in italics.

| Model | CER (%) | Parameters |
|---|---|---|
| RNN | 102.41 | 42,609 |
| LSTM | 31.78 | 58,417 |
| GRU | 51.64 | 46,385 |
| Transformer | *17.78* | 248,337 |
| Attention BiLSTM | 18.26 | 96,753 |
| AkaraAlpha | **15.07** | 78,705 |
Table 2 | Performance comparison of system configurations: Top-1 accuracy, Top-K hit rate, and average inference latency for greedy decoding and varying beam search widths (k).
| Configuration | Top-1 | Top-K Hit | Latency (ms/word) |
|---|---|---|---|
| Greedy Decoding | 44.44% | 44.44% | 21.92 |
| K=3 | 70.06% | 78.72% | 66.03 |
| K=5 | 73.60% | 83.41% | 94.15 |
| K=7 | 74.87% | 85.60% | 122.10 |
| K=10 | 75.85% | 87.37% | 167.21 |
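The beam widths in Table 2 refer to the number of partial hypotheses kept at each decoding step. The following is a generic beam search sketch, not the repository's decoder: `step_fn` is a hypothetical callback that returns next-token probabilities for a given prefix, standing in for one forward pass of the model, and `toy_step` is a made-up distribution used only to make the example runnable.

```python
import math

def beam_search(step_fn, beam_width=3, max_length=32, eos="</s>"):
    """Return up to beam_width decoded strings, best first."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq).items():
                if prob > 0:
                    candidates.append((seq + (token,), score + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq[:-1], score))  # drop the end marker
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length
    finished.sort(key=lambda item: item[1], reverse=True)
    return ["".join(seq) for seq, _ in finished[:beam_width]]

def toy_step(prefix):
    """Hypothetical next-token distribution for demonstration only."""
    if len(prefix) == 0:
        return {"a": 0.6, "b": 0.4}
    if len(prefix) == 1:
        return {"x": 0.5, "y": 0.3, "</s>": 0.2}
    return {"</s>": 1.0}

top3 = beam_search(toy_step, beam_width=3)
greedy = beam_search(toy_step, beam_width=1)
```

Each step expands every surviving hypothesis, so the work per word grows with the beam width, which is consistent with the latency trend in Table 2.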
Table 3 | Qualitative comparison across all models on romanized Khmer inputs ranging from short to long words.

```bash
git clone https://github.com/NDarayut/english-khmer-transliteration.git
cd english-khmer-transliteration
pip install -r requirements.txt
```

```python
from inference import transliterate_text

# Return the single best transliteration for a romanized word
print(transliterate_text(eng_input="brodae", beam_width=3, max_length=32))
# Expected Result: 'ប្រដែ'
```

```python
from inference import transliterate_top_n

# Return the top n candidates from beam search
print(transliterate_top_n("brodae", beam_width=5, max_length=32, n=3))
# Expected Result: ['ប្រដែ', 'បរដែ', 'ប្រតែ']
```

```python
from inference import transliterate_with_dict

# Return candidates after dictionary validation (edit distance up to max_distance)
print(transliterate_with_dict("brodae", beam_width=5, max_length=32, n=3, max_distance=2))
# Top Candidates from model: ['ប្រដែ', 'បរដែ', 'ប្រតែ']
# Valid Candidates after filtering: ['ប្រដែ', 'រដែ', 'ប្រែ']
```

To demonstrate the functionality of this transliteration system, a simple Flask web application has been created.
To run the application:

```bash
python app.py
```

Open your web browser and navigate to: http://127.0.0.1:5000/
Chhunneng. (2023). Khmer Text Transliteration Dataset. GitHub repository.
Available at: https://github.com/Chhunneng/khmer-text-transliteration






