GitHub - netra-ai-lab/Romanized-Khmer-to-Khmer-Transliteration: An English–Khmer transliteration system built on an Attention-Based Bidirectional GRU architecture.

Paper | Dataset Download | Model Download

AkaraAlpha: An Efficient Romanized Khmer-Khmer Script Transliteration Model

Performance benchmark and system optimization. (a) Comparison of standalone model architectures based on Character Error Rate (CER). (b) Trade-off analysis between Top-1 accuracy and inference latency across varying beam widths (k). The red dotted line denotes the 100ms real-time latency threshold, identifying k=5 as the optimal configuration for practical deployment.

Overview

AkaraAlpha is an English–Khmer transliteration system built on an Attention-Based Bidirectional GRU architecture. The model automatically converts romanized Khmer text written in the Latin alphabet into its corresponding Khmer script form. To improve accuracy and ensure linguistic validity, the system adds a Khmer dictionary-based post-processing step that validates and corrects the model's output.

Dataset

The dataset used in this project was sourced from the Khmer Text Transliteration Dataset by Chhunneng (2023), which provides parallel pairs of English–Khmer transliterations for machine learning research.

There are 77 unique Khmer characters and 26 unique English characters in the dataset. The maximum sequence length for English (romanized Khmer) inputs is 25 characters, while the maximum Khmer output length is 24 characters.

Figure 1 | Word length distribution of romanized Khmer and Khmer script

Total Samples: 28,569
Train Set: 22,855 (80%)
Validation Set: 5,714 (20%)
Format: Parallel text pairs (brodae: ប្រដែ)
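
The split sizes above follow directly from an 80/20 cut of 28,569 pairs. The sketch below illustrates this with placeholder data; it is not the repository's preprocessing code, and the function name `split_pairs`, the fixed seed, and the shuffle-then-slice strategy are assumptions made for illustration.

```python
import random

def split_pairs(pairs, train_fraction=0.8, seed=42):
    """Shuffle (romanized, khmer) pairs and split them into train/validation lists."""
    shuffled = list(pairs)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder pairs with the same count as the dataset described above.
dummy_pairs = [(f"word{i}", f"khmer{i}") for i in range(28_569)]
train, val = split_pairs(dummy_pairs)
print(len(train), len(val))  # 22855 5714
```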

Model Architecture

The model is based on an Attention-Based Bidirectional GRU architecture designed for sequence-to-sequence transliteration. It follows an encoder–decoder structure: the encoder processes the Latin-script (romanized Khmer) input sequence, and the decoder generates the corresponding Khmer script sequence character by character.

Figure 2 | The proposed Attention-based Bidirectional Gated Recurrent Unit architecture.

Encoder

The encoder uses a Bidirectional GRU layer that reads the input sequence in both the forward and backward directions. This lets the model capture dependencies across the entire input, which is particularly useful for transliteration, where the correct output for a character often depends on both the preceding and the following characters.

Each input token is first mapped into a continuous vector space through an embedding layer, which converts discrete character indices into dense embeddings of dimension 32. The bidirectional GRU then encodes these embeddings into a hidden state representation that encapsulates forward and backward context information.

Decoder

The decoder consists of a GRU layer that processes the output sequence one Khmer character at a time. At each decoding step, it receives the previously predicted token and the projected hidden state from the encoder. The attention mechanism then combines the encoder’s output representations with the decoder’s current hidden state to generate a context vector, which helps the model focus on the most relevant parts of the input sequence. This context vector is concatenated with the decoder’s GRU output and passed through a dense softmax layer to produce the final character prediction.

Embedding Dimension: 32
GRU Units: 64
Attention Mechanism: Additive Attention
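
To make the attention step concrete, here is a small pure-Python sketch of additive attention: each encoder output is scored against the current decoder state, the scores are normalized with a softmax, and the context vector is the weighted sum of encoder outputs. The exact parameterization (matrices `W_h` and `W_s`, vector `v`) and the tiny dimensions are assumptions for illustration, not the weights or layer code used in this repository.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_attention(enc_outputs, dec_state, W_h, W_s, v):
    """Return (attention weights, context vector) for one decoding step."""
    projected_state = matvec(W_s, dec_state)
    scores = []
    for h in enc_outputs:
        # score = v . tanh(W_h h + W_s s)
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_h, h), projected_state)]
        scores.append(sum(vi * xi for vi, xi in zip(v, hidden)))
    # Numerically stable softmax over the input positions.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder outputs.
    dim = len(enc_outputs[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_outputs)) for d in range(dim)]
    return weights, context

# Toy example: 3 input positions, 2-dimensional encoder outputs and decoder state.
W_h = [[0.5, -0.2], [0.1, 0.3]]
W_s = [[0.4, 0.0], [0.0, 0.4]]
v = [1.0, -1.0]
weights, context = additive_attention(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.2, 0.1], W_h, W_s, v
)
print(weights, context)
```

The weights always sum to 1, so the context vector stays within the range spanned by the encoder outputs.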

Training Configuration

Batch Size: 64
Epochs: 50
Validation Split: 20%
Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Learning Rate Scheduler: ReduceLROnPlateau (factor=0.5, patience=3)
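
The effect of the scheduler settings (factor=0.5, patience=3) can be illustrated with a simplified simulation: the learning rate is halved whenever the validation loss has not improved for three consecutive epochs. This sketch mimics the behavior in plain Python and omits details such as cooldown and minimum-delta thresholds; the starting learning rate of 1e-3 is an assumption, since the value used for training is not stated here.

```python
def simulate_reduce_on_plateau(val_losses, lr=1e-3, factor=0.5, patience=3):
    """Return the learning rate in effect after each epoch's validation check."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # plateau detected: scale the learning rate down
                wait = 0
        history.append(lr)
    return history

# Loss improves twice, stalls for three epochs, then improves again.
print(simulate_reduce_on_plateau([1.0, 0.9, 0.95, 0.96, 0.97, 0.8]))
# [0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005]
```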

Post-Processing Technique

To maximize accuracy without adding model bloat, our post-processing pipeline ensures all outputs are valid Khmer words. The model first generates three candidate transliterations using Beam Search (k=3). These are cross-referenced with a standard dictionary; if no exact match exists, a Levenshtein distance fallback corrects the prediction to the closest valid word (up to a 2-character edit limit).
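
For readers unfamiliar with beam search, the following sketch shows the general idea on a toy model: at each step every partial sequence is extended by every possible next token, and only the `beam_width` highest-scoring hypotheses are kept. The `toy_step` function and its probability table are hypothetical stand-ins for the decoder; this is not the code in `inference.py`.

```python
import math

def beam_search(step_fn, beam_width=3, max_length=32, eos="<eos>"):
    """Return up to beam_width decoded strings, best first.

    step_fn(prefix) must return a dict mapping next tokens to log-probabilities.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        candidates = [
            (prefix + (token,), score + logp)
            for prefix, score in beams
            for token, logp in step_fn(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq[:-1], score))  # drop the end marker
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses cut off at max_length
    finished.sort(key=lambda c: c[1], reverse=True)
    return ["".join(seq) for seq, _ in finished[:beam_width]]

# Hypothetical next-token distributions standing in for the trained decoder.
TOY_MODEL = {
    (): {"a": math.log(0.6), "b": math.log(0.4)},
    ("a",): {"<eos>": math.log(0.7), "x": math.log(0.3)},
    ("b",): {"<eos>": math.log(0.9), "y": math.log(0.1)},
}

def toy_step(prefix):
    # Any prefix not in the table ends immediately with probability 1.
    return TOY_MODEL.get(prefix, {"<eos>": 0.0})

print(beam_search(toy_step, beam_width=3))  # ['a', 'b', 'ax']
```

With `beam_width=1` the same function reduces to greedy decoding and returns only `['a']`.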

Figure 3 | End-to-end inference pipeline with post-processing. The input flows through the encoder-decoder model, which generates three predictions using beam search (k=3). A lexical validation layer then cross-references these outputs with a Khmer dictionary, and Levenshtein distance is used to recover the closest orthographically valid entries, so that all final words are valid.
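
The dictionary step can be sketched as follows. Each candidate is kept if it is an exact dictionary entry; otherwise it is replaced by the closest entry within the edit limit, or dropped if none is close enough. Whether the actual pipeline corrects every candidate or only falls back when no candidate matches is not specified here, so the per-candidate behavior, the function names, and the Latin placeholder words (standing in for Khmer entries) are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def correct_candidates(candidates, dictionary, max_distance=2):
    """Map beam candidates to valid dictionary words, preserving rank order."""
    words = sorted(dictionary)  # sorted so ties are broken deterministically
    corrected = []
    for cand in candidates:
        if cand in dictionary:
            match = cand
        else:
            closest = min(words, key=lambda w: levenshtein(cand, w))
            match = closest if levenshtein(cand, closest) <= max_distance else None
        if match is not None and match not in corrected:
            corrected.append(match)
    return corrected

# Placeholder dictionary and candidates (Latin strings standing in for Khmer words).
toy_dictionary = {"brodae", "khmer", "srolanh"}
print(correct_candidates(["khmer", "brodea", "xyzxyz"], toy_dictionary))
# ['khmer', 'brodae']
```

A linear scan over the dictionary is fine for a sketch; a large word list would likely call for an index such as a BK-tree or a length-based prefilter.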

Results and Analysis

1. Evaluation Metrics - CER

To evaluate the performance of the model, Character Error Rate (CER) was used. CER quantifies the number of errors at the character level, providing a direct measure of the model's accuracy in converting individual Romanized graphemes to their corresponding Khmer script. It is calculated as:

$$CER = \frac{S + D + I}{N}$$

where $S$ represents the number of substitutions, $D$ is the number of deletions, $I$ is the number of insertions, and $N$ is the total number of characters in the ground truth sequence. A lower CER indicates higher accuracy in the transliteration output.
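
As an illustration, $S + D + I$ is the Levenshtein distance between the prediction and the ground truth, so CER can be computed as below. This sketch counts Unicode code points as characters and pools errors over the whole corpus before dividing; both choices are assumptions, since the evaluation script may count or average differently.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return edits / total

print(cer(["kitten"], ["sitting"]))  # 0.5
print(cer(["ab"], ["abcde"]))        # 1.5
```

The second call shows that CER can exceed 1 (100%) when a prediction contains many more characters than the ground truth, because every extra character counts as an insertion.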

2. Evaluation Metrics – Top-1 and Top-k Hit Accuracy

To evaluate the post-processing technique, we also assess word-level accuracy by measuring Top-1 and Top-k hit accuracy. We define Top-1 accuracy as the percentage of test samples for which the highest-ranked candidate produced by the system exactly matches the ground truth:

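$$\text{Top-1} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}(\hat{y}_{i,1} = y_i)$$

where $M$ is the total number of test samples, $y_i$ is the ground truth, $\hat{y}_{i,1}$ is the top-ranked candidate produced by the system, and $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when its argument is true and 0 otherwise.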

Top-k Hit accuracy measures the frequency with which the correct Khmer word appears within the list of $k$ candidates generated by the system:

$$\text{Top-}k = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left(y_i \in \{\hat{y}_{i,1}, \hat{y}_{i,2}, \hat{y}_{i,3}, \dots, \hat{y}_{i,k}\}\right)$$
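
Both metrics reduce to a membership check over ranked candidate lists, as in this minimal sketch. The function names and the toy data are illustrative assumptions rather than the project's evaluation code.

```python
def top_k_hit(references, candidate_lists, k):
    """Fraction of samples whose ground truth appears among the first k candidates."""
    hits = sum(ref in cands[:k] for ref, cands in zip(references, candidate_lists))
    return hits / len(references)

def top_1(references, candidate_lists):
    """Fraction of samples whose top-ranked candidate equals the ground truth."""
    return top_k_hit(references, candidate_lists, k=1)

# Toy data: four samples, each with three ranked candidates.
refs = ["a", "b", "c", "d"]
cands = [["a", "x", "y"], ["x", "b", "y"], ["x", "y", "z"], ["d", "x", "y"]]
print(top_1(refs, cands))         # 0.5
print(top_k_hit(refs, cands, 3))  # 0.75
```

Top-k hit accuracy can never be lower than Top-1, since the first candidate is always part of the top-k list.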

3. Quantitative Analysis

Figure 4 | Analytical trade-off between transliteration accuracy and computational overhead. (a) Search space benefit analysis illustrating the gap between the primary prediction (Top-1) and system recall (Top-k). The shaded region represents the additive benefit of beam search; at the recommended k=5 setting, the system provides a 9.81 percentage-point (pp) recall gain over the Top-1 prediction. (b) The computational efficiency frontier, highlighting the relationship between accuracy gains and inference latency. The gray bars represent the absolute magnitude of computational overhead, while the red dotted line tracks the linear latency trajectory across different beam widths. The optimal operating point is k=5, which achieves a 29.16 pp absolute increase in Top-1 accuracy over greedy decoding (73.60% vs. 44.44%) while remaining under the 100 ms real-time latency threshold (94.15 ms/word); beyond this point, returns diminish and latency exceeds the threshold for interactive use.

Table 1 | Model comparison on the validation set using Character Error Rate (CER, %). Lower CER indicates better performance. The lowest CER (AkaraAlpha) is shown in bold and the second lowest (Transformer) in italics.

Model	CER (%)	Parameters
RNN	102.41	42,609
LSTM	31.78	58,417
GRU	51.64	46,385
Transformer	17.78	248,337
Attention BiLSTM	18.26	96,753
AkaraAlpha	15.07	78,705

Table 2 | Performance comparison of system configurations. A comparative analysis of Top-1 accuracy, Top-k hit rate, and average inference latency across greedy decoding and varying beam search widths (k).

Configuration	Top-1	Top-K Hit	Latency (ms/word)
Greedy Decoding	44.44%	44.44%	21.92
K=3	70.06%	78.72%	66.03
K=5	73.60%	83.41%	94.15
K=7	74.87%	85.60%	122.10
K=10	75.85%	87.37%	167.21
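
The choice of k=5 can be reproduced from the table: it is the most accurate configuration whose latency stays under the 100 ms threshold. The short sketch below performs that selection and recomputes the reported gains; the selection rule itself (maximize Top-1 subject to the latency budget) is our reading of the analysis, stated here as an assumption.

```python
# (configuration, Top-1 %, Top-K hit %, latency in ms/word), copied from Table 2.
results = [
    ("Greedy", 44.44, 44.44, 21.92),
    ("K=3", 70.06, 78.72, 66.03),
    ("K=5", 73.60, 83.41, 94.15),
    ("K=7", 74.87, 85.60, 122.10),
    ("K=10", 75.85, 87.37, 167.21),
]

LATENCY_BUDGET_MS = 100.0

# Keep only configurations within the budget, then pick the best Top-1 accuracy.
within_budget = [r for r in results if r[3] <= LATENCY_BUDGET_MS]
best = max(within_budget, key=lambda r: r[1])
greedy = results[0]

print(best[0])                         # K=5
print(round(best[1] - greedy[1], 2))   # 29.16 (Top-1 gain over greedy, in pp)
print(round(best[2] - best[1], 2))     # 9.81 (Top-K recall gain over Top-1, in pp)
```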

4. Qualitative Analysis

Table 3 | Qualitative comparison of all models on instances of romanized Khmer ranging from short to long words.

Installation

git clone https://github.com/NDarayut/english-khmer-transliteration.git
cd english-khmer-transliteration
pip install -r requirements.txt

Usage

1. Generate a single transliteration

from inference import transliterate_text

print(transliterate_text(eng_input="brodae", beam_width=3, max_length=32))
# Expected Result: 'ប្រដែ'

2. Generate multiple transliterations

from inference import transliterate_top_n

print(transliterate_top_n("brodae", beam_width=5, max_length=32, n=3))
# Expected Result: ['ប្រដែ', 'បរដែ', 'ប្រតែ']

3. Generate multiple transliterations with correction

from inference import transliterate_with_dict

print(transliterate_with_dict("brodae", beam_width=5, max_length=32, n=3, max_distance=2))
# Top Candidates from model: ['ប្រដែ', 'បរដែ', 'ប្រតែ']
# Valid Candidates after filtering: ['ប្រដែ', 'រដែ', 'ប្រែ']

Web Application

To demonstrate the functionality of this transliteration system, a simple Flask web application has been created.

To run the application:

python app.py

Open your web browser and navigate to: http://127.0.0.1:5000/

Demo

Citation

Chhunneng. (2023). Khmer Text Transliteration Dataset. GitHub repository.
Available at: https://github.com/Chhunneng/khmer-text-transliteration

