Experimental Preview: This leaderboard is currently in preview mode and the results may change as we refine our evaluation methodology.
Literary Translation Evaluation and Rating Ensemble
LiTERatE is a benchmark designed specifically for evaluating machine translation systems on literary text in Chinese, Japanese, and Korean. Unlike traditional MT benchmarks, LiTERatE focuses on the creative and nuanced challenges unique to literary translation.
Our evaluation uses chunks of 200-500 CJK characters as the basic unit, providing terminology glossaries and contextual information to all systems. An ensemble of LLMs judges each machine translation through head-to-head comparisons against a human translation, reaching 82% agreement with decisive human judgments.
The scores below represent each system's win rate against human translators on a 0-100 scale. A score of 50 indicates parity with human translation quality, while higher scores indicate the system's output was preferred over the human translation more often than not.
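As a rough illustration of how pairwise judgments could roll up into a single score, the sketch below aggregates hypothetical per-chunk ensemble votes into a win rate. The vote format, majority rule, and tie-handling here are assumptions for illustration only, not LiTERatE's actual implementation.

```python
from collections import Counter

def chunk_win_rate(judge_votes):
    """Aggregate per-chunk ensemble votes into a win rate (0-100).

    judge_votes: one inner list per text chunk, each containing the
    judges' verdicts for that chunk: 'machine' if the MT output is
    preferred, 'human' if the human translation is preferred.
    A chunk counts as a win when a majority of judges prefer the
    machine translation; ties count as half a win (an assumption).
    """
    wins = 0.0
    for votes in judge_votes:
        tally = Counter(votes)
        if tally["machine"] > tally["human"]:
            wins += 1.0
        elif tally["machine"] == tally["human"]:
            wins += 0.5
    return 100.0 * wins / len(judge_votes)

# Hypothetical example: three chunks, three judges each.
votes = [
    ["machine", "machine", "human"],  # MT preferred on this chunk
    ["human", "human", "machine"],    # human translation preferred
    ["machine", "human", "machine"],  # MT preferred
]
print(f"Win rate: {chunk_win_rate(votes):.1f}%")  # -> 66.7%
```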
Learn more about our evaluation methodology.
| Rank | Model | Win Rate |
|---|---|---|
| 1 | Deepseek R1 | 87.3% |
| 2 | Omni Qi | 67.6% |
| 3 | o3-mini | 62.7% |
| 4 | GPT-4o | 61.0% |
| 5 | Deepseek V3 | 57.7% |
| 6 | Claude 3.7 Sonnet | 54.0% |
| 7 | Claude 3.5 Sonnet | 52.0% |
| 8 | Gemini 1.5 Pro | 50.7% |
| 9 | Qwen Max | 49.3% |
| 10 | Qwen Plus | 49.3% |
| 11 | Gemini 2.0 Flash | 43.0% |
| 12 | Mistral Large | 40.0% |
| 13 | Gemini Flash 1.5 8B | 38.3% |
| 14 | GPT-4o-mini | 35.3% |
| 15 | Phi-4 | 33.0% |
| 16 | Llama 3.3 70B | 32.7% |
| 17 | Gemini 2.0 Flash Lite | 31.3% |
| 18 | Claude 3.5 Haiku | 30.7% |
| 19 | Mistral Small 3 | 27.7% |
| 20 | Qwen Turbo | 27.3% |
| 21 | Google Translate (NMT) | 6.7% |