Experimental Preview: This leaderboard is currently in preview mode and the results may change as we refine our evaluation methodology.

Literary Translation Evaluation and Rating Ensemble

LiTERatE is a benchmark designed to evaluate machine translation systems on literary text in Chinese, Japanese, and Korean. Unlike traditional MT benchmarks, LiTERatE focuses on the challenges unique to literary translation, which is inherently creative and nuanced.

Our evaluation uses chunks of 200-500 CJK characters as the basic unit, providing all systems with terminology glossaries and contextual information. An ensemble of LLM judges compares each machine translation head-to-head against a human translation, reaching 82% agreement with decisive human judgments.
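
To make the protocol concrete, here is a minimal sketch of the head-to-head judging step, assuming a simple majority vote across the ensemble. The `Judge` type, the `ensemble_verdict` function, and the vote-aggregation rule are illustrative assumptions, not the leaderboard's actual implementation.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge takes the source chunk, glossary, context, and two candidate
# translations, and returns "A", "B", or "tie". In practice each judge
# would wrap an LLM API call; here it is left abstract.
Judge = Callable[[str, dict, str, str, str], str]

def ensemble_verdict(
    judges: Sequence[Judge],
    source_chunk: str,       # 200-500 CJK characters
    glossary: dict,          # terminology glossary shared by all systems
    context: str,            # surrounding passage / story context
    machine_translation: str,
    human_translation: str,
) -> str:
    """Head-to-head comparison: each judge votes, the majority decides.

    Hypothetical sketch; the real prompt design, vote weighting, and
    tie handling may differ.
    """
    votes = Counter(
        judge(source_chunk, glossary, context,
              machine_translation, human_translation)
        for judge in judges
    )
    verdict, _ = votes.most_common(1)[0]
    return verdict  # "A" (machine wins), "B" (human wins), or "tie"
```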

The scores below are each system's win rate against human translators, expressed as a percentage. A score of 50% indicates parity with human translation quality, while higher scores suggest the system outperforms the human baseline.
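
As an illustration of how such a score could be computed from the per-chunk verdicts, the sketch below assumes ties count as half a win, which is what makes 50% the parity point; the leaderboard's actual tie handling is not specified here.

```python
def win_rate(outcomes: list[str]) -> float:
    """Score a system from its per-chunk verdicts against human translations.

    Assumption: ties count as half a win, so equal wins and losses
    yield 50.0. The actual leaderboard may weight ties differently.
    """
    wins = sum(o == "win" for o in outcomes)
    ties = sum(o == "tie" for o in outcomes)
    return 100.0 * (wins + 0.5 * ties) / len(outcomes)

# Example: 40 wins, 20 ties, 40 losses over 100 chunks -> 50.0 (parity)
print(win_rate(["win"] * 40 + ["tie"] * 20 + ["loss"] * 40))
```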

Learn more about our evaluation methodology.

| Rank | Model | Win Rate |
|------|-------|----------|
| 1 | Deepseek R1 | 87.3% |
| 2 | Omni Qi | 67.6% |
| 3 | o3-mini | 62.7% |
| 4 | GPT-4o | 61.0% |
| 5 | Deepseek V3 | 57.7% |
| 6 | Claude 3.7 Sonnet | 54.0% |
| 7 | Claude 3.5 Sonnet | 52.0% |
| 8 | Gemini 1.5 Pro | 50.7% |
| 9 | Qwen Max | 49.3% |
| 10 | Qwen Plus | 49.3% |
| 11 | Gemini 2.0 Flash | 43.0% |
| 12 | Mistral Large | 40.0% |
| 13 | Gemini Flash 1.5 8B | 38.3% |
| 14 | GPT-4o-mini | 35.3% |
| 15 | Phi-4 | 33.0% |
| 16 | Llama 3.3 70B | 32.7% |
| 17 | Gemini 2.0 Flash Lite | 31.3% |
| 18 | Claude 3.5 Haiku | 30.7% |
| 19 | Mistral Small 3 | 27.7% |
| 20 | Qwen Turbo | 27.3% |
| 21 | Google Translate (NMT) | 6.7% |