"Language Discrimination" in the AI syllable: Ask Claude, token consumes more than three times more than English

April 30: Yesterday, AI researcher Aran Komatsuzaki published a cross-comparison of the tokenizers used by mainstream large models. The results show that tokenizers exhibit "language discrimination":

When using the same model, non-English users actually consume far more tokens than English users, which amounts to a hidden "non-English tax".

He translated Rich Sutton's well-known essay The Bitter Lesson into nine languages and fed the translations to the tokenizers of six models. The token count of the English original under OpenAI's tokenizer served as the 1x baseline against which the token consumption of each language on each model was measured.
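The measurement described above comes down to a ratio: tokens for a translation under a given tokenizer, divided by tokens for the English original under the baseline tokenizer. The following is a minimal sketch of that calculation, not Komatsuzaki's actual code. The function names are hypothetical, and `utf8_byte_count` is only a toy stand-in that counts UTF-8 bytes; a real comparison would plug in each vendor's own tokenizer.

```python
from typing import Callable, Dict

# A "tokenizer" here is any function that returns a token count for a text.
Tokenizer = Callable[[str], int]


def utf8_byte_count(text: str) -> int:
    """Toy stand-in (assumption): one 'token' per UTF-8 byte.

    This is NOT a real model tokenizer; it only makes the sketch runnable.
    """
    return len(text.encode("utf-8"))


def token_ratios(
    english: str,
    translations: Dict[str, str],
    baseline: Tokenizer,
    candidate: Tokenizer,
) -> Dict[str, float]:
    """Return each translation's token count relative to the English baseline."""
    base = baseline(english)
    if base == 0:
        raise ValueError("baseline text produced zero tokens")
    return {lang: candidate(text) / base for lang, text in translations.items()}


# Illustrative strings only, not data from the study.
samples = {"en": "hello", "zh": "你好"}
ratios = token_ratios("hello", samples, utf8_byte_count, utf8_byte_count)
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.2f}x")
```

A ratio of 1.0 means the text costs the same as the English baseline; anything above 1.0 is the "non-English tax" for that language and tokenizer.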

The results show that for the same content in Chinese, Claude consumes 1.71 times the baseline number of tokens, while OpenAI consumes only 1.15 times. The effect is more pronounced on Claude for Hindi, which consumes 3.24 times the baseline, and for Arabic, at 2.86 times.

Among the six models compared, Anthropic has the highest "non-English tax", followed by Kimi; Gemini and Qwen have the lowest. Komatsuzaki put it bluntly, saying he honestly had not expected Claude to fare this poorly, or the gap to be so wide, and that he was sure enterprise customers care a great deal about these issues.

Komatsuzaki noted that tokenization efficiency depends on each language's share of the model's training data: English data is plentiful, so English text is compressed efficiently; non-English data is scarcer, so the same text is split into more pieces.

For users, higher token consumption means that API call costs rise directly, the wait before the model responds is longer, and the context window runs out faster. His conclusion: the bigger a language's presence, the more economical its tokens.
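To make the cost and context-window effects concrete, here is a rough sketch that applies the multipliers reported above. The multipliers come from the article; the price, the baseline token count, and the context window size are hypothetical values chosen only for illustration, and the helper names are not from any library.

```python
# Multipliers reported in the article, relative to the English text
# under OpenAI's tokenizer (1x baseline).
REPORTED_MULTIPLIERS = {
    ("Claude", "Chinese"): 1.71,
    ("OpenAI", "Chinese"): 1.15,
    ("Claude", "Hindi"): 3.24,
    ("Claude", "Arabic"): 2.86,
}


def scaled_tokens(baseline_tokens: int, multiplier: float) -> int:
    """Tokens consumed for the same content at the given multiplier."""
    return round(baseline_tokens * multiplier)


def cost(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


def effective_window(context_window: int, multiplier: float) -> int:
    """How much baseline-equivalent content fits in the window."""
    return int(context_window / multiplier)


# Hypothetical inputs (assumptions, not figures from the article).
BASELINE_TOKENS = 1_000      # content that costs 1,000 tokens at 1x
PRICE_PER_MILLION = 1.0      # placeholder price, same for every row
CONTEXT_WINDOW = 100_000     # placeholder window size

for (model, language), m in REPORTED_MULTIPLIERS.items():
    tokens = scaled_tokens(BASELINE_TOKENS, m)
    print(
        f"{model} / {language}: {tokens} tokens, "
        f"cost {cost(tokens, PRICE_PER_MILLION):.6f}, "
        f"window holds ~{effective_window(CONTEXT_WINDOW, m)} baseline tokens of content"
    )
```

Because real prices differ between vendors, the cost column only shows how spending scales with the multiplier at a fixed price; it is not a cross-vendor price comparison.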
