Analyzing the Tatoeba dataset

Tatoeba is a website that crowdsources sentences translated into several languages, a resource that is very useful to language learners and people interested in NLP.

I am a contributor to and a user of Tatoeba, where I mostly translate sentences into Italian. In 2020 Tatoeba organized an event called Kodoeba, in which I participated with an automated cloze deletion tool.

In this article I’m going to analyze the Tatoeba dataset and build some charts from it.

Method

This analysis uses DuckDB as the engine to transform the dataset into Parquet files and query them. A Python script downloads fresh datasets if needed, generates the Parquet files and then renders a Jinja template to produce the whole article as a Markdown file.

For charts, the script generates a JSON file and stores it separately. In the article there’s a placeholder referencing this JSON file. A small piece of JavaScript finds the elements with this placeholder and loads the corresponding charts, allowing a mixture of Markdown and fancy charts.

This article is the result of this tool, and the code is here.

I also used NiceGUI to speed up the process of iterating over options and seeing the results quickly.

What is inside Tatoeba

Tatoeba has, at the time of writing (February 2025), 12556682 sentences in 421 languages, either added as original sentences or as translations of existing ones.

Growth of the data over time

How was this amount reached? Let’s see the growth over time:

The dataset keeps growing at a rate of about 1 million sentences per year (almost two sentences per minute).

Looking at the new sentences added over time, we can notice peaks in November 2018 and November 2020. The latter was in fact the month with the most sentences added in the life of the project, with 193K new sentences in a single month. This seems to be due to a large number of Kabyle sentences added in that month. I could not find posts or articles explaining what happened, but since they seem to come from multiple users (45 different users contributed sentences in Kabyle in that month), it is probably due to some initiative started by a group of Kabyle speakers.

Original sentences vs translations

A sentence can be added to Tatoeba either as an original sentence in a given language, or as a translation of an existing sentence in another language.

Currently, of the 12556682 sentences, 10728611 (85%) are translations.

As a contributor I had a feeling this was the case, but now we can see it in numbers: the majority of contributions are translations of existing sentences rather than brand new ones. Given the number of languages Tatoeba covers this is not too surprising, but there could have been large amounts of “hidden” untranslated entries, particularly in languages I would not see in my usual English -> Italian workflow. This is not the case.

As a contributor I often notice a similarity between sentences in English (often mentioning “Mary” and “Tom”), so I assumed that many original sentences were added all together at some point in the past and are being gradually translated. If that were the case, the number of translations added per month should follow a decaying exponential curve, as the most common languages get “saturated” and translations slow down. We already saw this is not what’s happening, and we can better understand the relationship between original sentences and translations by observing the same value over time:

We can now see that there were no “bulk insertions” of sentences except the big Kabyle one in 2020. Instead, sentences are added continuously and “percolate” into translations, month after month, keeping the share of translations constant at around 85%.

Since the processes of adding and translating sentences operate at the same time, one may wonder how long a new sentence usually waits before receiving at least one translation.

Retrieving this is quite tricky: Tatoeba not only has original sentences and translated ones, but also “linked” sentences. In essence, a link is created between two sentences A and B in a few cases:

a user explicitly added B as a translation of A
a sentence A is a translation of B and a sentence C is identical (same text) to A. In this case C is linked to B
an “advanced” contributor links A to B, telling the system that the two have the same meaning

The meaning of linking is a bit confusing, and for many sentences the website itself reports:

We cannot determine yet whether this sentence was initially derived from translation or not.

This is what I reconstructed empirically from the links table:

a link is bidirectional, even though the columns are named from and to
in general, for every from-to pair there are two entries, representing the link in both directions
since a link can be created manually after the sentences, the order in which sentences are linked does not reflect the order in which they were created
it’s possible for a sentence to be added without the user even knowing it’s a translation of an existing one, and to be linked some time later
since it’s possible to have indirect translations, and also to translate the same sentence multiple times into the same language, a cluster of links can contain the same sentence (or rather, the same meaning) multiple times in the same language, with different wordings

It may feel strange to have a language present multiple times in a group of linked sentences, but it makes sense even when the translations are added directly: for example, I sometimes give multiple Italian translations of the same English sentence to cover different genders or politeness levels, grammatical features that are not present in English.

From these facts it’s clear that links form “clusters” of sentences connected to each other. If we treat links as symmetric, reflexive and transitive, being linked is an equivalence relation, and each cluster represents a “meaning” across different languages and different wordings in the same language.

There are 1631744 of these clusters, with an average of 7.7 sentences each.
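As a rough illustration of how such clusters can be computed, here is a minimal sketch using a union-find structure over the link pairs. It is not the code used for this article: the function name `build_clusters` and the plain list of `(from, to)` pairs are assumptions made for the example. Each cluster is keyed by its lowest sentence id.

```python
from collections import defaultdict


def build_clusters(links):
    """Group sentence ids into connected components of the link graph.

    `links` is an iterable of (from_id, to_id) pairs (hypothetical input
    shape). Direction and duplicated reversed pairs are ignored, since
    links are bidirectional.
    """
    parent = {}

    def find(x):
        # Walk up to the root, then compress the path just visited.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lowest id as the representative of the merged cluster.
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb

    clusters = defaultdict(set)
    for sentence_id in list(parent):
        clusters[find(sentence_id)].add(sentence_id)
    return dict(clusters)


# Toy data, not real Tatoeba ids.
toy_links = [(1, 2), (2, 1), (2, 3), (10, 11)]
toy_clusters = build_clusters(toy_links)
```

On the toy links above this returns `{1: {1, 2, 3}, 10: {10, 11}}`: sentence 3 ends up with sentence 1 even though the two are never linked directly. Sentences without any link never appear in the input, so they would have to be counted separately as singleton clusters.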

The largest clusters currently are the ones with 59, 1157 and 9336 as the lowest sentence id. The first two mean “this is not important” and the third is “everything is OK”. They contain a whopping 10K sentences each. The first two really express the same meaning, but a link from 59 to 1157 is not present, so my script does not detect them as a single cluster.

Overall, a sentence gets its first link on average 275.27 days after its creation. The median time is 12.01 days.

Contributors

Contributions are made by users, and a good indicator of the project’s health is the number of active users, that is, users who added at least one sentence over a given period of time.

We can see how many users added at least one sentence in each year:

It looks like the number of active users peaked in 2010-2016 and is now going down, although not dramatically. The volume of contributions is, as we have seen, quite constant, suggesting that the top contributors remain active and account for the majority of the volume.
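Counting active users per year can be sketched as follows. This is only an illustration in plain Python, not the DuckDB query behind the article: the function name `active_users_per_year` and the `(username, date_added)` row shape are assumptions, and rows with a missing user or date are assumed to have been filtered out beforehand.

```python
from collections import defaultdict
from datetime import date


def active_users_per_year(rows):
    """Count distinct contributors for each year.

    `rows` is an iterable of (username, date_added) pairs, one per
    sentence (hypothetical input shape).
    """
    users_by_year = defaultdict(set)
    for username, added in rows:
        # A set per year means each user counts once, however many
        # sentences they added that year.
        users_by_year[added.year].add(username)
    return {year: len(users) for year, users in sorted(users_by_year.items())}


# Toy data, not real Tatoeba users.
toy_rows = [
    ("alice", date(2019, 3, 1)),
    ("alice", date(2019, 7, 9)),
    ("bob", date(2019, 12, 31)),
    ("alice", date(2020, 1, 1)),
]
toy_counts = active_users_per_year(toy_rows)
```

On the toy rows, 2019 has two active users and 2020 has one, regardless of how many sentences each of them added.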

I do not know why the number of active contributors is going down. I can only make three guesses:

The project is not a novelty anymore (it’s almost 20 years old), and language enthusiasts are usually already aware of its existence.
The presence of large numbers of sentences for the top languages may give the impression that the value of adding new sentences is low (diminishing returns). For a language like English one can assume that every word and grammar concept has been covered and there’s little to add.
The advent of AI products and the spammy/unethical behavior of many companies behind them caused a considerable backlash in the “open source” space, so people may feel less inclined to contribute to open datasets that can be used as training material. It has to be noted, however, that the decline in contributors started before that, and projects like Wikipedia do not show such a pattern.

When users contribute to a project like this, contributions are generally uneven. A few contributors write a large number of sentences over time, while a large number of contributors add very few. This is quite normal, and it happens in projects like Wikipedia or OpenStreetMap as well.

Equality of contributions per user: Gini coefficient

How uneven are contributions across users? We can apply a common instrument called the Gini coefficient. This is a coefficient used to measure inequality, commonly applied in economics. A coefficient of 1 indicates maximum inequality (in this case, a single user made all the contributions), while 0 indicates perfect equality (all users contributed the same amount).

To give a reference, Wikipedia has a Gini coefficient (measured over articles written per user) between 0.92 and 0.96 across the top languages.

The same value for Tatoeba is 0.972. Higher, but not dramatically so.
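For readers who want to reproduce this kind of measure, here is a minimal sketch of a Gini coefficient computed from per-user sentence counts. The function name `gini` and the input list are assumptions made for the example, and it uses the standard rank-based formula, which may differ in small details from the exact computation behind the figure above.

```python
def gini(counts):
    """Gini coefficient of a list of non-negative contribution counts."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # With values sorted ascending and ranked 1..n:
    #   G = 2 * sum(rank * value) / (n * total) - (n + 1) / n
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy data: four users who contributed equally, and four users where
# one of them wrote everything.
equal = gini([5, 5, 5, 5])
one_user = gini([0, 0, 0, 10])
```

The equal case gives 0, while the single-contributor case gives 0.75 rather than 1: with n users the maximum of this formula is 1 - 1/n, which approaches 1 only as the number of users grows.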

Sentence length

How long is a sentence, usually? Considering only the languages with more than 10K sentences, this is the distribution:

Unsurprisingly, languages written in syllabic scripts (hangul and hiragana/katakana for Korean and Japanese respectively) and ideograms (Chinese variants and Japanese) tend to use far fewer characters per sentence.

Split across languages

Lastly, this is the number of sentences per language over time: