In this article I’ll explain how I populated the database that powers grammar-quiz, a grammar quiz app that I created for the Kodoeba initiative. The code of the application is freely available.

The backstory

As you may already know, Tatoeba is a database of sentences translated into different languages. At the time of writing (early 2021) the database is almost 10 million sentences strong and keeps growing. The dataset can be downloaded and used under an open license, similar to Wikipedia or OpenStreetMap, which makes it very interesting for users who, like me, have an interest in NLP and languages.

In 2020 Tatoeba started an initiative called Kodoeba to promote the development of the site itself and the tooling surrounding it; I took part with the aforementioned grammar-quiz app.

The challenge

I use flashcards every day to expand and consolidate my vocabulary, but when it comes to grammar things get convoluted. I can put grammar rules into flashcards, and I do, but when talking or writing it’s not feasible to recall the appropriate rule and apply it like software would, especially when operating in German with its many, many rules regarding cases and declensions. Our fleshy brains need context, and we need to train and get used to applying a grammar rule inside sentences.

To solve that problem, I decided to exploit the Tatoeba dataset to provide cloze deletion cards. In short, these are full sentences with one or more gaps to fill. In the case of grammar, the gap is either a word with no specific meaning but a syntactical role (like an article or a conjunction, or also some common adverbs) or a word that needs to be conjugated; for example, in German the endings of adjectives are affected by the gender, case and article type of the noun they refer to.

The Tatoeba dataset offers sentences in more than 400 languages, and I decided to try and generate these cards automatically, using NLP and simple statistical tools to produce the clozes.

The input dataset

The dataset can be downloaded as two TSV files, one enumerating the sentences and their languages and the other listing the links, that is, which sentence is a translation of which other. Additional metadata like the author and the date are provided but not used here.

Given the wide scope of the project, I was pleasantly surprised to see that languages like Sicilian, Klingon, Ido, Latin, Toki Pona, Ancient Greek and multiple dialects of Chinese are supported, just to mention a few. However, given the statistical nature of the approach, I decided to cut the list to languages with a sensible number of sentences in the database, keeping about 300 of the more than 400 available.

The size of the dataset is such that it can be loaded completely into memory and handled using Python dictionaries and sets, without any special technique.
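As a rough sketch, the loading step looks like this (I’m assuming the usual Tatoeba export names sentences.csv and links.csv, which are tab-separated despite the extension, with an id/language/text and a sentence-id/translation-id column layout):

    import csv
    from collections import defaultdict

    # Assumed Tatoeba export layout: sentences.csv has id / language / text,
    # links.csv has sentence id / translation id; both are tab-separated.
    sentences = {}                   # sentence id -> (language code, text)
    translations = defaultdict(set)  # sentence id -> ids of its translations

    with open("sentences.csv", encoding="utf-8", newline="") as f:
        for sid, lang, text in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            sentences[int(sid)] = (lang, text)

    with open("links.csv", encoding="utf-8", newline="") as f:
        for sid, tid in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            translations[int(sid)].add(int(tid))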

The basic idea

For every language, given a wide corpus like the one provided by Tatoeba, the most common words are the ones with a grammatical/syntactical role rather than a meaning of their own, words like the, not, was or is. These are called stop words in the jargon of computational linguistics and information retrieval, and are usually filtered out when indexing documents or modelling a topic, to avoid wasting resources and adding noise. What we want to do here is exactly the opposite: the user has to be asked about these words and their placement in a real sentence, not about the translation of arbitrary terms, which can be learned with traditional flashcards.

A simple way to build this filter is to process the corpus and retrieve the most frequent tokens for each given language, since they are almost always the stop words we are looking for.
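Sketching the idea on top of the sentences dictionary loaded earlier (the tokenize helper is the one described in the next section, and the rank cut-off is just a placeholder for a tuned parameter):

    from collections import Counter, defaultdict

    # Token frequencies per language over the whole corpus.
    counts = defaultdict(Counter)
    for lang, text in sentences.values():
        for token in tokenize(text, lang):         # tokenizer sketched in the next section
            if any(ch.isalpha() for ch in token):  # skip whitespace and punctuation tokens
                counts[lang][token.lower()] += 1

    # Treat the N most frequent tokens of each language as stop words;
    # the actual cut-off is one of the parameters discussed later.
    STOPWORD_RANK = 200
    stopwords = {lang: {tok for tok, _ in c.most_common(STOPWORD_RANK)}
                 for lang, c in counts.items()}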

Then, to generate the declension cards (that is, cards for which the user gets a base form and has to conjugate it), a multilingual lexical database can be used.

Tokenizing

The task of splitting a text into tokens and possibly normalizing them, a necessary step to build the frequency table, is far from trivial when one considers that languages like Chinese or Japanese make no use of spaces, and that letter case is missing or used differently across scripts.

Initially I tried to use the language metadata from Tatoeba to dispatch each sentence to the proper tokenizer, but there are quite a few of them, often with a convoluted setup, so I ended up using the ICU library. This library covers a plethora of languages with sensible fallbacks, and has the advantage of generating tokens in a non-destructive manner: by concatenating the tokens in order we can retrieve the original text, punctuation included.

As you can guess from the website, user-friendliness is not the main goal of the project, but with a Dockerfile I eventually got a reproducible environment to run this script. The Python binding is straightforward to use.
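For illustration, a minimal non-destructive tokenizer built on PyICU’s BreakIterator could look like this (note that Tatoeba’s ISO 639-3 codes may need mapping to ICU locale names, otherwise ICU falls back to its root break rules):

    import icu

    def tokenize(text, lang):
        """Split text into tokens such that ''.join(tokens) == text."""
        bi = icu.BreakIterator.createWordInstance(icu.Locale(lang))
        bi.setText(text)
        tokens, start = [], bi.first()
        for end in bi:
            tokens.append(text[start:end])
            start = end
        return tokens

    # tokenize("Das ist nicht wahr.", "de")
    # -> ['Das', ' ', 'ist', ' ', 'nicht', ' ', 'wahr', '.']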

Base forms from en.wiktionary

The stop words alone can generate a decent quiz, but it can be improved by integrating a lexical database and giving the user the base form of verbs and of whatever other parts of speech the language inflects.

To this end, en.wiktionary offers an extensive dataset that can be freely downloaded and used. The problem is that it is not machine readable per se, and even when taking advantage of MediaWiki templates there is a whole zoo of patterns to handle.

Luckily, somebody already made the effort of tackling this problem: the Wiktextract tool can crawl a Wiktionary dump, run the Lua scripts that render much of the content and generate JSON files representing the content in a machine-readable form. The process can take days of computation, but the author of the tool is also kind enough to provide the resulting dictionary JSON files split by language.
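To give an idea, a form-to-lemma table could be built from those files roughly as follows; the field names ("word", "pos", "forms") and the example file name reflect the wiktextract output as I understand it and may differ between versions:

    import json
    from collections import defaultdict

    # inflected form -> set of (lemma, part of speech) candidates
    lemma_candidates = defaultdict(set)

    # Assumed JSON Lines layout: one entry per line with "word" (the lemma),
    # "pos" and an optional "forms" list of inflected forms.
    with open("kaikki.org-dictionary-German.json", encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            lemma, pos = entry.get("word"), entry.get("pos")
            if not lemma:
                continue
            # Index the lemma under itself too, so that tokens which are
            # already in their base form are matched as well.
            lemma_candidates[lemma.lower()].add((lemma, pos))
            for form in entry.get("forms", []):
                if form.get("form"):
                    lemma_candidates[form["form"].lower()].add((lemma, pos))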

Using this data I can match the tokens to their dictionary form, which must be unambiguous, that is:

  • no more than one root form exists for a token
  • there are no multiple interpretations for a token (e.g. the word “rule” can be a noun or the verb “to rule”)
  • a token can be the base form of itself (e.g. a verb is already in the infinitive form)

By keeping only the unambiguous tokens, the clozes can be expanded and become more interesting. For languages missing Wiktionary entries, the basic cloze deletion is still available.
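On top of the table sketched above, that unambiguity check boils down to something like:

    def base_form(token):
        """Return the unambiguous base form of a token, or None."""
        candidates = lemma_candidates.get(token.lower(), set())
        if len(candidates) != 1:     # several roots or several readings: skip
            return None
        (lemma, _pos), = candidates  # may be the token itself, if already a base form
        return lemma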

Choosing a candidate for the cloze

Now that every sentence is split into tokens, labeled as stop words and annotated with their corresponding base forms, the tool has to generate the quiz questions by picking the tokens to replace with clozes. We still have to decide:

  • the distribution of the number of clozes
  • the maximum number of clozes per sentence
  • the frequency threshold to consider a token a stopword
  • a list of tokens not to be replaced

These parameters had to be found by trial and error, running the script and judging the “feel” of the cards, which need to be feasible but not trivial.

The process is made deterministic for each single sentence by using the sentence ID as a seed. This is necessary because a new sentence dataset is published every week, and an update should not alter existing cards, or it would invalidate the spaced repetition approach.
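As a sketch, using the stop-word sets and the base_form lookup from the previous snippets (the probabilities and limits below are placeholders for the hand-tuned parameters):

    import random

    MAX_CLOZES = 3            # placeholder for the tuned per-sentence maximum
    EMPTY_CLOZE_PROB = 0.05   # placeholder: sometimes add a gap whose answer is "nothing"

    def pick_clozes(sentence_id, tokens, lang):
        """Deterministically choose which token positions become gaps."""
        rng = random.Random(sentence_id)   # same sentence id -> same card, week after week
        candidates = [i for i, tok in enumerate(tokens)
                      if tok.lower() in stopwords.get(lang, set()) or base_form(tok)]
        rng.shuffle(candidates)
        chosen = sorted(candidates[:rng.randint(1, MAX_CLOZES)])
        if rng.random() < EMPTY_CLOZE_PROB:
            chosen.append(None)            # marker for an empty cloze
        return chosen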

Sometimes an empty cloze is generated; this keeps the user from “overfitting” on the fact that there is always a token to fill in, and is especially useful for consecutive tokens like the German als ob, which would otherwise become trivial.