NLP - Jacopo Farina's blog

this is _default loayout

Recent Post

Analyzing the Tatoeba dataset

2025.02.07

Tatoeba is a website to crowdsource sentences translated in several languages, a resource that is very useful to language learners or people interested in NLP.

I am a contributor and an user of Tatoeba, where I mostly translate sentences to Italian. In 2020 Tatoeba organized an event called Kodoeba to which I participated with an automated cloze deletion tool.

In this article I’m going to analyze the Tatoeba dataset and build some charts from it.

...

Add punctuation to text using Skorch

2019.12.04

Note: The whole code for this article is freely available on GitHub

A few weeks ago I saw a talk about Skorch, a library that wraps a PyTorch neural network to use it as a Scikit-learn model.

That is amazing: I can take an existing product based on, say, a random forest, and replace only the model without refactoring anything else: the fit and predict functions have the usual interface. On the other hand, I can use the powerful tools offered by Scikit-learn, like the grid search for hyperparameters and make_pipeline to apply encoders.

...

Fleximatcher, a library to help parse natural language

2017.11.30

Note: this is an old article and while the software is still available and I think the idea is pretty you probably want to give a look to this. The software here described is probably more flexible, but harder to use.

Some months ago, I stumbled across this amazing article about transforming an arbitrary English text in a patent application. The underlying pattern library allows, among other nice things, to find patterns like “The [an adjective] [a noun] and the [a noun]” easily, look for hypernyms (“is a” relationships between expressed concepts, for example “animal” is an hypernym of “cat”) in WordNet, and conjugate verbs in various languages comprehending Italian.

...