Jacopo Farina's blog

Calculating the reachability of Metro stations in Milan

2025.02.18

Some time ago I saw a Reddit post on r/milano presenting a visualization of the nearest Metro or train station in the city of Milan. In short, it shows for each station the area “covered” by it, that is, the area for which it is the closest station. That visualization uses Voronoi cells built with the Metro stations as the centers, so the distance metric is the geodesic distance (“as the crow flies”). However, if a station is easy to reach by tram or, on the other hand, is surrounded by railways or highways that make crossing difficult, this distance will not represent well how “reachable” the station is by someone walking or using the tram....

Analyzing the Tatoeba dataset

2025.02.07

Tatoeba is a website to crowdsource sentences translated into several languages, a resource that is very useful to language learners and people interested in NLP. I am a contributor to and a user of Tatoeba, where I mostly translate sentences into Italian. In 2020 Tatoeba organized an event called Kodoeba, in which I participated with an automated cloze deletion tool. In this article I’m going to analyze the Tatoeba dataset and build some charts from it....

Writing a Tree-sitter grammar, I found the UX is great!

2024.12.23

In the last months I worked on a new project, a dashboard/blog/dataviz experiment focused on the city of Milan, and in doing so I started focusing on ways to automate and validate the integration of WebAssembly, DuckDB SQL, JS, Markdown and other things. A concept I come across again and again when playing with this is parsing. I use tools to parse Markdown, transpile TypeScript into something the browser understands, translate schemas from JSON to Parquet to Vega-Lite charts, and perform a lot of other transformations....

Lots of fun with Postgres and Python timezone shenanigans!

2023.08.07

There are few things developers love more than having to handle timezones. One of them is having to handle timezones in different environments! Lately I had to deal with some timezone operations across Python and Postgres and decided to document here the shenanigans and quirks of the two systems and how I try to avoid them. TIMESTAMP WITH TIME ZONE does NOT store a timezone. This is something I knew already, but it irks me every time I remember it exists....
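To make the point about TIMESTAMP WITH TIME ZONE concrete, here is a minimal pure-Python sketch that imitates the behavior without touching a database. It assumes the documented Postgres behavior (the value is kept as a UTC instant and converted to the session time zone on output); the function names and the fixed offsets are made up for illustration and are not taken from the post.

```python
from datetime import datetime, timedelta, timezone


def store_as_timestamptz(value: datetime) -> datetime:
    """Imitate storage: keep only the UTC instant, dropping the original offset."""
    if value.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return value.astimezone(timezone.utc)


def read_with_session_timezone(stored: datetime, session_tz: timezone) -> datetime:
    """Imitate output: render the stored instant in the session's time zone."""
    return stored.astimezone(session_tz)


# Hypothetical fixed offsets, used instead of named zones to keep the sketch
# free of any dependency on a time zone database.
rome_summer = timezone(timedelta(hours=2))
tokyo = timezone(timedelta(hours=9))

original = datetime(2023, 8, 7, 12, 0, tzinfo=rome_summer)
stored = store_as_timestamptz(original)
read_back = read_with_session_timezone(stored, tokyo)

# Same instant in all three cases, but the +02:00 offset is not recoverable
# from `stored`: what comes back depends only on the session time zone.
print(original.isoformat())
print(stored.isoformat())
print(read_back.isoformat())
```

If the original offset matters, it has to be saved separately, for example in its own column.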

Implement a CHIP-8 emulator in Python

2023.02.21

For quite some time I entertained the idea of implementing an emulator. My knowledge of low level programming is mostly theoretical and this would be a good chance to learn more, and also to experiment with optimizations I rarely encounter in my usual machine learning tasks (which are based on libraries like Numpy and Scikit-learn that already take care of the heavier operations). The Game Boy is an obvious candidate, since it is a console I had as a kid, it is well documented and there are many existing implementations of it, including a Python one....

Making a fully static map, part 3: Text search

2022.11.21

NOTE: a complete interactive demo of the final result is here. In the previous post of this series, we saw how to generate vector tiles starting from an OpenStreetMap PBF extract using Tilemaker. After the article I refined the process and wrote a Python tool to automate it, adding the possibility to index named objects like streets and shops. Usually, such a search would be performed using a geocoding service that can handle the full text search with all the nuances like alternative spellings, typos and ambiguities....

Making a fully static map, part 2: Vector tiles

2022.06.01

UPDATE: I created a Python tool to automate this process, including a refined style and packaging. I suggest using it. In the previous post of this series we saw that an extract of the data from OpenStreetMap can be easily transformed into a set of raster tiles, essentially fragments of the map at different levels of zoom, arranged in a structure that enables a library like Leaflet.js to fetch them as needed when the user zooms and pans on the map....
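As an illustration of the tile structure mentioned above, here is a small sketch of the widely used XYZ ("slippy map") scheme, which is what Leaflet's default tile layer expects. Whether the post uses exactly this layout is an assumption; the function names and the `.png` extension are hypothetical.

```python
import math


def lonlat_to_tile(lon_deg: float, lat_deg: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) indices of the Web Mercator tile containing a point."""
    n = 2 ** zoom  # number of tiles per axis at this zoom level
    lat_rad = math.radians(lat_deg)
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the eastern or southern edge stay in range.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tile_path(zoom: int, x: int, y: int) -> str:
    """Relative path of a tile, matching a '{z}/{x}/{y}.png' URL template."""
    return f"{zoom}/{x}/{y}.png"


# Approximate coordinates of the center of Milan (an illustrative value).
x, y = lonlat_to_tile(9.19, 45.46, 12)
print(tile_path(12, x, y))
```

With this layout the tiles are plain files in nested directories, so any static file server can deliver them.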

Making a fully static map, part 1: Generate raster tiles from QGIS

2022.05.28

In this article we are going to implement an interactive map that can be included in a fully static website. By fully static I mean that the map does not rely on any external service or backend: it is just a bunch of files served directly by nginx (like this blog) or even a CDN. This approach is generally cheaper and simpler to operate, maintain and migrate, without depending on external services whose terms of use may change....

Lessons learned using Postgres in production

2022.04.11

On April 12th, 2022 I will give a talk at PyCon Berlin about how we use Postgres in a data science project at Flixbus. These are the slides for this presentation. You can contact me on the conference Discord, Twitter, GitHub or in person at the venue. Download the presentation...

Render a building in 3D from OpenStreetMap data

2022.02.21

For quite some time I have had an interest in GIS and rendering, and after experimenting with the two separately I decided to finally try and render geographical data from OpenStreetMap in 3D, focusing on a small scale never bigger than a city. In this article I will go through the process of generating a triangle mesh from a building shape, rendering it and exporting it in a format suitable for Blender or game engines like Godot....

Insert data into Postgres. Fast.

2021.04.25

The task of ingesting data into Postgres is a common one in my job as a data engineer, and also in my side projects. As such, I learned a few tricks that I’m going to discuss here, in particular related to ingesting data from Python and merging it with existing rows. Before starting, I have to say the fastest way to insert data into a Postgres DB is the COPY command, which has a counterpart \copy on the psql CLI tool that is useful to invoke it remotely....
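As a rough sketch of what feeding COPY from Python can look like, the snippet below serializes rows to an in-memory CSV buffer and hands it to a cursor. It assumes a psycopg2 cursor (which provides `copy_expert`); the helper names, the table and the columns are hypothetical, and this is not necessarily the approach described in the post.

```python
import csv
import io


def rows_to_csv_buffer(rows):
    """Serialize rows to an in-memory CSV file, rewound and ready for COPY."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # None is written as an empty unquoted field, which COPY's csv format
    # reads as NULL by default (so empty strings also become NULL here).
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


def copy_rows(cursor, table, columns, rows):
    """Bulk load rows through COPY ... FROM STDIN.

    `cursor` is assumed to be a psycopg2 cursor. `table` and `columns` are
    interpolated as-is, so they must come from trusted code, not user input.
    """
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    cursor.copy_expert(sql, rows_to_csv_buffer(rows))
    return sql


# Inspecting the payload requires no database connection.
print(rows_to_csv_buffer([(1, "a,b"), (2, None)]).getvalue())
```

With a real connection the call would look something like `copy_rows(conn.cursor(), "my_table", ["id", "label"], rows)` followed by a commit.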

Generate a grammar quiz in 300+ languages using simple NLP

2021.04.24

In this article I’ll explain how I populated the database that powers grammarquiz, a grammar quiz app that I created for the Kodoeba initiative. The code of the application is freely available. The backstory: as you may already know, Tatoeba is a database of sentences translated into different languages. The database is at this time (early 2021) almost 10 million sentences strong and keeps growing. The dataset can be downloaded and used with an open license, similar to Wikipedia or OpenStreetMap, which makes it very interesting for users who, like me, have an interest in NLP and languages....
