Writing a Tree-sitter grammar, I found the UX is great!
In recent months I worked on a new project, a dashboard/blog/dataviz experiment focused on the city of Milan, and in doing so I started focusing on ways to automate and validate the integration of WebAssembly, DuckDB SQL, JS, Markdown and other things.
A concept I come across again and again while playing with this is parsing. I use tools to parse Markdown, transpile TypeScript into something the browser understands, translate schemas from JSON to Parquet to Vega-Lite charts, and perform a lot of other transformations.
I also have an obsession with validating as much as possible as early as possible: finding out, for example, if I’m trying to display a value that does not exist in a chart generated from a Parquet file, in turn produced by a DuckDB query. Or maybe automating some step, like detecting which values could be displayed and how. These kinds of automations are hard.
Anyway, doing so I often reach a point where I think it would be great to be able to parse and manipulate code programmatically: finding out types and functions, allowing the user to input configuration as scripts that can be executed, reporting on errors, adding macros to languages that don’t have them (yet).
But above all, I think it would be fun to perform these kinds of manipulations (even when they are likely a bad idea) and, who knows, one day even create yet another programming language. You know, the world needs another one!
So, I started looking into grammars, parsers and parser generators. It’s a concept I studied in the languages and compilers course at university many years ago, and I even wrote my own top-down parser to annotate natural language, but what about a full-fledged grammar?
What is a grammar anyway?
A generative grammar is a way to describe a formal language. “Formal” here means that the rules of the language are explicit, as opposed to “natural” languages like Spanish, and it is generally possible to demonstrate whether a string is valid or not according to the rules. As a byproduct we can also get an idea of the structure of a string, assigning labels and finding hierarchies in its content (for example, in a programming language we can have a function containing loops containing statements).
Writing a program that can actually do this, however, is a completely different matter, and a fascinating topic I recommend if you are interested in theoretical computer science. From now on I’ll focus on the subset of grammars that are used in practice in software development.
Let’s imagine we have the symbols `)`, `(`, `+`, and `8`. We can see how some strings are valid:

8
8 + 8
8 + (8 + (8 + 8)) + 8

and others are not:

8 + + 8
8 +
8 + ( 8
A generative grammar is based on a set of production rules: each rule defines how we can replace a symbol with a sequence of other symbols or terminals (that is, symbols of the alphabet, like the `+` here, so called because they are not replaced by anything). Given a special starting symbol S, you can apply rules in any order, replacing symbols with other symbols over and over until you have a sequence of only terminals. Every terminal string obtained this way is valid.
In this case, we can define these rules:

1. S -> E
2. E -> 8
3. E -> E + E
4. E -> ( E )
Let’s take the expression 8 + (8 + (8 + 8)) + 8:

- We start with the symbol S
- Apply rule 1, get E
- Apply rule 3, get E + E
- Apply rule 2 to the first E, get 8 + E
- Apply rule 3 to the remaining E, get 8 + E + E
- Apply rule 2 to the last E, get 8 + E + 8
- Apply rule 4 to the only E, get 8 + ( E ) + 8
- Apply rule 3, get 8 + ( E + E ) + 8
- Apply rule 2 to the first E inside the parentheses, get 8 + ( 8 + E ) + 8
- Apply rule 4, get 8 + ( 8 + ( E ) ) + 8
- Apply rule 3, get 8 + ( 8 + ( E + E ) ) + 8
- Apply rule 2 twice, get 8 + (8 + (8 + 8)) + 8
The final string is made only of terminals, and it is therefore valid according to this grammar.
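To make this concrete, here is a minimal hand-written recognizer for this toy grammar (a sketch of my own, not what a parser generator would produce). One catch: the left-recursive rule E -> E + E cannot be translated naively into a recursive function, so it is rewritten as a loop, i.e. E -> T ('+' T)*, with T -> '8' | '( E )'.

```python
def is_valid(text: str) -> bool:
    """Check whether `text` is valid according to the toy grammar."""
    # Every terminal is a single character, so tokenizing is trivial here.
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def parse_e() -> bool:
        # E -> T ('+' T)*   (left recursion rewritten as a loop)
        nonlocal pos
        if not parse_t():
            return False
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            if not parse_t():
                return False
        return True

    def parse_t() -> bool:
        # T -> '8' | '(' E ')'
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "8":
            pos += 1
            return True
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            if parse_e() and pos < len(tokens) and tokens[pos] == ")":
                pos += 1
                return True
        return False

    # Valid only if E parses and all input was consumed.
    return parse_e() and pos == len(tokens)

print(is_valid("8 + (8 + (8 + 8)) + 8"))  # True
print(is_valid("8 + + 8"))                # False
```

A recognizer only answers yes/no; a real parser would additionally build the tree while consuming tokens.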
Keeping track of what was expanded into what, we can transform each expression into a tree, called an AST (Abstract Syntax Tree).
Given a grammar stored in a file, a parser generator (like ANTLR) can generate code to parse that language and build the AST.
For the majority of formal languages, like SQL or Python, there’s a generative grammar somewhere. In Python you can even use the `ast` module in the standard library to parse Python code.
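For example, a few lines of `ast` are enough to parse a snippet and inspect its structure:

```python
import ast

# Parse a small snippet of Python source into its AST.
tree = ast.parse("def double(x):\n    return x * 2")

func = tree.body[0]                 # the first top-level statement
print(type(func).__name__)          # FunctionDef
print(func.name)                    # double
print(type(func.body[0]).__name__)  # Return
```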
Other languages, like the Postgres implementation of SQL, make it less clear: you have to use their internal parser or essentially rewrite the grammar by hand, since the parser is integrated in the code of the application or is something specific to it (see SQLite’s Lemon).
Realizing this baffled me: formal languages and grammars have been around for decades and virtually every language has some grammar, so I always assumed there were one or a few libraries used by everyone, and likewise a mainstream grammar format. This is not really the case!
Another aspect to mention is that usually a grammar is not enough: you need a lexer too. This is a tool that splits a blob of text into “tokens”, which are the terminals. You could treat each single character as a token, but in practice it’s usually better to split the text in a smarter way.
Python and YAML, for example, make it quite essential: indentation in Python has a semantic role, and it’s extremely complicated to write a grammar that handles whether each line has the same indentation as the previous one or starts a different block. The grammar can’t really “remember” the indentation when expanding a symbol. So parser generators usually come with a lexer (that’s what compiler people call a tokenizer), and some syntactic sugar to integrate it into the grammar definition.
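As a sketch of what a lexer does, here is a toy tokenizer based on a regular expression. The token kinds (`NUMBER`, `NAME`, `OP`) are made up for illustration, not taken from any real grammar:

```python
import re

# Each named group is a token kind; whitespace is matched and discarded.
TOKEN_RE = re.compile(
    r"""
      (?P<NUMBER>\d+)
    | (?P<NAME>[A-Za-z_]\w*)
    | (?P<OP>[+\-*/()=])
    | (?P<SKIP>\s+)
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split `text` into (kind, value) pairs.

    Note: characters matching no pattern are silently skipped in this sketch;
    a real lexer would report them as errors.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize("total = 8 + price"))
# [('NAME', 'total'), ('OP', '='), ('NUMBER', '8'), ('OP', '+'), ('NAME', 'price')]
```

The parser then works on this stream of categorized tokens instead of raw characters.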
In addition to that, there is a plethora of parsing algorithms: LR(1), LALR, Earley, LL, etc.
The plan: write a grammar for Monicelli first
To delve into this topic (by the way, I used the word “delve” before ChatGPT, and this article is written by a human in meatspace) I decided to write a grammar for a simple language to see how the process works.
Once I have this grammar I can generate a parser for it, and try to integrate it in a web page (that is, run it on the client using JS or WASM) or a Python application. Later, I can try to implement an interpreter or a compiler for it.
I decided to go with Monicelli for my experiment. Monicelli is an esoteric programming language based on the so-called “supercazzole” from the movie Amici Miei, a masterpiece of Italian comedy. If you don’t speak Italian and don’t get the reference, just know it’s a very simple language. It supports a few variable types, functions, loops, switch-case, print, input, comments, assert and panic.
Since the language already exists, I can find a few examples of Monicelli code and focus only on the grammar. Besides, any result may be of interest to other people curious about the language.
OK, but how can I actually create a parser??
There are plenty of parser generators, and as far as I can see none is considered the “mainstream” one. As mentioned, many languages and tools implement their own thing. Not only that, there isn’t even a standard format for grammars! There is the Backus-Naur form, but even then there are many dialects and variations.
I looked into ANTLR, Parsimonious, Peggy, Lark, Tree-sitter and probably another five I can’t remember.
ANTLR is one of the most powerful of the bunch, and it can generate JavaScript or TypeScript, meaning it can be integrated in a web page and even run offline in a PWA, which I consider essential. Another advantage of ANTLR is its wide collection of existing grammars, useful for reference and reuse.
Eventually however I decided to go with Tree-sitter, mainly because of the incremental update feature, the documentation and the large community.
Tree-sitter in a nutshell
Tree-sitter is a parser generator with an interesting feature: it can react to changes in the input and update the AST incrementally. This makes it suitable for editors, where it can react to changes made by the user.
It was, in fact, developed originally for the Atom text editor and has integrations for neovim, Emacs, Helix and others.
As with ANTLR, it has plenty of existing grammars and can run in the browser. The Python binding is very easy to use, too (in fact, every grammar is already a Python, JavaScript, Swift and Go package out of the box).
Tree-sitter is particularly focused on use inside editors, and provides out of the box a mechanism to highlight syntax based on a grammar.
The joy of writing a grammar
After trying Tree-sitter and using it to write a grammar I can say I’m pleasantly surprised by how good the whole experience is, considering the task at hand is uncommon and traditionally low-level.
Installing it is as easy as `npm install -g tree-sitter-cli`. The `tree-sitter` command then has an `init` subcommand and everything needed to test and build.
A tree-sitter grammar is essentially a `grammar.js` file, plus optionally some C logic to define a lexer (as the Python grammar does to handle indentation, as mentioned).
It may sound weird that a grammar is written in JavaScript, but the language essentially allows you to easily script out repetitive parts of the grammar definition.
A grammar rule can look like this:
non_main_function: $ => seq(
choice('blinda la supercazzola', 'blinda la supercazzora'),
optional($.variable_type),
field('function_name', $.variable_name),
optional(seq('con', $.function_arguments_def)),
'o scherziamo?',
repeat($.statement),
prec(3, $.return),
),
This defines a symbol `non_main_function` as a sequence (`seq`) of elements: some are literal strings (like the `o scherziamo?` part), others can be repeated, be named (using the `field` helper), or be one of a few options (the `choice`).
The command `tree-sitter generate` runs the JS code and produces the corresponding grammar as a JSON file, plus the generated parser and a few bindings for the lexer.
Testing the grammar
The process of writing a grammar is quite abstract, and you need to test it. Before starting, I expected regressions to be a big issue, and that I would need to write some script to test snippets against my grammar.
I was pleasantly surprised to find that tree-sitter comes with a `test` command, which applies the grammar to a set of given input strings and compares the resulting AST with the expected one.
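Test cases live in plain text files under `test/corpus/`: a name between lines of `=`, the input snippet, then a `---` separator and the expected AST as an S-expression. The node names below are invented for illustration; they depend on your grammar:

```
==================
simple addition
==================

8 + 8

---

(source_file
  (expression
    (number)
    (number)))
```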
Queries
Tree-sitter comes with a powerful query function, allowing you to match patterns in an AST. This is extremely useful for linters and refactoring tools, but I haven’t used it much so far.
Building the grammar and highlighting
A tree-sitter grammar is essentially a package, installable from the very beginning. The `tree-sitter build` command prepares it so that it can be imported as a Python package (for example).
Since Tree-sitter focuses on usage within an IDE, it also allows you to define syntax highlighting: an AST query can be matched with a tag (e.g. `comment`) to tell an editor which style to use on a span of text.
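These mappings live in a query file, conventionally `queries/highlights.scm`. The node names below are hypothetical, except for the literal taken from the grammar rule shown earlier:

```scheme
; map syntax nodes and literals to highlight groups
(comment) @comment
(number) @number
"blinda la supercazzola" @keyword
```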
Also, the command `tree-sitter parse` can immediately produce and show the AST for a given snippet.
Result
My grammar is here, and it took a few hours to write. The errors produced by Tree-sitter are quite readable, particularly when handling ambiguity and precedence.
Next steps
This was, so far, an experiment in grammars and parsing. I plan later to try and use this to write tools and hopefully a compiler to WebAssembly. I will report if and when I have interesting results.
Happy 2025!