Compilation Without AST

Dmitriy Kubyshkin
2 weeks, 6 days ago
When I started the work on the Mass language I wanted to challenge some of the established practices of writing compilers with one of them being the presence of abstract syntax tree (AST) as intermediary representation for the program. This month I have admitted defeat and introduced something like an AST. In this post, I want to share some of the details and the reasoning for the switch.

There are two main ways the compilers are written today, both involving an AST. The first, more traditional approach is to have a pipeline with at least the following steps:

tokenization -> parsing (AST generation) -> type checking -> assembly generation-> executable

Some languages introduce more steps including intermediary byte code generation, object files and linking. It is also usually the case that type checking and assembly generations are themselves do multiple traversals over the syntax tree. GCC, Clang, Rust, Go and many other compilers use this setup.

The second approach is what is sometimes called "query-based compilers". The first two steps are the same, but instead of immediately proceeding to type checking the compiler waits for a query about the type of some part of the AST. For the regular compilation, it usually means asking for the type of main entry point which will trigger type checking for all the code that is reachable from main. The same process also is used to show you the type of a variable in IDEs. C# and TypeScript compilers use this architecture.

Out of somewhat prominent compilers, the only one that does not use an AST (from my understanding) is the Tiny C Compiler where the pipeline looks something like this:

tokenization -> assembly generation-> executable

It is also how Mass used to work till now. The natural question is that if it works for TCC, why did I introduce an AST after all? What separates Mass from C is the presence of both the type inference and function overloading. I was able to push the code quite far even with these challenges but in the end, it was just too hacky and resulted in bad machine instructions being generated. Let's consider the following code:

fn foo(s : String) {}
fn foo(s : s64) {}

fn main() {
  foo(some expression here)

When the parser comes across a call to foo it needs to decide which overload to pick. Since the argument is not yet parsed, its type is unknown. In the previous setup, to understand the type we also needed to generate the assembly instructions for the expression, which in turn means putting the result of this expression somewhere, typically the stack. For the String version of foo the argument does need to be on the stack but for the integer version it is wasteful.

The new approach is to parse the code first into a series of typed thunks (closures) that when forced (called) generate the assembly instructions for the operation. The knowledge of type allows picking the right overload before generating any instructions which in turns makes it possible to put the result of the expression directly into the expected storage for the argument.

I'm generally pretty happy with the new setup, but it is not without problems. The biggest one that I need to consider is that because the code is represented as nested thunks, deeply nested expressions can generate stack overflows in code generation. I might just increase stack size or switch to a different representation. We'll see.
2 weeks, 4 days ago
Doing multiple sweeps can be insanely fast by only parsing expressions in the final sweep using a clear separation between statements and expressions. Handle statements using a fast stack parser using multiple sweeps only looking at the first token on each line, then call the expression parser recursively from inside of statements during one of the statement sweeps. You just need to find a basic building block that tells where scopes begin/end and cannot be contained within an expression. Easy for Basic (line-break sensitivity) and Python (explicit scope depth) but impossible for C++ (too nested). Only a few microseconds per updated module, which is nothing compared to trashing and fragmenting the cache with an AST. You can use quick sweeps over pre-tokenized lines while populating a scope stack based on each line's first or last token. Then you don't even have to evaluate the expressions inside the lines until the code is finally generated. Even better is that the separation of expressions take down the complexity from module size to line size, so that you can get much faster results with a dynamic look-up table that can be parsed when the compiler starts. Then you can let the user supply new dialects and extensions at runtime.

You can also call a separate expression parser optimized for simpler types of expressions, like long arrays being initialized. Then split by commas before evaluating each part individually without combining them into quadratic complexity. If done right, a large program can be compiled from zero in an instant without even noticing the delay.

Stack based parser
Module -> Stmt

Dynamic bottom-up parser
Stmt -> Expr

The 1000000X slowness in compiling C++ comes from both headers being compiled multiple times and not separating statements from expressions.
Log in to comment