There are two main ways compilers are written today, both involving an AST. The first, more traditional approach is a pipeline with at least the following steps:
```
tokenization -> parsing (AST generation) -> type checking -> assembly generation -> executable
```
Some languages introduce more steps, including intermediate bytecode generation, object files, and linking. It is also usually the case that type checking and assembly generation themselves do multiple traversals over the syntax tree. GCC, Clang, Rust, Go, and many other compilers use this setup.
The second approach is what is sometimes called a "query-based compiler". The first two steps are the same, but instead of immediately proceeding to type checking, the compiler waits for a query about the type of some part of the AST. For a regular compilation this usually means asking for the type of the main entry point, which triggers type checking for all the code reachable from main. The same process is also used to show you the type of a variable in an IDE. The C# and TypeScript compilers use this architecture.
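To make the idea concrete, here is a minimal sketch in C of a memoized type query. All the names (`Ast_Node`, `type_of`, the `Type` enum) are hypothetical illustrations; real query-based compilers are far more elaborate:

```c
typedef enum { TYPE_UNKNOWN = 0, TYPE_S64, TYPE_STRING } Type;

typedef struct Ast_Node Ast_Node;

struct Ast_Node {
  Type cached_type;                 /* memoized answer to the type query */
  Type (*compute_type)(Ast_Node *); /* node-specific typing rule */
  Ast_Node *children[2];            /* sub-expressions, if any */
};

/* The "query": type checking happens only when somebody asks, and the
   answer is cached, so repeated queries (e.g. from an IDE) are cheap. */
Type type_of(Ast_Node *node) {
  if (node->cached_type == TYPE_UNKNOWN) {
    node->cached_type = node->compute_type(node);
  }
  return node->cached_type;
}

/* Example rule: an operation queries its operands recursively, so asking
   for the type of `main` pulls in exactly the code reachable from it. */
static Type type_of_addition(Ast_Node *node) {
  Type left = type_of(node->children[0]);
  Type right = type_of(node->children[1]);
  return (left == TYPE_S64 && right == TYPE_S64) ? TYPE_S64 : TYPE_UNKNOWN;
}
```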
Among reasonably prominent compilers, the only one that (to my understanding) does not use an AST is the Tiny C Compiler (TCC), where the pipeline looks something like this:
```
tokenization -> assembly generation -> executable
```
This is also how Mass worked until now. The natural question is: if this works for TCC, why did I introduce an AST after all? What separates Mass from C is the presence of both type inference and function overloading. I was able to push the code quite far even with these challenges, but in the end it was just too hacky and resulted in bad machine instructions being generated. Let's consider the following code:
```
fn foo(s : String) {}
fn foo(s : s64) {}

fn main() {
  foo(some expression here)
}
```
When the parser comes across a call to foo, it needs to decide which overload to pick. Since the argument is not yet parsed, its type is unknown. In the previous setup, to learn the type we also had to generate the assembly instructions for the expression, which in turn meant putting the result of the expression somewhere, typically on the stack. For the String version of foo the argument does need to be on the stack, but for the integer version it is wasteful: the value could have gone straight into a register.
The new approach is to first parse the code into a series of typed thunks (closures) that, when forced (called), generate the assembly instructions for the operation. Knowing the type allows picking the right overload before generating any instructions, which in turn makes it possible to put the result of the expression directly into the storage the argument expects.
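Here is a minimal sketch in C of what such a typed thunk could look like. The names (`Thunk`, `Storage`, `compile_call_to_foo`) and the register/offset values are hypothetical illustrations, not the actual Mass implementation:

```c
#include <stdint.h>

typedef enum { TYPE_S64, TYPE_STRING } Type;

/* Where an expression result should end up: a register or a stack slot. */
typedef struct {
  enum { STORAGE_REGISTER, STORAGE_STACK } kind;
  int32_t index_or_offset; /* register index, or offset from the frame base */
} Storage;

typedef struct Thunk Thunk;

/* A typed thunk: the type is known right after parsing, but no
   instructions exist until the thunk is forced. */
struct Thunk {
  Type type;
  void *parse_state; /* captured parse result for this expression */
  void (*force)(Thunk *self, Storage destination);
};

/* Overload resolution can inspect thunk->type before any codegen... */
void compile_call_to_foo(Thunk *argument) {
  Storage destination;
  if (argument->type == TYPE_S64) {
    /* integer overload: the value can go straight into a register */
    destination = (Storage){ STORAGE_REGISTER, 7 /* e.g. RDI */ };
  } else {
    /* String overload: the argument really does live on the stack */
    destination = (Storage){ STORAGE_STACK, -8 /* hypothetical slot */ };
  }
  /* ...and only now are instructions emitted, directly into the
     storage the chosen overload expects. */
  argument->force(argument, destination);
}
```

The key point is that forcing is deferred until a destination is known, so the integer case never needs a temporary stack slot.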
I'm generally pretty happy with the new setup, but it is not without problems. The biggest one I need to consider is that because the code is represented as nested thunks, deeply nested expressions can cause stack overflows during code generation. I might just increase the stack size or switch to a different representation. We'll see.