Meta Assembly Language»Blog
Dmitriy Kubyshkin
It has been a couple of months since the last update. As usual, there are a lot of internal changes to the compiler, but there are some interesting externally visible changes as well. I spent some time tightening things that seem stable enough, such as tokenization, encoding, and functions.

For the tokenization, I have switched from a hand-written tokenizer to re2c. It provided a nice boost to the compilation speed almost for free. I have also implemented caching for common number literals (0-9) and identifier names for another small boost.
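
The caching idea can be sketched roughly as follows. This is Python rather than the compiler's own implementation language, and the `TokenCache` class with its `number` and `name` methods is hypothetical, not taken from the Mass source. It only shows values for the literals 0-9 being built once up front and identifier names being stored on first use so that repeats share one object.

```python
class TokenCache:
    """Hypothetical cache for common number literals and identifier names."""

    def __init__(self):
        # Pre-build values for the single-digit literals "0" through "9".
        self._small_numbers = {str(n): ("number", n) for n in range(10)}
        # Identifier names are cached lazily, on first use.
        self._names = {}

    def number(self, text):
        cached = self._small_numbers.get(text)
        if cached is not None:
            return cached
        # Anything else is parsed and allocated as usual.
        return ("number", int(text))

    def name(self, text):
        # setdefault stores the first occurrence and returns it afterwards.
        return self._names.setdefault(text, ("name", text))
```

With this, `cache.number("7")` returns the same object on every call, while a literal such as `"42"` still goes through the normal path.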

With encoding, once I decided to ignore possible machine code size improvements from a relaxation step, I realized that I could eagerly turn assembly instructions into bytes, which saves a whole lot of memory and some compilation time. You can see me working on it in a YouTube video.
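
To illustrate why skipping relaxation allows eager encoding, here is a minimal Python sketch. It assumes x86-64 and always uses the 5-byte `jmp rel32` form (opcode `0xE9` plus a 32-bit little-endian displacement) instead of ever shrinking to the 2-byte `jmp rel8` form. Because every instruction has a fixed size, bytes can be written immediately and forward jumps patched in place later. The `CodeBuffer` class and its method names are hypothetical and do not reflect how Mass is actually structured.

```python
import struct


class CodeBuffer:
    """Hypothetical eager encoder: instructions become bytes right away."""

    def __init__(self):
        self.bytes = bytearray()
        self.labels = {}    # label name -> byte offset
        self.patches = []   # (offset of the rel32 field, label name)

    def emit_nop(self):
        self.bytes.append(0x90)  # x86 NOP

    def emit_jmp(self, label):
        # Always the rel32 form, so the size is known without relaxation.
        self.bytes.append(0xE9)
        self.patches.append((len(self.bytes), label))
        self.bytes += b"\x00\x00\x00\x00"  # placeholder displacement

    def bind(self, label):
        self.labels[label] = len(self.bytes)

    def finish(self):
        # The displacement is relative to the end of the jump instruction.
        for field_offset, label in self.patches:
            rel = self.labels[label] - (field_offset + 4)
            struct.pack_into("<i", self.bytes, field_offset, rel)
        return bytes(self.bytes)
```

The trade-off is the one mentioned above: a nearby jump costs 5 bytes instead of 2, but no instruction list has to be kept in memory for a later sizing pass.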

The big thing with functions is the basic support for generic arguments. Right now there is no way to constrain the type when matching overloads, but even this basic functionality allowed moving the type definitions of some of the compiler intrinsics into the user land:

type_of :: @fn(x) -> (Type) PRELUDE_MASS.type_of
size_of :: @fn(x) -> (Number_Literal) PRELUDE_MASS.size_of


Generic function support also tightened up the internals quite a bit, allowing some of the former intrinsics to move to the user land. The main example of this is the fixed-size array type function:

Array :: fn(item : Type, length : u64) -> (Type) {
  mass :: PRELUDE_MASS
  byte_size :: mass.Descriptor.bit_size.as_u64 * 8
  byte_alignment :: mass.Descriptor.bit_alignment.as_u64 * 8
  t : Type = mass.allocator_allocate_bytes(mass.allocator, byte_size, byte_alignment)
  t.tag = mass.Descriptor_Tag.Fixed_Size_Array
  t.bit_size.as_u64 = item.bit_size.as_u64 * length
  t.bit_alignment = item.bit_alignment
  t.Fixed_Size_Array.item = item
  t.Fixed_Size_Array.length = length
  t
}


There were also a bunch of smaller notable changes:

  • Linux Syscall Support
  • Basic Tuple Support
  • Support specifying output path for the CLI
  • Moved arithmetic operator definitions to the user land
  • Support building with Clang-CL on Windows
  • Significantly simplified and sped up the `using` implementation
Dmitriy Kubyshkin
The major item this month for me was feature parity between the Windows and Linux JIT implementations. Getting Linux to run some bytes I passed to it was the easy part and took only about 30 minutes.
The hard part was getting the System V ABI to a workable state. One reason is the fundamental complexity of the algorithm used to determine how arguments are packed into registers, but it is also made worse by the fact that the technical documentation leaves much to be desired.
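
For a sense of the simplest case only, here is a Python sketch of how the System V AMD64 calling convention assigns scalar arguments: integers and pointers go to `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` in order, floats and doubles go to `xmm0` through `xmm7`, and whatever does not fit is passed on the stack. The function name and the `"int"` / `"float"` kind strings are made up for this example. The genuinely hard part, classifying structs into eightbytes, is deliberately left out, as are `long double` and 128-bit integers.

```python
INTEGER_REGISTERS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
SSE_REGISTERS = [f"xmm{i}" for i in range(8)]


def assign_scalar_arguments(kinds):
    """Map each scalar argument kind ("int" or "float") to its location."""
    next_int = 0
    next_sse = 0
    stack_offset = 0
    locations = []
    for kind in kinds:
        if kind not in ("int", "float"):
            raise ValueError(f"unsupported argument kind: {kind}")
        if kind == "int" and next_int < len(INTEGER_REGISTERS):
            locations.append(INTEGER_REGISTERS[next_int])
            next_int += 1
        elif kind == "float" and next_sse < len(SSE_REGISTERS):
            locations.append(SSE_REGISTERS[next_sse])
            next_sse += 1
        else:
            # Out of registers: each scalar takes one 8-byte stack slot.
            # The offset is relative to the start of the stack argument area.
            locations.append(f"stack+{stack_offset}")
            stack_offset += 8
    return locations
```

Note that the two register sequences advance independently, so `["int", "float", "int"]` maps to `rdi`, `xmm0`, `rsi`.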

Also on the Linux front there is now support for referencing dynamic libraries in the JIT:

write :: fn(descriptor : s32, buffer : &u8, size : u64)
  -> (s64) external("libc.so.6", "write")
STDOUT_FILENO :: 1
main :: fn() -> () {
  hello :: "Hello, world!\n"
  write(STDOUT_FILENO, hello.bytes, hello.length)
  import("std/process").exit(0)
}
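
For comparison, roughly the same binding can be expressed with Python's `ctypes`, which also loads a shared library by name and looks up a symbol in it. This is only an analogy for what `external("libc.so.6", "write")` has to accomplish, not a description of how Mass implements it. The `load_libc` and `write_bytes` helpers are hypothetical, and the fallback to `ctypes.CDLL(None)` is an assumption for systems where `libc.so.6` is not the library name.

```python
import ctypes
import os


def load_libc():
    try:
        return ctypes.CDLL("libc.so.6")
    except OSError:
        # Assumed fallback: symbols already loaded into the current process.
        return ctypes.CDLL(None)


libc = load_libc()
write = libc.write
# Mirror the C prototype: ssize_t write(int fd, const void *buf, size_t count)
write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
write.restype = ctypes.c_ssize_t


def write_bytes(descriptor, data):
    """Call libc's write() and return the number of bytes written."""
    return write(descriptor, data, len(data))
```

Calling `write_bytes(1, b"Hello, world!\n")` writes the greeting to standard output, just like the Mass program above.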


Besides a ton of refactoring here are some more highlights:


Dmitriy Kubyshkin
The majority of the time since the last update was taken by abstracting away calling convention code to prepare for the System V ABI used by Linux and Mac. With that done, adding the Linux JIT setup was a breeze. There is still lots to be done for proper System V ABI compatibility as well as dlopen integration, but it is pretty exciting to have a custom backend that works across multiple platforms.

Besides that there were also many smaller changes:


Dmitriy Kubyshkin
The majority of the development up until about a month ago was about figuring out the basic features of the compiler. As you start to combine them, new and unexpected cases need to be solved. Besides the common language issues such as signed / unsigned integer handling, Mass has a lot of its own problems to solve. The trickiest one is handling the boundary between compile-time execution and runtime code. So far there is no production language with the same power as what I aim for, so there is no real way to know the correct way to do it, and this is what I spend a lot of time on.

Besides the robustness work, here are the things that have been added to the compiler in the last month:

  • basic embedded debugger REPL
  • uniform and typed errors in the compiler
  • conditional constant definition via "using" and "if" expression
  • compile-time function definitions
  • static_assert() implemented using meta-programming
  • function default argument type inference


There are some exciting things I plan for the coming months, and I am already looking forward to the next update.
Dmitriy Kubyshkin
When I started the work on the Mass language I wanted to challenge some of the established practices of writing compilers, one of them being the presence of an abstract syntax tree (AST) as an intermediary representation of the program. This month I have admitted defeat and introduced something like an AST. In this post, I want to share some of the details and the reasoning for the switch.

There are two main ways the compilers are written today, both involving an AST. The first, more traditional approach is to have a pipeline with at least the following steps:

tokenization -> parsing (AST generation) -> type checking -> assembly generation -> executable


Some languages introduce more steps, including intermediary byte code generation, object files and linking. It is also usually the case that type checking and assembly generation themselves do multiple traversals over the syntax tree. GCC, Clang, Rust, Go and many other compilers use this setup.

The second approach is what is sometimes called "query-based compilers". The first two steps are the same, but instead of immediately proceeding to type checking the compiler waits for a query about the type of some part of the AST. For the regular compilation, it usually means asking for the type of the main entry point, which will trigger type checking for all the code that is reachable from main. The same process is also used to show you the type of a variable in IDEs. C# and TypeScript compilers use this architecture.
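
A toy version of the query idea, as a Python sketch: types are computed only when asked for and then memoized, so code that is unreachable from the query is never checked. The `TypeQueries` class and the tiny `("literal", ...)` / `("ref", ...)` definition format are invented for this example and are far simpler than what the C# or TypeScript compilers do.

```python
class TypeQueries:
    """Hypothetical query engine: types are computed on demand and cached."""

    def __init__(self, definitions):
        # definitions: name -> ("literal", type_name) or ("ref", other_name)
        self.definitions = definitions
        self.cache = {}
        self.computed = []  # order in which queries were actually evaluated

    def type_of(self, name):
        if name in self.cache:
            return self.cache[name]
        kind, payload = self.definitions[name]
        if kind == "literal":
            result = payload
        else:
            # "ref": the type is whatever the referenced definition has.
            result = self.type_of(payload)
        self.cache[name] = result
        self.computed.append(name)
        return result
```

Asking for the type of `main` in a program where `main` refers to `x` evaluates `x` and then `main`, and leaves any unrelated definition untouched. An IDE hover would issue the same kind of query for a single variable.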

Out of somewhat prominent compilers, the only one that does not use an AST (from my understanding) is the Tiny C Compiler where the pipeline looks something like this:

tokenization -> assembly generation -> executable
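
To make the AST-free pipeline concrete, here is a small Python sketch in the same spirit. It is not TCC's or Mass's actual code. It compiles arithmetic expressions to instructions for an imaginary stack machine, emitting each instruction the moment the parser recognizes the construct, and it performs no error handling for malformed input.

```python
import re


def compile_expression(source):
    """Single pass: emit stack-machine code while parsing, with no AST."""
    tokens = re.findall(r"\d+|[-+*()]", source)
    position = 0
    code = []

    def peek():
        return tokens[position] if position < len(tokens) else None

    def advance():
        nonlocal position
        token = tokens[position]
        position += 1
        return token

    def factor():
        token = advance()
        if token == "(":
            expression()
            advance()  # consume the closing ")"
        else:
            code.append(("push", int(token)))

    def term():
        factor()
        while peek() == "*":
            advance()
            factor()
            code.append(("mul",))

    def expression():
        term()
        while peek() in ("+", "-"):
            operator = advance()
            term()
            code.append(("add",) if operator == "+" else ("sub",))

    expression()
    return code
```

Every intermediate result lands on the stack because the code is emitted before anything is known about how the value will be used.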


This direct tokens-to-assembly pipeline is also how Mass worked until now. The natural question is: if it works for TCC, why did I introduce an AST after all? What separates Mass from C is the presence of both type inference and function overloading. I was able to push the code quite far even with these challenges, but in the end it was just too hacky and resulted in bad machine instructions being generated. Let's consider the following code:

fn foo(s : String) {}
fn foo(s : s64) {}

fn main() {
  foo(some expression here)
}


When the parser comes across a call to foo it needs to decide which overload to pick. Since the argument is not yet parsed, its type is unknown. In the previous setup, to understand the type we also needed to generate the assembly instructions for the expression, which in turn means putting the result of this expression somewhere, typically the stack. For the String version of foo the argument does need to be on the stack but for the integer version it is wasteful.

The new approach is to parse the code first into a series of typed thunks (closures) that, when forced (called), generate the assembly instructions for the operation. Knowing the type allows picking the right overload before generating any instructions, which in turn makes it possible to put the result of the expression directly into the expected storage for the argument.
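
A rough Python sketch of the thunk idea follows. Everything in it is hypothetical: the `Thunk` class, the `OVERLOADS` table, the `"register"` / `"stack"` storage names and the tuple-shaped pseudo-instructions are stand-ins, not the real Mass data structures. The point is only that the type is available before any code is emitted, so the call can choose an overload first and then tell the argument where to put its result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Thunk:
    type_name: str                        # known as soon as parsing is done
    force: Callable[[str, list], None]    # (target storage, output) -> emits code


def int_literal(value):
    def force(target, out):
        out.append(("load_immediate", target, value))
    return Thunk("s64", force)


def string_literal(text):
    def force(target, out):
        out.append(("copy_string", target, text))
    return Thunk("String", force)


# Overload table for foo: argument type -> (label to call, argument storage)
OVERLOADS = {
    "foo": {
        "s64": ("foo_s64", "register"),
        "String": ("foo_String", "stack"),
    }
}


def call(name, argument):
    # The overload is chosen from the type alone; no code is emitted yet.
    label, argument_storage = OVERLOADS[name][argument.type_name]

    def force(target, out):
        # Only now does the argument generate code, directly into place.
        argument.force(argument_storage, out)
        out.append(("call", label))

    return Thunk("void", force)
```

Forcing `call("foo", int_literal(42))` emits the load straight into the register storage followed by the call to the integer overload, with no detour through the stack.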

I'm generally pretty happy with the new setup, but it is not without problems. The biggest one that I need to consider is that because the code is represented as nested thunks, deeply nested expressions can cause stack overflows during code generation. I might just increase the stack size or switch to a different representation. We'll see.