Eavan's Blog

Should I learn C by writing a C compiler in C? (Part 1)

Published 2025-10-27

If you want to learn many things, is it best to learn them one at a time? Or is it best to find the perfect project that combines all the things you want to learn? Will the cognitive overload be too much to handle? Or will the added struggle just deepen the learning? In an attempt to learn both C and compiler design at the same time, I am unwittingly experimenting on myself to help answer these questions. My intuition says that this will be a disaster, but intuitions must be tested. They are sometimes wrong.

I started this project because of a nagging quote from an old professor.

Do not use tools that you do not understand and could not build yourself!

— My Prof

Generally, I find this advice to be impractical on a day-to-day basis. There are 100s of tools that any developer might need for their day job. Gaining a thorough understanding of them all would be a colossal task. And it's far from guaranteed that any of this knowledge would be useful or would even change how I use these tools. A compiler, however, is a tool that I use 100s of times every (work)day. And I need it. No Golang dev can work without a go compiler. No Java dev can be productive without javac. I don't like needing something I do not understand. And what better way is there to learn than to build?

I chose to build a C compiler because most online resources discuss C compilers, and C is simple compared to other similarly popular languages. But what language to write the compiler in was a difficult decision. When I started this project, I was working as a Java dev, so choosing Java would have taught me nothing new. I had also been looking for an opportunity to learn a bit of C. So it seemed natural to kill two birds with one stone: write a C compiler in C to learn both C and compilers.

My Starting Point

I don't want to give the false impression that I am starting from absolutely nothing. C programming came up during some of my university classes, so I know enough to write simple programs. But there's still a lot more for me to learn.

Regarding compilers, I at least knew that there was no magic. A compiler is just like any other ordinary text manipulation program. It takes in text and spits out something that your CPU can run. In my case, that means spitting out x86-64 assembly in some form that Linux can run.
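
For a sense of what "something that Linux can run" means, here is roughly the smallest x86-64 assembly program Linux will accept (a sketch; assemble and link with as and ld, and all it does is exit with status 7):

.global _start
.text
_start:
    movq $60, %rax      # 60 is the exit system call on x86-64 Linux
    movq $7, %rdi       # exit status
    syscall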

I also knew that most compilers go through a sequence of steps that can be considered independently (a rough end-to-end sketch follows the list):

  1. Lexing: The process of taking raw text and turning it into simpler tokens. For example, simple characters like ( could stay the same, but numbers like -187 could be mapped to a NUMBER token.
  2. Parsing: This step takes the sequence of tokens from the lexer and builds up a data structure which represents the program. This works via a "grammar", which, like the grammar rules of a natural language, defines how the language being compiled should look. For example, a sequence of tokens for a name, a (, another name and a ), could become a data structure which represents a function call taking one argument.
  3. Semantic Analysis: Here we need to check if the result we get from parsing makes any sense. For example, are any variables used before they are defined? Are the types correct? Do different variables share the same name? etc.
  4. Lowering: This is where the compiler takes everything done so far and transforms it into a simpler form that usually resembles assembly. This form is called the intermediate representation, and there might be multiple intermediate representations used by the compiler.
  5. Optimization: The compiler can then analyze the intermediate representation and make certain optimizations. For example, if the compiler sees that a variable is copied multiple times in sequence, then it might be possible to remove some of those copies. Optimization isn't strictly necessary, so I won't be considering any of this.
  6. Code generation: Finally, the compiler transforms the intermediate representation into assembly suitable for running on the current machine. For me this is x86-64 assembly, but on recent Apple machines this would be ARM assembly. I will only consider x86-64 assembly.
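
Here's that promised sketch: a single line of C and roughly what it might look like at each stage. This is simplified and the exact notation is made up, but it shows the shape of the pipeline:

int x = 2 + 3;

Lexing:     INT  NAME(x)  '='  NUMBER(2)  '+'  NUMBER(3)  ';'
Parsing:    DECLARATION(x, ADD(NUMBER(2), NUMBER(3)))
Lowering:   t0 = 2 + 3
            x = t0
Codegen:    movl $2, %eax
            addl $3, %eax
            movl %eax, x(%rip)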

This captures the full extent of my knowledge before starting. It should also be all the knowledge needed to understand the rest of this post. This is not meant to be a "how to" on compilers, but more of a journey of personal suffering and learning. If something doesn't make sense to you, don't fret. It's about the journey, not the destination.

Learning & Lexing & Parsing

Leaving aside the detail that building a compiler for a language you don't know means you don't know what features your compiler will need, a lack of knowledge is no excuse for not starting somewhere. I had skimmed through the great and free Introduction to Compilers and Language Design by professor Douglas Thain, so I at least had a general overview: compilers move through a pipeline that starts with lexing and parsing, then builds some kind of intermediate representation, then some kind of magic happens and your program is compiled. So lexing and parsing was a good place to start.

Prof. Thain suggests using a tool called bison, which is a parser generator in C. Using a parser generator instead of hand-rolling a parser seemed like a good way to get started quickly without getting bogged down in the theory of writing parsers. I read through the example on building a calculator in the bison documentation. To my surprise, the documentation was easy to follow and light on parser theory. With only a little bit of reading and tinkering, I was able to set up a simple calculator which supported infix operators with correct operator precedence. The bulk of the work is done by the snippet below, which defines the grammar.

expression:
    NUMBER
    | '(' expression ')'        { $$ = $2; }
    | expression '+' expression { $$ = $1 + $3; }
    | expression '-' expression { $$ = $1 - $3; }
    | expression '*' expression { $$ = $1 * $3; }
    | expression '/' expression { $$ = $1 / $3; }
    ;
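
On its own, this grammar is ambiguous: bison can't yet tell whether 1 + 2 * 3 means (1 + 2) * 3 or 1 + (2 * 3). The bison calculator example resolves this with precedence declarations near the top of the grammar file, something like the following, where tokens declared on later lines bind tighter:

%left '-' '+'
%left '*' '/'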

The grammar is represented hierarchically. Bigger structures, like expressions, are defined in terms of smaller structures like sub-expressions or simple tokens like ( or NUMBER. Reading the grammar rule in plain English would sound something like this: An expression is either:

  1. A number
  2. An expression surrounded by brackets
  3. An expression followed by a + and then another expression
  4. An expression followed by a - and then another expression
  5. An expression followed by a * and then another expression
  6. An expression followed by a / and then another expression

In this example, expression is defined recursively, with NUMBER as the base case. On the right of each grammar rule lives a little piece of logic for what to do when that rule is parsed: $$ is the value of the rule itself, and $1, $2, $3 refer to the values of its parts, so { $$ = $1 + $3; } adds the two sub-expressions. With just that, I had a working calculator.

$ ./calc
1 + 1
=> 2.00
2 * (3 + 4)
=> 14.00

Just a handful of lines for the heart of a calculator in C! Having something that could run was a major confidence booster. Although a calculator is not a compiler, I now thought I knew enough to tackle parsing for something a little more complicated.

To speed things up a little, I used flex to generate the lexer. It integrates easily with bison and lets me define all my complex tokens with regexes. I didn't need to start completely from scratch either, because much of the expression grammar could be copied from my calculator project. The main new challenge is that instead of having bison do math on floats as soon as it parses each expression, it now needs to incrementally build up the Abstract Syntax Tree (AST). Additionally, bison only lets you work with one type when parsing: the "semantic value type". This means that I had to find some type that could equally well represent loops, expressions or simple integer literals. So I just tried to jam anything I thought could be useful into one struct.

typedef struct astnode {
    struct astnode** children;  /* array of child nodes */
    struct astnode* next;       /* next statement, making this a list of trees */
    enum astnodekind kind;      /* which syntax structure this node represents */
    enum oper op;               /* operator, when the node is an expression */
    type_t litkind;             /* type of the literal, when the node is one */
    union {
        char* string;
        int integer;
    } value;                    /* the literal's value */
    char* ref;                  /* referenced name, for variables */
} astnode_t;

I typedef'ed this struct as astnode_t because that helped me forget that I had no idea what I was doing. Since AST has tree in the name, it seemed right to have a pointer to an array of child nodes, which is a simple double pointer. Later, I would have quite a challenge keeping track of how many children there are for any given astnode_t, because I did not create a field indicating the number of children. But it's good enough for now.
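
For illustration, one way to cope without a count field is to keep the array NULL-terminated, so the length is always recoverable. This helper is a purely hypothetical sketch (and not what my code did), but it shows the bookkeeping involved:

#include <stdlib.h>

/* Append a child, keeping the children array NULL-terminated. Error handling omitted. */
static astnode_t* astnode_add_child(astnode_t* parent, astnode_t* child) {
    size_t n = 0;
    while (parent->children && parent->children[n])
        n++;                                    /* walk to the current end */
    parent->children = realloc(parent->children,
                               (n + 2) * sizeof *parent->children);
    parent->children[n] = child;
    parent->children[n + 1] = NULL;             /* re-terminate the array */
    return parent;
}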

Separate from the array of children is a simple pointer to another "next" astnode_t. I thought I needed this because most programming languages can't easily be thought of as a single tree. Take for example this simple program.

printf("Think of a number between 0 and 10.");
sleep(5);
printf("You're thinking of the number 7!");

Each individual line could be thought of as a tree, with function calls at the root and the arguments at the leaves. But how would the separate lines relate to each other? I decided to think of this as a linked list of ASTs. So I threw in a struct astnode* next; so that my AST is now both a tree and a list at the same time!
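
In bison terms, the wiring might look something like this (the rule names are hypothetical, and I'm assuming the semantic value type is astnode_t*):

statement_list:
      statement
    | statement_list statement {
          astnode_t* tail = $1;
          while (tail->next)
              tail = tail->next;   /* walk to the end of the list */
          tail->next = $2;
          $$ = $1;                 /* the head of the list stays the same */
      }
    ;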

Moving on to the data in each node. Since everything was going to become an astnode_t, I would need some field to disambiguate all the different syntax structures being parsed. I saw that C supported enums, and this seemed like the natural choice.

enum astnodekind {
    LITERAL,
    ASSIGNMENT,
    DECLARATION,
    VAR,
    IF,
    FOR,
    WHILE,
    EXPRESSION
};

Astute readers will notice that there is no entry for things like function calls, or function definitions, or struct definitions. I thought that it was sensible to start with the basics. Then I would come back later to add these features. This never happened.

I would also later learn that in C, enum members are not scoped to the enum. When I learned this, I was already too lazy to bother changing it.
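
In other words, enum constants all land in the same ordinary identifier namespace, so something like this fails to compile (tokenkind is made up for illustration):

enum astnodekind { LITERAL, IF, WHILE };
enum tokenkind   { IF, SEMICOLON };   /* error: redeclaration of enumerator 'IF' */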

Other than that, the astnode_t can hold the type and value of any literal, an operator in the case of parsing expressions, and a named reference ref which could be used for variables. Good enough for now. I could always come back and clean it up or add more fields later. Of course, this also never happened.

I could now use bison to build an astnode_t for each of the syntax structures it parses and then to wire all the pointers together so that I get my linked list of ASTs. Through an excessive (ab)use of helper functions to create various instances of astnode_t, the whole parser was coming together. To verify that I had a mostly working parser, I wrote some logic to print out the parsed AST. And it was _beautiful_.

int total = 20;
for ( int i = 0; i < 10; i = i + 1 ) {
    total = total + i;
}

becomes

DECLARATION
  VAR: total
  20

FOR
  DECLARATION
    VAR: i
    0

  OP:  LT
    VAR: i
    10
  ASSIGN
    VAR: i
    OP:  ADD
      VAR: i
      1

  ASSIGN
    VAR: total
    OP:  ADD
      VAR: total
      VAR: i

Parsing: Done!
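
The printer behind that output is nothing fancy: a recursive walk over each node's children plus a loop over the next pointers. Roughly like this sketch, where astnodekind_name is a made-up helper mapping enum values to strings, and the children array is assumed to be NULL-terminated as in the earlier sketch:

#include <stdio.h>

static void ast_print(const astnode_t* node, int depth) {
    for (; node != NULL; node = node->next) {            /* walk the statement list */
        printf("%*s%s\n", depth * 2, "", astnodekind_name(node->kind));
        if (node->children) {
            for (size_t i = 0; node->children[i]; i++)
                ast_print(node->children[i], depth + 1); /* recurse into children */
        }
    }
}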

As an aside, playing with a parser generator like bison was a lot of fun, and it is surprising how much you can get done with minimal code. If you have never tried it, I would definitely recommend playing around with a parser generator. I've heard good things about ANTLR, which supports many languages like Java, Golang, JavaScript and more.

Seeing the output of my parsed AST massively boosted my confidence. But have I really learned anything, or have I just delegated all the hard work to a clever tool? Going forward, there won't be similar tools to do the heavy lifting.

Find out in Part 2: Semantic Analysis & Lowering