Should I learn C by writing a C compiler in C? (Part 2)

Checking My Sanity (Semantic Analysis and Type Checking)

With my first step taken, my confidence was sky high and I was ready to take my second step. I had my AST, it would be good to check if it actually meant anything. And if so, what? What is the semantics of the program I had parsed. This sounds quite abstract, but as far as I was concerned, the only things that are absolutely essential are checking what symbols are distinct and checking types. Maybe I forgot something, but if so, it probably wasn't important... I hope.

The problem of checking the distinctness of symbols can be illustrated by the code bellow

int x = 0;
for (int i = 0; i < 10; i++) {
    char* x = "This is an absurd example."
}

In this case there are two different variables called x and they have different types. This means that I can't just do that easy thing and refer to symbols via their names. I have to think about things like scope and keep a table of all the symbols I see in my AST. A symbol table.

This marks the first time that I had to create my own list-like datastructure. I kept the first thing that worked. And I end up with the following.

typedef struct {
    char* name;
    type_t type;
} symbol_t;

typedef struct symtab {
    int len;
    int capacity;
    symbol_t* symbols;
    struct symtab* prev_scope;
} symtab_t;

symtab_t* symtab;

First there's a simple symbol struct to hold the name and a type. Then a little symtab struct to hold a list of symbols in the current scope with a pointer back to the outer scope. The variable len should store how many symbols there are in the symtab and capacity tracks how much memory I've allocated.

If the len is about to exceed capactity, then I need to allocate more memory. Luckily, there is a handy dandy function for expanding existing allocations: realloc. This just takes the old pointer and the new size and handles any memory copies that may or may not be necessary.

Because I didn't care much for the hygiene of my namespace, I decided just to create a global variable symtab which points to the symbol table for the current scope. Then I could just create a little army of helper functions that all mutate this global variable.

When discussing programming with others, I often espouse the virtues of functional programming and avoiding mutation. The swift disappearance of my principles was one of the first signs of mental strain that this project was putting on me. There are probably some reputable programmers out there that would defend this approach. Or at least that was what I was telling myself at the time. That was good enough for the version of me that just wanted to keep moving forward.

Since it seemed possible, I decided to build my symbol table and check types in one big recursive function that walks over all the AST nodes. Luckily, the C type system is quite simple. In fact, calling it a type system is really a bit of a stretch. Types are always explicitly annotated, so I don't need to do any type inference and the types of any expression can be determined from the types of its children. Easy!

So I make a nice recursive function that when given an AST node returns its type. I can consider every astnodekind separately. For literals, just return the literal type which was determined by the parser. For variables, look up the type in the symtab and return that. For expressions, check the types of the child AST nodes by calling the function recursively and follow some simple rules to determine the result type. E.g. INTEGER + INTEGER = INTEGER and INTEGER > INTEGER = BOOLEAN (Note that C does not have booleans, but I didn't know this and it's already supported in my AST and parser so I just kept it). For larger structures like for loops and if statements I just check the types of their children and return a special VOID type. And finally, for variable declarations, I compare the declared type with the type of the expression on the right hand side and store the type in the symtab.

Some care needs to be taken to call the symtab helper functions to add and remove scopes as we pass in and out of blocks from for loops or similar. But this works quite well. A fragment of this function is shown below.

type_t check_ast(astnode_t* node) {

    ...

    case DECLARATION:;
        astnode_t* ident = node->children[0];
        t = check_ast(node->children[1]);
        if (t != node->litkind) {
            printf("Declared type %d does not match given expression %d\n", t, node->litkind);
            exit(1);
        }
        add_symbol(ident->ref, node->litkind);
        break;

    ...

    case WHILE:
        new_scope();
        t = check_ast(node->children[0]);
        if (t != BOOLEAN) {
            printf("Non boolean expression in while loop\n");
            exit(1);
        }
        check_ast(node->children[1]);
        pop_scope();
        break;

    ...

    case VAR:;
        symbol_t* sym = find_symbol(node->ref);
        if (sym == NULL) {
            printf("%s is used before it is defined\n", node->ref);
            exit(1);
        }
        return sym->type;

Since I kept the boolean types, I might as well require that only booleans are used in while loops and if conditions. But in doing so, it seems I've already implicitly given up on making a real C compiler. Oh well!

Also, when writing this code, I started to regret how I decided to structure astnode_t earlier. From looking at the code above, I have no idea what each of the children are. I just have to remember that node->children[0] is the condition for a while loop and node->children[1] is the body. Plenty of room to make mistakes here! But it worked well enough.

Semantics and type checking: Done!

Making Stuff Up (Intermediate Representation)

A long time ago, I watched a presentation by an LLVM maintainer. He said:

LLVM IR is always in SSA Form! There's no exception!

— Johannes Doerfert

I don't know what SSA form is, but I know that these LLVM guys know something about compilers. So probably it's worth looking into. Being honest, SSA form had only a minor role to play in making my implementation more complicated. The relevant part is that SSA form is constraint you can put on a Linear Intermediate Representation (linear IR) and the concept of a linear IR was the actual important part for me.

In its essence, a linear IR is a stepping stone between the AST and the final generated assembly. It's a simplified abstract instruction set that captures the essence of assembly without needing to worry about details like which registers to use. To my understanding the term linear refers to the key property that abstract instructions execute in sequence from top to bottom (unless there's a branch abstract instruction). You could compare this with another popular intermediate representation: a control flow graph. Which is obviously a graph and not very "linear".

I could have used a well established implementation of a linear IR, like the one used in LLVM or GCC. However, that would mean one more thing to learn and I was not in the mood for that. By creating my own abstract instruction set, it meant that I could make things up as I go along. If I couldn't figure out some detail, I could always add a special case instruction to make it easier (pushing the fundamental problem to future me). This flexibility was too powerful to give up.

So in my abstract instruction set I decided to give myself an infinite number of registers. I created just the bare minimum of instructions for arithmetic and a conditional branch instruction. As opposed to assembly, these instructions never mutate their arguments. This means that most instructions use three registers, two inputs and one output. I also stole some ideas from LLVM and decided to organize instructions into groups called "blocks". Each block contains at most one branch instruction at the end. This means that all control flow happens between blocks and execution within a block always goes from top to bottom. Below is an example from one of my test cases.

block: 7
        @20 =   MOV     #1
        @21 =   SUB     @14 @20
        @22 =   MOV     @21
                JMP     !6

Because I was unconstrained by societal norms, I used @ to refer to registers and not %. I also use ! to refer to blocks in branch or jump instructions. This example subtracts 1 from the value in register 14 and stores the result in register 21. It then moves that value into register 22 before jumping to block number 6. I know that both MOVs are unneeded, but asking for efficiency is asking for too much right now. Since I have given myself an infinite number of registers, I just create fresh new ones for each result.

This brings us neatly back to SSA form. Which stands for Single Static Assignment. This constrains a linear IR to only assign a register once, after which it is immutable. Apparently, this makes it easier to write many kinds of optimizations. Something I never did and therefore got no benefit from. What I did experience is the new challenge that it brings. Consider this code snippet below.

int x = 0;
x++;

Since SSA requires that registers are immutable, each operation on x must get a new register. This is not a huge pain, since I have a good amount of registers to work with. The real pain comes when we mix in conditional branches.

int x;
if (condition) {
    x = 0;
} else {
    x = 1;
}
int y = x + 1;

I cannot use the same register for the variable x in both branches of the if statement, because each register can only have one assignment. Then when generating instructions for int y = x + 1, which register should I use to refer to the current value of x? I would need to know which code path would be taken during execution. Which is impossible!

The "solution" is to introduce a magical instruction, typically called PHI. This instruction magically knows which code path was taken. It takes two registers which come from two separate code paths as input and outputs the register from the code path that was executed. This means that we can use PHI to coalesce the two registers that represent x into one new register without violating the SSA requirement! So the above if statement could become something like what's shown below.

block: 13
        @17 =   MOV     #0
                JMP     !15
block: 14
        @18 =   MOV     #1
                JMP     !15
block: 15
        @19 =   PHI     [@17 !13] [@18 !14]
        @20 =   ADD     @19 #1

Blocks 13 and 14 are the true and false branches of the if statement which store their updates to x in registers 17 and 18 respectively. The PHI then moves the correct value into register 19 depending on if we jumped from block 13 or 14. Another minor restriction of SSA form is that PHI instructions can only come at the start of a block, but this is not an issue.

Now that I've got a general idea of what I'm going for. There's still the question of how to implement any of this. Since recursively walking the AST worked so well for me last time, I decided to try forcing the same solution again. Surprisingly it worked.

The idea was to create more global mutable state. When processing an AST node, I could add abstract instructions to this global datastructure. Then the return value of each recursive call could just be the register that contains the result of whatever was computed in the child nodes.

For expressions, with works remarkably well. The code below handles appending abstract instructions via the append_cmd helper function for literals, addition and subtraction. The append_cmd helper always returns the register that holds the output.

struct continuation emit_ir_helper(astnode_t *node,
                                   basic_block_t* current) {

        ...

    switch (node->kind) {
        case LITERAL:
            reg = append_cmd(current,
                             COMMAND_MOV,
                             node->value.integer,
                             0,
                             TRUE,
                             FALSE);
            return (struct continuation){reg, current};
        case EXPRESSION:
            instruction_t* left =
                emit_ir_helper(node->children[0],
                               current
                ).expr_value;
            instruction_t* right =
                emit_ir_helper(node->children[1],
                               current
                ).expr_value;
            switch (node->op) {
                case ADD:
                    reg = append_cmd(current,
                                     COMMAND_ADD,
                                     left->reg_out,
                                     right->reg_out,
                                     FALSE,
                                     FALSE);
                    return (struct continuation){reg, current};
                case SUB:
                    reg = append_cmd(current,
                                     COMMAND_SUB,
                                     left->reg_out,
                                     right->reg_out,
                                     FALSE,
                                     FALSE);
                    return (struct continuation){reg, current};

        ...

I'm not sure if the code above illustrates anything insightful. But I want to show the terse rubbish code I actually wrote and used rather than rewrite this snippet as something that's cleaner and easier to read. That way you can get a better sense for the amount of time I was spending on writing good C versus the time I was spending on solving the problem at hand.

For variable assignments and references, I need a datastructure to keep track of which registers correspond to which variables. I copied the dynamically expanding array code that was used for the symbol table and tweaked the logic to track registers instead of types. Different structs has different sizes. But of course, when copying code, I forgot to change the amount of memory being allocated by realloc. But this wasn't an issue until much later when I started using larger test cases that actually required expanding the array. This would corrupt my heap. What should have been read-only datastructures were changing randomly and new calls to malloc or realloc would fail with some arcane error about the free list being damaged. Fun! I now always triple check what I put in calls to sizeof().

For control flow, the idea is the same, but in addition to appending instructions, I will also need to create new blocks. For example, this is what generating abstract instructions for an if statement looks like.

struct continuation emit_ir_helper(astnode_t *node,
                                   basic_block_t* current) {

        ...

        case IF:
            basic_block_t* true_branch = new_block();
            basic_block_t* false_branch = new_block();
            basic_block_t* next = new_block();

            // result of the if condition is put in reg
            reg = emit_ir_helper(node->children[0],
                                 current
            ).expr_value;

            // Add a branch instruction that goes to correct block of the if statement
            append_br_cmd(current,
                          reg->reg_out,
                          true_branch->id,
                          false_branch->id);

            // Generate code for the true and false branches
            struct continuation true_cont =
                    emit_ir_helper(node->children[1],
                                   true_branch);
            struct continuation false_cont =
                    emit_ir_helper(node->children[2],
                                   false_branch);

            // Add jumps at the ends of the two blocks to the "next" block
            append_cmd(true_cont.control_flow_cont,
                       COMMAND_JMP,
                       next->id,
                       0,
                       FALSE,
                       TRUE);
            append_cmd(false_cont.control_flow_cont,
                       COMMAND_JMP,
                       next->id,
                       0,
                       FALSE,
                       TRUE);

            // next is the new current block
            current = next;
            break;

        ...

Except this time I am lying about my code. The real code doesn't look nearly as simple. Because of those lovely PHI instructions, I needed to keep track of all modifications to variables in order to know which PHI instructions were required. When handling loops this was even more complicated because I then also needed to generate PHI instructions before the loop condition code is generated. This required knowing which variables would be modified before even generating the instructions for either the loop condition or the loop body. In a mad desperate hack, I generated instructions twice! Once to know what variables will be modified, and then a second time with the PHI instructions in place.

The C gods rewarded my descent into insanity. After resolving some simple segfaults the above strategy actually worked. As far as I could tell at least, since I wont get to see any real results until I have real assembly.

At this point, I feel like I can see some light coming from the end of the tunnel. I've written a bit more than a thousand lines of C by this point which is a nice round number to cross. Given that I have something that looks a little bit like assembly already, there shouldn't be too much work left, right?

Find out next time in Part 3 on Code Generation.