Eavan's Blog

Should I learn C by writing a C compiler in C? (Part 3)

Published 2026-02-02

Confronting Reality (x86-64 Code Generation)

The Pareto Principle states that 80% of results come from 20% of the effort. The mistake is then to look at your impressive results (from part 2) and conclude that you are 80% done. The last 20% of results takes the remaining 80% of your effort. You're not even halfway to the halfway point.

Anyway! Now that I had an intermediate representation that looked like a real assembly language when squinting from a distance, there was only one real step remaining: to turn my abstract instructions into real x86-64 assembly instructions.

Perhaps there exists a 1-to-1 mapping of my abstract instructions to real x86-64 instructions.

— An optimistic version of myself

There are a few barriers to this:

  1. x86-64 machines only have 16 general-purpose registers. Any code compiled down to my intermediate representation would use more than this because I assumed I had access to an infinite number of registers.
  2. Magic isn't real and PHI instructions don't exist.
  3. Some of my abstract instructions might have been a bit too abstract, and their functionality doesn't cleanly map onto a single x86-64 instruction. The main issue here was my choice to define a single conditional branch abstract instruction that reads a register and jumps to a location depending on the value in the register. With this one abstract instruction I could represent all conditional branching. I would need to find a way to map this one abstract instruction onto the many x86-64 conditional branch instructions that depend on flags (jne, jge, jle, etc.). My abstract instruction set has no concept of flags, so this was going to be a messy translation.
  4. At the time of starting this portion, I knew virtually nothing about x86-64.

The first two sounded rather daunting (I only became aware of issue 3 after dealing with issue 4). So I tackled issue 4 first and set forth to learn a little bit of x86-64 assembly.

At this point in the project, I was tired and lazy. I searched for the 100 most commonly used instructions and quickly scanned through the list, comparing them against what I had defined for my abstract instruction set. I quickly found all the mathematical instructions I would need and noticed my issue with the single branching instruction. Luckily, I spotted a "niche" instruction, SETcc, which can set a register based on the value of certain flags. This would let me bridge the gap between my flag-free abstract instructions and real x86-64 and solve my branching issue.
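
To make that concrete, here is a sketch of the kind of lowering SETcc enables. The function names and register choices are my own invention for illustration, not the compiler's actual code: an abstract "less than" result is materialized into a register with cmp + setl + movzbl, and an abstract "branch if register is non-zero" becomes test + jne.

```c
#include <stdio.h>

/* Sketch: lower "dst = (a < b)" without the IR ever knowing about flags.
 * The comparison result is turned back into a plain 0/1 value. */
int lower_less_than(char *buf, size_t n, const char *a, const char *b) {
    return snprintf(buf, n,
        "\tcmpl\t%%%s, %%%s\n"     /* flags := a - b */
        "\tsetl\t%%al\n"           /* al := 1 if a < b (signed) */
        "\tmovzbl\t%%al, %%eax\n", /* zero-extend the 0/1 into eax */
        b, a);
}

/* Sketch: lower "branch to label if reg != 0". */
int lower_branch_if(char *buf, size_t n, const char *reg, const char *label) {
    return snprintf(buf, n,
        "\ttestl\t%%%s, %%%s\n"    /* set ZF from reg */
        "\tjne\t%s\n",             /* jump when reg was non-zero */
        reg, reg, label);
}
```

The cost of this flag-free design is the extra setl/movzbl/testl round trip that a smarter compiler would fold into a single cmp + jl.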

I also read some of the assembly output from gcc -S loop.c for a handful of short C programs I wrote.

.L4:
	movl	-4(%rbp), %eax
	andl	$1, %eax
	testl	%eax, %eax
	jne	.L3
	movl	-4(%rbp), %eax
	addl	%eax, -8(%rbp)
.L3:
	addl	$1, -4(%rbp)
.L2:
	movl	-4(%rbp), %eax
	cmpl	-20(%rbp), %eax
	jl	.L4

This fragment is the assembly for a simple for loop.

for (int i = 0; i < iters; i++) {
    if (i % 2 == 0) {
        sum += i;
    }
}

Reading through snippets like this and cross-referencing what each of the instructions was doing was actually quite an effective method for getting up to speed with the basics. It took a long time to learn how to read the different addressing modes and what the different "general purpose" registers typically mean (e.g. rbp typically points to the base of the current stack frame). But always having a concrete example of some assembly I wanted to understand made it easier to stay focused on what was definitely necessary and what could be skipped.

For anyone who wants a hint on how to read the assembly: .L4 is the main body of the loop, .L3 increments the loop variable, and .L2 checks the loop condition.

With an absolutely bare-bones understanding of x86-64, I moved my attention to the PHI abstract instructions, which have no analogue in the x86-64 instruction set. Luckily, I found a nice paper: Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency. This paper claims that removing PHI instructions is a tricky, error-prone process. However, the same paper offered me a lifeline.

Some compilers restrict SSA to CSSA (conventional SSA), as going out of it is straightforward. In CSSA, all SSA variables connected (possibly by transitivity) by phi-functions have non-overlapping live ranges. They can thus be all replaced by the same name without changing the semantics of the code, like with code obtained just after SSA construction.

— B. Boissinot, et al.

Just replace all virtual registers involved in a PHI instruction with one single register! Easy! The paper mentions that maintaining conventional SSA makes writing optimizations much harder. This was no barrier for me, as just getting this thing to work at all would require a small miracle. Optimizations can be added later (never).
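
As a sketch of what that replacement can look like (the names and data layout here are illustrative, not my compiler's actual code), a union-find over register names collapses every PHI-connected register, transitively, into one representative name:

```c
/* Sketch of out-of-CSSA translation: every virtual register that
 * participates in a PHI (directly or transitively) is merged into one
 * representative name via union-find. */

#define MAX_REGS 64
static int parent[MAX_REGS];

static void init_names(int n) {
    for (int i = 0; i < n; i++) parent[i] = i;  /* every register starts alone */
}

static int find(int r) {
    while (parent[r] != r) {
        parent[r] = parent[parent[r]];  /* path halving */
        r = parent[r];
    }
    return r;
}

static void unite(int a, int b) { parent[find(a)] = find(b); }

struct phi { int dest, src1, src2; };

/* Merge the names of all PHI operands with their destination.  A later
 * pass can then replace every register r by find(r) and drop the PHIs. */
void coalesce_phis(const struct phi *phis, int n) {
    for (int i = 0; i < n; i++) {
        unite(phis[i].src1, phis[i].dest);
        unite(phis[i].src2, phis[i].dest);
    }
}
```

The transitivity mentioned in the quote is exactly what union-find gives for free: if r5 = PHI(r2, r4) and r7 = PHI(r5, r6), all five registers end up with the same representative.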

This left the final issue: how to reduce my use of registers from infinite to 16? I was aware of two methods:

  1. Linear Scan
  2. Graph Colouring

My only concern was simplicity, so linear scan stood out to me because it had the word linear in it. I had a linear intermediate representation, so a linear scan seemed like a great fit. However, after reading Linear scan register allocation on SSA by Max Bernstein and Linear Scan Register Allocation in the Context of SSA Form and Register Constraints, I just could not get it to click. Also, all the descriptions of the algorithm that I read assumed SSA form, but by this point I had already destroyed my SSA form by removing the PHI abstract instructions. I must be missing something simple, but I don't know what.

I gave graph colouring a chance to see if it was any easier. Some helpful slides from EPFL gave me the rough idea. To my surprise, the idea is not as complicated as it first sounds.

  1. Create a graph where every virtual register is a node in the graph.
  2. If two virtual registers are "live" at the same time, then create an edge between them.
  3. Colour the graph with an algorithm of your choosing.
  4. Assign each colour to a physical register.

Putting aside what it means for a virtual register to be "live", much of the complexity of graph colouring approaches seems to come from step 3. The common algorithm to use is Chaitin-Briggs, but I could just use simple greedy colouring.
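
A minimal sketch of greedy colouring over an adjacency matrix (illustrative code, not my compiler's; it also doesn't handle the spill case where more than 16 colours would be needed):

```c
/* Greedy graph colouring sketch: visit nodes in order and give each one
 * the smallest colour not already taken by an earlier-coloured neighbour.
 * N is the number of virtual registers (kept tiny for illustration). */

#define N 4

int greedy_colour(const int adj[N][N], int colour[N]) {
    int max_used = 0;
    for (int v = 0; v < N; v++) {
        int used[N + 1] = {0};
        for (int u = 0; u < v; u++)   /* only earlier nodes have colours */
            if (adj[v][u]) used[colour[u]] = 1;
        int c = 0;
        while (used[c]) c++;          /* smallest free colour */
        colour[v] = c;
        if (c + 1 > max_used) max_used = c + 1;
    }
    return max_used;                  /* number of colours used */
}
```

Each colour then maps to one physical register, so a result of 3 here would mean the whole program fits in three registers.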

This only leaves the question of what it means for a virtual register to be "live". In simple terms, a virtual register is live from when it is defined until the last time it is used. If two virtual registers are never live at the same time, then they can both be assigned to the same physical register. To calculate liveness, just traverse your program in reverse. All virtual registers start as dead. The first time you see a virtual register being used, it becomes live. Then, when you reach its definition, it becomes dead again.
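
The reverse traversal can be sketched like this (a single basic block only, with an instruction encoding I made up for illustration):

```c
/* Backward liveness scan sketch.  Walking instructions in reverse:
 * an instruction's definition kills its register, its uses make their
 * registers live.  Two registers interfere if ever live at once. */

#define MAX_REGS 8

struct inst {
    int def;      /* register defined by this instruction, or -1 */
    int use[2];   /* registers read, -1 for unused slots */
};

/* live_at[i][r] = 1 iff register r is live just before instruction i. */
void liveness(const struct inst *code, int n, int live_at[][MAX_REGS]) {
    int live[MAX_REGS] = {0};            /* everything starts dead */
    for (int i = n - 1; i >= 0; i--) {
        if (code[i].def >= 0)
            live[code[i].def] = 0;       /* killed by its definition */
        for (int j = 0; j < 2; j++)
            if (code[i].use[j] >= 0)
                live[code[i].use[j]] = 1;  /* a use makes it live */
        for (int r = 0; r < MAX_REGS; r++)
            live_at[i][r] = live[r];
    }
}
```

Note the order inside the loop: the definition is killed before the uses are added, so an instruction like r0 = r0 + 1 correctly keeps r0 live above itself.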

Since I was no longer in SSA form, a virtual register might have multiple definitions. But this would only happen on separate code paths; on each individual code path, a virtual register is still only defined once.

After computing which virtual registers were live at which points, I could build the graph and start adding in the edges. To build a graph in C, I created a struct to represent a graph node (gnode) and a struct for the whole graph (reg_graph). Each graph node stores a list of pointers to its neighbours, and the reg_graph struct has pointers to all the gnodes for easy access. The end result looks like this:

struct gnode {
    int ssa_num;
    struct reg_names* assignment;
    int num_neighbours;
    int neighbour_capacity;
    struct gnode** neighbours;
};

struct reg_graph {
    int num_nodes;
    int capacity;
    struct gnode** nodes;
};

struct reg_graph graph;

Again, with a simple global variable called graph because I've completely given up on keeping my namespace clean. A bunch of helper functions for adding nodes and edges complete my simple graph implementation.
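
For illustration, here is one plausible shape for the edge helper (I've dropped the assignment field since struct reg_names isn't shown here, and the real helpers may well differ):

```c
#include <stdlib.h>

/* The structs from above, minus the assignment field. */
struct gnode {
    int ssa_num;
    int num_neighbours;
    int neighbour_capacity;
    struct gnode **neighbours;
};

/* Append b to a's neighbour list, doubling capacity when full. */
static void push_neighbour(struct gnode *a, struct gnode *b) {
    if (a->num_neighbours == a->neighbour_capacity) {
        a->neighbour_capacity =
            a->neighbour_capacity ? a->neighbour_capacity * 2 : 4;
        a->neighbours = realloc(a->neighbours,
                                a->neighbour_capacity * sizeof *a->neighbours);
    }
    a->neighbours[a->num_neighbours++] = b;
}

/* Interference is symmetric, so an edge is stored on both endpoints. */
void add_edge(struct gnode *a, struct gnode *b) {
    push_neighbour(a, b);
    push_neighbour(b, a);
}
```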

I can now build and colour my graph. Since every graph node represents a virtual register, I can replace that virtual register with the physical register represented by that colour. Despite using greedy colouring, my test cases typically only used three physical registers. This was probably because of the many redundant MOV instructions, which made the graph simpler.

There was now one optimization too simple to ignore: removing MOV instructions where the source and destination registers were the same. Given that my test programs used very few registers, these types of MOV instructions were more common than I would have initially guessed.
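
A sketch of that cleanup pass (the instruction encoding here is invented for illustration; after register assignment, src and dst are physical register numbers):

```c
enum op { OP_MOV, OP_ADD };

struct inst { enum op op; int src, dst; };

/* Compact the instruction array in place, dropping MOVs whose source
 * and destination collapsed onto the same physical register.
 * Returns the new instruction count. */
int drop_self_moves(struct inst *code, int n) {
    int out = 0;
    for (int i = 0; i < n; i++) {
        if (code[i].op == OP_MOV && code[i].src == code[i].dst)
            continue;                 /* mov %rax, %rax does nothing */
        code[out++] = code[i];
    }
    return out;
}
```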

With all the parts in place, the moment of truth arrives. I assemble my assembly output with GNU as and run it. All my work has built up to this, and after running my compiled program I see a familiar friend: SEGFAULT!

That was quite surprising to see, given that my test program (shown below) does not allocate any memory. What was actually happening is that code execution did not stop when my program ended. It kept executing whatever came next until it hit some invalid region. Given that the CPU was executing complete garbage, I am lucky that a segfault was the only issue I ran into.

int total = 20;
for ( int i = 0; i < 10; i = i + 1 ) {
    total = total + i;
}

while ( ! ( total <= 3 ) ) {
    total = total - 1;
}

int result = total * 2;

I also learned a bit of trivia: execution of C programs does not start at main! In most cases, it actually starts at _start. For C programs, this start routine is typically provided by the C runtime and can be found on many systems as crt0.o or crt1.o. From _start, the C runtime handles some simple initialization, sets up argc and argv, and calls main. It also handles the return from main and passes the exit code to the operating system.

That last part was also the solution to my segfault issue: I needed to manually pass control back to the operating system. Since I only cared about Linux, this meant that I needed to call the exit syscall. To call exit, I needed to put the value 60 (the syscall number for exit on x86-64) into register rax and then execute the syscall instruction. The exit code is taken from the rdi register.

So at the end of my code generation pass, I appended a snippet to exit gracefully with exit code 0.

	movq	$60, %rax
	xorq	%rdi, %rdi
	syscall

And with that my segfault was solved, and I had a working executable!

But sadly I can't see what my program is doing. I decided to add a final step to link against libc with ld and then add some extra instructions to print the value in rax.

With that final piece in place, I compile my compiler.

I compile a test case with my compiler.

I run it.

The result?

Pure bliss.

Was There A Point To All This?

If I can be honest with myself, this is clearly not a C compiler. There are no functions, no structs. Pointers don't exist at all. There's no stack, so if I run out of registers then the compiler crashes. At best, I could describe it as a compiler-calculator hybrid which supports loops, conditionals, and variables.

It's not entirely clear to me when all of these features fell out of scope. In my attempt to make progress towards an end goal, everything that wasn't absolutely essential to capturing the spirit of what I imagined as a compiler was cut out. I recall noting that I could always come back later and add features to the parser or to the generation of my intermediate representation. But all software engineers know that later = never and temporary = forever. And by the time I got an executable output from my "compiler", I felt that the experiment was over.

So I didn't create a C compiler, but perhaps I learned C? Maybe. I've certainly learned something about writing C code. Chasing down segfaults does not scare me as much as it used to. My skills with related tools like gdb, gcc, ld and as have all improved drastically. But the immense mental effort of trying to learn how to write the compiler meant that I didn't put much care into making the C code as nice as it could have been. Simply solving the current problem without crashing was good enough to move on.

However, pushing myself to solve a problem gets at the heart of what it means to learn. To be able to solve a novel hard problem, even if the solution just barely works, does require learning. While I can't say that I have learned C, I am quite happy to say that I know much more about C and compiler design than when I started.

Contemplating the original question of whether it would have been better for me to learn C and compiler design separately: now that the challenge is done, I can appreciate that the difficulty was entirely the point. Since I didn't completely fail, I could have set the bar even higher before reaching complete failure. And even complete failure would entail learning something.

I think it was worth it to learn both at the same time, purely from the perspective that it let me sense where my limits are. Perhaps it even pushed my limits just a little bit higher.

Peter Norvig's seminal Teach Yourself Programming in Ten Years comes to mind when reflecting on the title of this blog post. I can't completely learn something in a single project, no matter how challenging it was for me. But I can let it be a part of my "Ten Years".

Appendix: A Collection of Code Snippets

case COMMAND_PHI:
printf("PHI\t");
printf("[@%d !%d] ", inst.reg_arg_1, inst.block_id_1);
printf("[@%d !%d]\n", inst.reg_arg_2, inst.block_id_2);
continue; // Funny hack

Instead of having a break; at the end of this switch case, I have a continue;. I can't quite remember what's so funny about it.

int varreg = find_reg(node->ref)->reg;
instruction_t* varinst = malloc(sizeof(instruction_t));
varinst->reg_out = varreg;
return (struct continuation){varinst, current};

At just one spot I couldn't find a way to avoid allocating memory in a recursive function. Instead of keeping track of these allocations and freeing them later, I just accepted the leakage. Memory will be reclaimed after an inevitable segfault regardless.

fprintf(f, "\tmovq\t%%rax, %%rsi\n");

I wanted my programs to print what they compute so that I could see some output. It always seemed like the value of my last defined variable would live in rax. So the last variable is probably the result and it probably lives in rax. Just print that. It worked for all my test cases and my interest didn't extend beyond that.

if (inst.source.t != SSA && inst.dest.t != SSA) {
    // Maybe this doesn't even matter
    // struct liveness li = {iid, clone_int_set(cur_live)};
    // add_liveness(li);
    continue;
}

This commented code is still here. I wonder if it did matter.