# Writing Compilers

• Continuing the discussion from New Masters Program in Software Engineering:

The best of the best are people who have taken such classes. The best person I have ever worked with (friend and occasional business partner) was still in school when taking CS classes still meant that you would be building a compiler as part of a keystone assignment. That seems to have went away. Wherever the student's education stops, the rest of it may as well be magic. The new Informatics degrees are releasing students that have no freaking clue what happens after the code hits the interpreter and that is just way too high of a level to stop your learning.

Just what is involved in writing a compiler? Is it something that could be taken on as a spare time project?

I go by the adage that you should be at least passably familiar with one level of abstraction below what you're working in. If you use C# you should know a bit about memory management, if you're writing SQL, have some idea how a query optimiser does its thing etc. I guess the ultimate step in this process would be the ability to write a compiler for your chosen language.

Can it be done as a hobby for a modern language, or do most compiler writing classes focus on implementing a segment of the spec/a very simple language? Would it have to be done in assembler? Would it have to output assembler? (I guess a C#/Java compiler would output CLI/JVM bytecode)

(p.s. s @Polygeekery I gather this sort of low level stuff s isn't really your forte. I'm only quoting you because you were the last to talk about it. Answers welcome from anybody)

• Get this book, it's awesome:

It takes you though (emulated) hardware, then into writing an assembler and higher level compiler.

• I had one compiler class, it was really more about implementing a programming language than compiling though. We worked on a compiler and stack machine built in Java, which consumed a C-like programming language.

Basically we wrote a stripped-down JVM which executed on the JVM. Still was an interesting course though.

• This is the approach I'd suggest as well -- have them start from the instruction set level upwards, basically (CPU microarchitecture is a bit of a different beast due to its sheer complexity all on its own).

• I wrote a compiler for this language: http://www.cs.uwm.edu/~cs654/handouts/cool-manual.pdf

It's not actually as difficult as it sounds to make a compiler. You basically use a program to compile your syntax description into code that parses the syntax, then you write code to do some semantic stuff like figuring out which variable names refer to the same variable, and then you generate code by walking the parse tree and printing whatever your low level language is.

You can optionally throw in some optimizations as well, but you don't need to.

• My compliers course had a group project for the main output where we were handling a tiny subset of java but you were required to have some additions to the minimum language spec. So we had to do a lexer, some optimizer things, and then final output (IIRC most did output for the JVM to run but not everyone). Will have to dig through stacks to find what textbook we used.

But yeah it is something you could do (course I took was 10 week 400 level). Start with a subset of what to handle first though as you can then add on capabilities later.

• Is it something that could be taken on as a spare time project?

If that is something you want to learn in order to advance yourself, then absolutely it could be. You might want to get a start by looking over the source code to various compilers. I know the GCC source code is freely available, but I am sure there are others. https://gcc.gnu.org/snapshots.html

I go by the adage that you should be at least passably familiar with one level of abstraction below what you're working in.

I would say more than that. Sadly, I know more about how shit works inside of a CPU than most fresh graduates, and as you (correctly) mentioned:

@Polygeekery I gather this sort of low level stuff s isn't really your forte.

You, sir, are a master of understatement.

I think a balanced person should have a passing understanding of how most things function right down to the bare metal. What does a computer do? At its most basic level? It adds, compares or moves numbers. Subtraction is a function of addition. Multiplication is a function of addition. Division is a function of subtraction, which in turn is a function of addition. I have a family member who went through an Informatics program and graduated without knowing that.

Would it have to be done in assembler?

Nope. Self-hosting compilers are compilers written in the language that they are to compile. Like Inception, sort of. Initially, a new language requires a compiler written in another language, and the first ones were written in Assembly, but now you can write a compiler in nearly any language, for nearly any other language.

Would it have to output assembler?

Not necessarily. Some compilers will cross-compile to another lower-level language (like C) and then compile that. In that way, they function somewhat like an interpreter feeding their output into a compiler.

I am no guru, but I hope some of this helps.

• You might want to get a start by looking over the source code to various compilers. I know the GCC source code is freely available, but I am sure there are others. https://gcc.gnu.org/snapshots.html

One problem with that -- the GCC code is pretty hideous still in places (don't go anywhere near reload.c!). LLVM might be more useful for your purposes, although it's still complicated.

I think a balanced person should have a passing understanding of how most things function right down to the bare metal. What does a computer do? At its most basic level? It adds, compares or moves numbers. Subtraction is a function of addition. Multiplication is a function of addition. Division is a function of subtraction, which in turn is a function of addition. I have a family member who went through an Informatics program and graduated without knowing that.

Yikes!

Would it have to output assembler? (I guess a C#/Java compiler would output CLI/JVM bytecode)

LLVM bitcode is another option if you wish to focus solely on the frontend; otherwise, you can do dumb assembler or CLR/JVM bytecode gen.

• I had a three compiler-related courses. The first two built on each other, first covering lexing/parsing (grammars, building an AST), and the second one was more about type checking and turning the AST into actual code (asm, and java byte code, plus some optimizations). These two recommended the dragon book. It's been a while, but IIRC it covers most of what you need to know. (The last course I attended much later, and was all about optimization.)

Writing a compiler dragon book style is quite achievable. Essentially you start of with lexing, then bolt on a parser, then have the parser store its results in an AST. At that point you can already write a (somewhat inefficient) interpretor. If you use the proper tools (flex or an equivalent for lexer-generation, bison or an equivalent for parser-generation), it's surprisingly little work. IIRC I wrote that part from scratch in about one week in the evenings/nights (planning ahead as a student? lol.).

(I'd also recommend writing a compiler/interpretor at least once, you learn a lot of stuff that you'd never think about otherwise. Also, knowing how to do proper lexing and parsing is helpful - it'll help realize how much of an abomination the Discourse-markdown "parser" really is ).

• It used to be traditional to write a Scheme compiler in school, didn't it? Which is why there's approximately 1,001 different Scheme compilers available on the net.

Due to the language's simplicity, you don't need to spend much time on a lexer or grammar and you can focus on the interesting parts of a compiler...

• It's not actually as difficult as it sounds to make a compiler. You basically use a program to compile your syntax description into code that parses the syntax, then you write code to do some semantic stuff like figuring out which variable names refer to the same variable, and then you generate code by walking the parse tree and printing whatever your low level language is.
This, more or less, describes about 90% of Compiler 1 classes (including the one I taught). [Source of 90% statistic: my ass.]

Most real languages are a lot more complicated in weird ways, but that covers the basics.

If that is something you want to learn in order to advance yourself, then absolutely it could be. You might want to get a start by looking over the source code to various compilers.
I have a suspicion that almost everyone would be completely lost starting there; especially because C is an unusually obnoxious language to parse, and C++ even moreso, and at least word on the street is GCC is pretty much a morass.

There is some merit for some people in being thrown in the deep end, but I tend to suspect a more gradual learning curve is better. There are probably compiler classes on MIT OCW and other similar sources; I tend to think that starting with a simple language while learning the basics is generally a better idea. I also don't have citations to support that.

• Some interesting answers, looks like it is something I could conceivably work at in my spare time. My degree is in maths, so all of my computing experience is more practical and less CS.

I would say more than that

I guess there was an implied "at a bare minimum" in there. I don't know much about how chips work at a deep level, but I'm at least aware of the mathsy stuff you mentioned. Never learned any assembly, but I know that it's basically a set of machine code instructions that simplify from "change these bytes" to "move this value between registers or whatever"

It's not actually as difficult as it sounds to make a compiler

I figured it would either be simpler than it sounds or a hell of a lot more difficult.

(I'd also recommend writing a compiler/interpretor at least once, you learn a lot of stuff that you'd never think about otherwise. Also, knowing how to do proper lexing and parsing is helpful - it'll help realize how much of an abomination the Discourse-markdown "parser" really is ).

If it helps me bash Discourse more effectively I'm definitely in

• I don't know much about how chips work at a deep level

Check this out:

I found it fascinating. Inefficient as hell, very slow, but a great visual illustration of what goes on at the nanometer scale inside of a CPU.

• These two recommended the dragon book
I am not the biggest fan of the dragon book, though it's OK. (It is much more complete than anything else that could conceivably be used at the intro level.)

This is the text that I recommended the second time through my class (when I knew better what I was doing):

Would it have to be done in assembler?
As others have said, this isn't even remotely true.

Would it have to output assembler? (I guess a C#/Java compiler would output CLI/JVM bytecode)
The three categories of choices are:

1. Assembler. For compiler classes, MIPS seems to be popular: output to MIPS asm, then emulate with SPIM.
2. Bytecode for JVM or I guess CLI. I know some classes target the JVM; there's an assembler called Jasmin that goes to bytecode. There's no need to make your language C#/Java to compile to this -- remember, the JVM and CLI both host lots of languages. Some of these languages (e.g. Clojure, though I'm not sure if that's actually compiled or just interpreted) look entirely different from Java. A really promising target I think is LLVM; that's growing in popularity a lot.
3. A high-level language, probably C. Early C++ compilers compiled to C. Glasgow Haskell Compiler can compile to C. Lots of compilers in sort of an experimental standpoint can compile to C.

The first time I taught I targeted the JVM, the second time I targeted MIPS. I think the latter worked better than the former. If I were to do it again, I'd strongly look at LLVM.

• Some of these languages (e.g. Clojure, though I'm not sure if that's actually compiled or just interpreted) look entirely different from Java

Clojure is both -- you compile it to bytecode, but you can pack a REPL in with your program as well.

• I have a suspicion that almost everyone would be completely lost starting there; especially because C is an unusually obnoxious language to parse, and C++ even moreso, and at least word on the street is GCC is pretty much a morass.

Fair enough, @tarunik suggested LLVM as a better possibility. I work best under pressure. If I want to learn something, I do some research and just dive in. You are right though, a lot of other people would prefer and learn better by finding something existing and modifying it to their tastes. Usually when a person wants to learn something new, I ask them why they want to learn it (I am a big fan of "why"). Usually from their answer I can ascertain a course of action and some way that they can solve a problem of their own, to keep them interested.

In this case, it is purely academic, and I was a horrible student so I pulled that suggestion from the same place you pulled your earlier statistic:

[Source of 90% statistic: my ass.]

Except...you know...my ass and not yours.

• LLVM.

With LLVM as the backend, and a language grammar built with Lemon, you have about 85% of a native compiler in place—that's my proposed strategy for my next shot at it anyway...

• I've only made some toy interpreters, so I'm not an authority on the subject, but as I understand it writing a compiler requires two main things: knowing the format of the output, and knowing the format of the input. It's then "simply" a matter of matching one to the other.

So, for example, you could build a compiler for java code that instead of spitting byte code for the JVM spit out an executable for windows, or an executable for linux (I hear elf is a simple format for executables). Of course, your compiler would need to provide the framework (add code for the garbage collector, libraries, etc), but you wouldn't need to run on the jvm.

• That's the dragon book, and like I said I'm not a huge fan. But definitely would recommend the second edition if you were to get it. That's not one of those "let's update a few problem numbers and extort another $100 from people" changes; a lot of the compiler landscape looks very different from when the 1st edition was published. (In particular, SSA form had not yet been invented, although there were some things like value numbering that are moving in that direction of course. But SSA is one of the foundational building blocks of modern compilers -- I'm not totally sure about this, but I think that revamping GCC's IR to use SSA form was the main factor in bumping the major version from 3 to 4.) It's also expanded a lot; the second edition has one of the few good accounts of how several different garbage collectors work that I know of, for example. • Well, you see, Discourse invites you to reply without reading the thread, so I replied. • Pretty much. A compiler is just "the command pattern" or the "interpreter pattern" written out as a separate piece of software. So, the software will need to parse the source code, turn it into an abstract syntax tree, and recursively descend the tree, figuring out what commands to issue to the underlying system as it does. In other words, compiling the program means folding the AST using a monad as the accumulating monoid. The latter part can be non-trivial, though. For example, consider an instruction to branch conditionally. If the target language is at a low enough level, you will have to figure out how to make the compiler calculate the right address to jump to. • That's why generating LLVM IR is a pretty good approach. You effectively generate a high-level assembly-language description of what you want even if not very efficient, and let the LLVM take over from there for all the low-level optimization, register allocation, address binding and code-issuing stuff. All the stuff that you usually don't want to care about. It's types are also a lot easier to work with than those of the real underlying machine. It's not perfect, but it does a remarkably good job. As an approach, it's much easier than doing it all yourself, and the LLVM IR (in textual form) is actually quite readable. And yes, it makes writing a compiler as a hobby quite possible. I'm doing just that. (An aside: optimizing compilers are the main users of theorem proving in modern software. Most of the theorems proved are pretty simple, but there's loads of 'em and they're important to doing it right.) • I hear this is the correct book http://www.amazon.com/Compilers-Principles-Techniques-Alfred-Aho/dp/0201100886/ That's what we used in my compiler course (might still have it on my shelf). All I really remember from that was the language was a simplified pascal-like syntax, we wrote it in C, and it was a group project. And I created a debugger (using curses - console "graphics", not colorful words - tho I'm sure there were a few of those). The time spent creating that saved us innumerable hours! (I seem to remember the instructor asking if he could use it in future classes...)  If I remember, I think the guts of the debugger were provided. I just put a "graphical" interface on top. • using curses - console "graphics", not colorful words - tho I'm sure there were a few of those No doubt. • ...let the LLVM take over from there for all the low-level optimization, register allocation, address binding and code-issuing stuff. All the stuff that you usually don't want to care about. Well, that depends on why you're writing a compiler. If you want your own language or otherwise want to play around with the front end, sure, that's the case. But if you are using it as a vehicle to understand how production compilers work and how source turns into executable code, then almost all of the interesting and hard stuff goes on in the optimizer and code generator. • But if you are using it as a vehicle to understand how production compilers work and how source turns into executable code, then almost all of the interesting and hard stuff goes on in the optimizer and code generator. Then one of the places to start is with reading the LLVM source. Pulling yourself up entirely by just your bootstraps is for the masochistic only. • Neat, what language are you compiling from? Perhaps the most appealing feature of LLVM is that it has been designed from the start to be a compiler framework. For researchers like the GHC developers, this is a great benefit and makes LLVM a very fun playground. For example, within a couple of days of the public release of the LLVM backend one developer, Don Stewart, wrote a genetic algorithm to find the best LLVM optimisation pipeline to use for various Haskell programs (you can find his blog post about this here). • But if you are using it as a vehicle to understand how production compilers work and how source turns into executable code, then almost all of the interesting and hard stuff goes on in the optimizer and code generator. And type checking and inference engine... System$F_\omega$ftw. • don't need to spend much time on a lexer or grammar and you can focus on the interesting parts of a compiler... DOES NOT COMPUTE • Neat, what language are you compiling from? I'm making a JIT for Tcl. I can make a decent fist of mathematical code (which gets a gigantic performance boost) but haven't tackled non-numeric or compound types yet. But I do handle spilling from int32 to int64, and the technique used will expand to bignums too. Just haven't done that bit yet. (Trying to fix some other problems first.) In fact, the complex part is deducing what the types are in the first place. I don't want to go down the route of requiring intrusive type annotations, since that would act as a major barrier to using the JIT to accelerate existing user code. It's fun! It'd be even funner if the crashes (when I get it wrong) weren't so hard… • You mean (not (compute))   • You mean <code>(not (compute)) (lambda (it) (not (compute it))) • Point free: not . compute • can you do not$ compute as well or is that different? </haskellnoob>

• not $compute would apply the not function to compute (which makes more sense for the joke). But it is subtly different from not . compute, which applies not to the result of compute. • \it -> not . compute$ it

• λx.nope

• I thought about adding a "or you're doing something with a fancy type system", but decided against it.

• [COMPUTES:NO]

<Dwarf Fortress raws, for anyone wondering.

• 
: RUN COMPUTE NOT . ;
RUN


• On topic, I have a copy of this lurking somewhere in my apartment:

I remember it being pretty up-to-date when I read it, haven't kept up with the state of the art to know how well it's aging. If anyone's interested, I can try to dig it out and give a more comprehensive synopsis.

I think I may have a copy of this too (probably in the same pile of books in my "office"):

• Holy fucknuts, how did I miss this thread before? Thank Eris for Discromancy!

It used to be traditional to write a Scheme compiler in school, didn't it? Which is why there's approximately 1,001 different Scheme compilers available on the net.

There is a really good white paper on this subject from around 2006 called
"An Incremental Approach to Compiler Construction" which walks through using Scheme in this manner to simplify teaching basic compiler design. It leverages the Scheme read function to avoid having to handle parsing at all, so it can focus on the code generation.

The approach is quite different from conventional compiler courses, which usually focus on (semi-)formal methods for lexical analysis and parsing, and usually end up building an LL(1) recursive-descent parser for some simplified Algol family language (the one I took even used 'Baby Algol' as the source language - in 2008 - I gather that this was because the professor was changing the language every time she taught the course to reduce cheating), though the implementation language was left to the student (most used either C, C++, or Java; I used Python, which threw the professor a bit but worked like a charm once I stopped over-thinking the lexer).

It's not actually as difficult as it sounds to make a compiler

I figured it would either be simpler than it sounds or a hell of a lot more difficult.

Both. Writing a simple ad-hoc lexer and RD direct-emit parser (i.e., one which simply writes out the assembly code to a file directly from the parser, rather than passing it a separate code generator) with little or no optimization is a snap, if time consuming, but they tend to run slowly, have crappy error detection and error messages, and produce shitty code. The hardest parts for most students is to first step of wrapping their head around the mathematical formalisms - which is why most compiler courses spend half of the course teaching how to apply Finite-State Automata to writing the lexical analyzer, the next quarter to third on writing ε productions for the parser, and using whatever is left over to cover writing and debugging the actual compiler (which generally covers only basic expressions and conditionals, an indefinite loop, and maybe simple non-returning functions of no arguments).

A production quality compiler, OTOH, is usually a massive project, comparable in scope to developing a production quality operating system. The typical compiler course doesn't even see something like that on the horizon, never mind approach it.

• A production quality compiler, OTOH, is usually a massive project, comparable in scope to developing a production quality operating system.

A production-quality compiler includes writing an optimiser, and is where the theory-heavy parts of compilation start to really kick in. Unless you're doing deep type analysis (which you might or might not need; depends on the language) when the brain-bending stuff kicks in sooner. Ultimately, the compiler writer is usually trying to translate the code that was written into something that is well-typed with sane semantics, and then the optimiser turns that into something that can be used to issue good machine code.

That compilers work as well as they do can be viewed as the ultimate proof of the success and usefulness of theoretical computer science, as they're very very dependent on decades of work in the theory side of things.

• That compilers work as well as they do can be viewed as the ultimate proof of the success and usefulness of theoretical computer science, as they're very very dependent on decades of work in the theory side of things.

QFT.

• What does a computer do? At its most basic level? It adds, compares or moves numbers.

I'd argue that at its most basic level, it's handling not numbers but bits. Some of its internal operations are designed for manipulating groups of those bits that encode numbers, but most are not.

Subtraction is a function of addition. Multiplication is a function of addition. Division is a function of subtraction, which in turn is a function of addition.

Conceptually, sure. As actually implemented on modern hardware, not so much.

• I'd argue that at its most basic level, it's handling not numbers but bits. Some of its internal operations are designed for manipulating groups of those bits that encode numbers, but most are not.

At the conceptual level, computers handle symbols. Those symbols are assigned meanings by the context outside the computer and are essentially arbitrary, though assistance is given for treating certain symbols as numbers and manipulations of those symbols as arithmetic (because that's f'ing convenient).

• At a conceptual level, computers are magical super-genius dogs but instead of fetching sticks, they fetch answers from an alternative dimension named compu-town which they can teleport to and fro so quickly that heat sinks are needed to disperse heat generated by the air friction. From the magical teleporting dogs.

• At the conceptual level, computers handle symbols.

I still think that aside from the occasional ternary oddity you get the best bang for your conceptual buck from conceiving of the computer's raw materials specifically as bits. And sure, bits are a kind of symbol and yes, their two possible values and/or the distinction between those values always has a context-dependent meaning if it has any meaning at all.

The only way a computer ever actually manipulates symbols in the more general sense is via encodings of those as groups of bits. Also, at least some of what computers do is straight-up bit manipulation where the bits in and of themselves are assigned individual meanings rather than participating in some coding for another symbol.

Thinking on the level of generalized symbols is very useful as well, and certainly more accurate than the common notion that everything the computer works with is actually numbers, but it's not really foundational.

• I still think that aside from the occasional ternary oddity you get the best bang for your conceptual buck from conceiving of the computer's raw materials specifically as bits.

That depends on whether you're looking at computer architecture or software architecture. In terms of hardware, sure, it's bits and collections of bits. From a software perspective, it's symbols. Some people like to call those symbols by other names, such as “characters”, “words”, “numbers”, “identifiers”, etc., but the reality's just symbols.

Looks like your connection to What the Daily WTF? was lost, please wait while we try to reconnect.