Compilers
This write-up is about the compiler my group and I wrote for the WACC group project in the second year of Computing at Imperial.
I was fascinated by how interesting (and challenging) it is to write a compiler, even for a simple language like WACC. While my initial picture of compilers was biased towards parsers, using a functional language like Scala 3 helped me get past this and enjoy the whole process.
Structure of Compilers
Typically, a compiler consists of:
1) Frontend
2) Backend
It is also possible to have "mid-end"s between the two stages, for example if the compiler supports staged metaprogramming.
While the frontend is concerned with validating the syntax and semantics of the language, the backend is concerned with code generation and the emitting of assembly.
However, it is also possible to compile a language into another source language instead of generating assembly directly; compilers that do this are called transpilers. For instance, GHC, the main Haskell compiler, originally emitted C code instead of assembly, which made it effectively a transpiler.
Frontend
The frontend typically has this pipeline:
1) Parser
2) Renamer
3) Typechecker
Naturally, this pipeline can be expanded or shrunk depending on the language's grammar and complexity.
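The three-stage pipeline above can be sketched as a composition of functions, each of which either fails with errors or passes a (progressively enriched) program on to the next stage. This is only an illustrative sketch: the names `Prog`, `parse`, `rename`, and `typecheck` are placeholders, not our actual WACC code.

```scala
// Illustrative sketch of a three-stage frontend pipeline.
// Prog stands in for a real AST; each stage returns Either[Errors, Prog].
object Pipeline:
  type Errors = List[String]
  final case class Prog(source: String)

  def parse(src: String): Either[Errors, Prog] =
    if src.nonEmpty then Right(Prog(src)) else Left(List("empty program"))

  def rename(p: Prog): Either[Errors, Prog] = Right(p)    // placeholder
  def typecheck(p: Prog): Either[Errors, Prog] = Right(p) // placeholder

  // The frontend is just the composition of the three stages:
  def frontend(src: String): Either[Errors, Prog] =
    parse(src).flatMap(rename).flatMap(typecheck)
```

Using `Either` here means the first failing stage short-circuits the rest of the pipeline, which is exactly the behaviour you want from a frontend.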
For example, Scala 3 has 80+ passes! They are:
phase name                description
----------                -----------
parser                    scan and parse sources
typer                     type the trees
checkUnusedPostTyper      check for unused elements
checkShadowing            check for elements shadowing other elements in scope
inlinedPositions          check inlined positions
sbt-deps                  sends information on classes' dependencies to sbt
extractSemanticDB (ExtractSemanticInfo)  extract info into .semanticdb files
posttyper                 additional checks and cleanups after type checking
unrollDefs                generates forwarders for methods annotated with @unroll
prepjsinterop             additional checks and transformations for Scala.js
SetRootTree               set the rootTreeOrProvider on class symbols
pickler                   generates TASTy info
sbt-api                   sends a representation of the API of classes to sbt
inlining                  inline and execute macros
postInlining              add mirror support for inlined code
staging                   check staging levels and heal staged types
splicing                  splicing
pickleQuotes              turn quoted trees into explicit run-time data structures
checkUnusedPostInlining   check for unused elements
instrumentCoverage        instrument code for coverage checking
crossVersionChecks        check issues related to deprecated and experimental
firstTransform            some transformations to put trees into a canonical form
checkReentrant            check no data races involving global vars
elimPackagePrefixes       eliminate references to package prefixes in Select nodes
cookComments              cook the comments: expand variables, doc, etc.
checkLoopingImplicits     check that implicit defs do not call themselves in an infinite loop
betaReduce                reduce closure applications
inlineVals                check right hand-sides of `inline val`s
expandSAMs                expand SAM closures to anonymous classes
elimRepeated              rewrite vararg parameters and arguments
refchecks                 checks related to abstract members and overriding
dropForMap                drop unused trailing map calls in for comprehensions
extractSemanticDB (AppendDiagnostics)    extract info into .semanticdb files
initChecker               check initialization of objects
protectedAccessors        add accessors for protected members
extmethods                expand methods of value classes with extension methods
uncacheGivenAliases       avoid caching RHS of simple parameterless given aliases
checkStatic               check restrictions that apply to @static members
elimByName                map by-name parameters to functions
hoistSuperArgs            hoist complex arguments of supercalls to enclosing scope
forwardDepChecks          ensure no forward references to local vals
specializeApplyMethods    adds specialized methods to FunctionN
tryCatchPatterns          compile cases in try/catch
patternMatcher            compile pattern matches
preRecheck                preRecheck
recheck                   recheck
setupCC                   prepare compilation unit for capture checking
cc                        capture checking
elimOpaque                turn opaque into normal aliases
explicitJSClasses         make all JS classes explicit
explicitOuter             add accessors to outer classes from nested ones
explicitSelf              make references to non-trivial self types explicit as casts
interpolators             optimize s, f, and raw string interpolators
dropBreaks                replace local Break throws by labeled returns
pruneErasedDefs           drop erased definitions and simplify erased expressions
uninitialized             eliminates `compiletime.uninitialized`
inlinePatterns            remove placeholders of inlined patterns
vcInlineMethods           inlines calls to value class methods
seqLiterals               express vararg arguments as arrays
intercepted               rewrite universal `!=`, `##` methods
getters                   replace non-private vals and vars with getter defs
specializeFunctions       specialize Function{0,1,2} by replacing super with specialized super
specializeTuples          replaces tuple construction and selection trees
collectNullableFields     collect fields that can be nulled out after use in lazy initialization
elimOuterSelect           expand outer selections
resolveSuper              implement super accessors
functionXXLForwarders     add forwarders for FunctionXXL apply methods
paramForwarding           add forwarders for aliases of superclass parameters
genericTuples             optimize generic operations on tuples
letOverApply              lift blocks from receivers of applications
arrayConstructors         intercept creation of (non-generic) arrays and intrinsify
erasure                   rewrite types to JVM model
elimErasedValueType       expand erased value types to their underlying implementation types
pureStats                 remove pure statements in blocks
vcElideAllocations        peep-hole optimization to eliminate unnecessary value class allocations
etaReduce                 reduce eta expansions of pure paths
arrayApply                optimize `scala.Array.apply`
addLocalJSFakeNews        adds fake new invocations to local JS classes in calls to `createLocalJSClass`
elimPolyFunction          rewrite PolyFunction subclasses to FunctionN subclasses
tailrec                   rewrite tail recursion to loops
completeJavaEnums         fill in constructors for Java enums
mixin                     expand trait fields and trait initializers
lazyVals                  expand lazy vals
memoize                   add private fields to getters and setters
nonLocalReturns           expand non-local returns
capturedVars              represent vars captured by closures as heap objects
constructors              collect initialization code in primary constructors
instrumentation           count calls and allocations under -Yinstrument
lambdaLift                lifts out nested functions to class scope
elimStaticThis            replace This references to static objects by global identifiers
countOuterAccesses        identify outer accessors that can be dropped
dropOuterAccessors        drop unused outer accessors
dropParentRefinements     drop parent refinements from a template
checkNoSuperThis          check that supercalls don't contain references to This
flatten                   lift all inner classes to package scope
transformWildcards        replace wildcards with default values
moveStatic                move static methods from companion to the class itself
expandPrivate             widen private definitions accessed from nested classes
restoreScopes             repair rendered invalid scopes
selectStatic              get rid of selects that would be compiled into GetStatic
junitBootstrappers        generate JUnit-specific bootstrapper classes for Scala.js
Collect entry points      collect all entry points and save them in the context
collectSuperCalls         find classes that are called with super
repeatableAnnotations     aggregate repeatable annotations
genSJSIR                  generate .sjsir files for Scala.js
genBCode                  generate JVM bytecode
The last phase, genBCode, generates the JVM bytecode (Scala being a JVM language), so it can be considered the compiler's backend.
Scala 3 is a complicated language with an extensive syntax, so it makes sense for it to have such a large frontend, which shows in its somewhat longer compilation times.
WACC, however, is a simple language based on the While family, so its frontend can consist of only the 3 stages listed above.
Note
A lexing stage (in which the source text is converted into tokens) usually comes before the parser, but since we used Parsley, lexing and parsing were handled together, eliminating the need for a separate lexing stage.
Parser
The parser is the stage where the source language gets converted into syntactic information. Note that the context here is purely syntactic: the parser is only concerned with turning the program into a tree structure (called an AST) according to the grammar rules of the language.
The grammar rules are often represented in the BNF form, such as:
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<integer> ::= <digit> | <digit><integer>
<floating point> ::= <integer>.<integer>
Or more generally:
<rule> ::= <token1> | <token2> | ...
Such a set of BNF grammar rules can fully describe the syntax of the source language, and it is left to the parser to consume the source text and build the corresponding AST.
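Before reaching for a library, it is worth seeing that each BNF rule translates almost mechanically into a recursive function. Below is a hand-rolled sketch (illustrative only, not our actual parser), where each function consumes a prefix of the input and returns the remaining input on success:

```scala
// Each BNF rule becomes a function List[Char] => Option[List[Char]]:
// Some(rest) means "matched a prefix, `rest` is what's left"; None means failure.
object Bnf:
  // <digit> ::= 0 | 1 | ... | 9
  def digit(in: List[Char]): Option[List[Char]] = in match
    case c :: rest if c.isDigit => Some(rest)
    case _                      => None

  // <integer> ::= <digit> | <digit><integer>  (greedy: keep consuming digits)
  def integer(in: List[Char]): Option[List[Char]] =
    digit(in).map(rest => integer(rest).getOrElse(rest))

  // <floating point> ::= <integer>.<integer>
  def floating(in: List[Char]): Option[List[Char]] =
    for
      afterInt  <- integer(in)
      afterDot  <- if afterInt.headOption.contains('.') then Some(afterInt.tail) else None
      afterFrac <- integer(afterDot)
    yield afterFrac

  // A string is a floating-point literal if the rule consumes all of it.
  def isFloat(s: String): Boolean = floating(s.toList).contains(Nil)
```

A parser-combinator library essentially packages this pattern up, adding error messages, backtracking control, and result construction on top.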
We used a parser-combinator library called Parsley to make our lives easier. Combined with the ease of use of Scala, representing the language using combinators was effective and simple.
Our example parsers include:
/* Statements */
private lazy val stmt: Parsley[Stmt[UN]] =
  (extraSemicolonErr | skip | decl | update | scope | read | free | ret | exit | print | println | ifElse | whileDo)
    .label("statement")
    .explain("expected a valid statement (e.g. assignment, control flow, or I/O)")

private lazy val skip: Parsley[Stmt.Skip[UN]] = Skip from "skip"
private lazy val decl: Parsley[Stmt.Decl] = Decl(typ, unnamed, "=" ~> rvalue)
private lazy val update: Parsley[Stmt.Update[UN]] = Update(lvalue <~ "=", rvalue, semTypeUnknown)
private lazy val scope: Parsley[Stmt.Scope[UN]] = Scope(block(stmts), noVars)
private lazy val read: Parsley[Stmt.Read[UN]] = Read("read" ~> lvalue)
private lazy val free: Parsley[Stmt.Free[UN]] = Free("free" ~> expr)
private lazy val ret: Parsley[Stmt.Return[UN]] = Return("return" ~> expr)
private lazy val exit: Parsley[Stmt.Exit[UN]] = Exit("exit" ~> expr)
private lazy val print: Parsley[Stmt.Print[UN]] = Print("print" ~> expr)
private lazy val println: Parsley[Stmt.Println[UN]] = Println("println" ~> expr)
private lazy val ifElse: Parsley[Stmt.IfElse[UN]] =
  IfElseScoped(
    "if" ~> expr,
    (missingThenErr | "then") ~> Scope(stmts, noVars),
    "else".explain("if statements must have both a then and an else branch") ~> Scope(stmts, noVars) <~ "fi"
  )
As can be seen, it is relatively easy to parse the source language (strings) into AST nodes.
AST
An abstract syntax tree (AST) is a tree data structure used to store the program's structure.
It mainly consists of expressions (Expr), statements (Stmt), and other nodes depending on the language.
In Scala 3, we modeled this using enums and case classes, which makes pattern matching incredibly powerful later on. For instance, a simple expression AST might look something like this:
object Expr {
  sealed trait Literal[T <: Id] extends Expr[T]
  // ... (literals)
  sealed trait BinOp[T <: Id] extends Expr[T] { val x: Expr[T]; val y: Expr[T] }
  // ... (binary ops)
  sealed trait UnOp[T <: Id] extends Expr[T] { val x: Expr[T] }
  // ... (unary ops)
  // ... (more nodes)
}
Once the parser gives us this tree, the syntactic job is done. But just because a program is structurally correct doesn't mean it makes sense.
We have to convert syntactic context into a semantic context, validating the semantics of the language as we traverse the AST.
For example, 1 + "hello" is perfectly valid syntax (parsed as BinOp.Add(Literal.Int(1), Literal.String("hello"))), but it is semantic nonsense.
This is where the rest of the frontend pipeline comes in.
Renamer
You can declare a variable x in the main scope, and then declare another x inside a while loop. The parser doesn't care about this and just sees identical strings, even though the two variables live in different scope contexts.
The Renaming stage (often tied to Scope Resolution) walks through the AST and disambiguates these names. It ensures that every variable reference points to the exact declaration it belongs to. By giving every variable a unique ID, we safely resolve shadowing issues.
This pass essentially sanitizes the AST, making life much easier for the Typechecker.
For example:
begin
int x = 10;
begin
string x = "a"
end
end
The variable declarations have to be renamed as:
- For the outer scope x: Update(Renamed(x, 0, ...), Literal.Int(10))
- For the inner scope x: Update(Renamed(x, 1, ...), Literal.String("a"))
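The renaming walk above can be sketched as a small scope-stack traversal. This is only illustrative (toy Stmt nodes, not our actual WACC AST): every declaration receives a globally fresh ID, and every use resolves to the innermost matching declaration, which is exactly how shadowing gets disambiguated.

```scala
// Illustrative renamer: a stack of scopes (innermost first), fresh IDs per Decl.
object Renamer:
  enum Stmt:
    case Decl(name: String)          // e.g. `int x = ...`
    case Use(name: String)           // a read of a variable
    case Block(body: List[Stmt])     // begin ... end
  import Stmt.*

  // Returns each Decl/Use as (name, uniqueId), in traversal order.
  def rename(prog: List[Stmt]): List[(String, Int)] =
    var fresh = -1
    def go(stmts: List[Stmt], outer: List[Map[String, Int]]): List[(String, Int)] =
      var cur = Map.empty[String, Int] // the scope currently being built
      stmts.flatMap {
        case Decl(n) =>
          fresh += 1                   // globally unique ID
          cur += (n -> fresh)
          List(n -> fresh)
        case Use(n) =>
          // look the name up innermost-scope-first
          val id = (cur :: outer).collectFirst { case m if m.contains(n) => m(n) }
          List(n -> id.getOrElse(sys.error(s"undeclared variable $n")))
        case Block(body) =>
          go(body, cur :: outer)       // push the current scope and recurse
      }
    go(prog, Nil)
```

On a program shaped like the WACC snippet above (a Decl of x, a Block redeclaring and using x, then a Use of x), the outer x gets ID 0 and the inner x gets ID 1, with each Use resolving to the right one.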
Typechecker
This is where the semantic rules of WACC are strictly enforced. The typechecker traverses the renamed AST and checks for errors such as:
- Type mismatch
- Function parameter mismatch
- Freeing a value of a non-freeable type
and so on.
Since we wrote this in Scala, the typechecker is essentially a large set of recursive match statements over the AST nodes. It computes the type of every Expr, checks that every Stmt is valid within its current context, and ensures that types "satisfy" their constraints.
The typechecker also has to factor in type weakening, such as the conversion of char[] into string.
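As a sketch of what those recursive matches look like, here is a toy typechecker (illustrative types and nodes, not our real WACC code; `satisfies` is a hypothetical helper encoding the char[]-to-string weakening):

```scala
// Toy typechecker: compute each expression's type via recursive pattern
// matching, and encode weakening rules in a separate `satisfies` check.
object Typechecker:
  enum Ty:
    case IntT, CharT, StringT, BoolT
    case ArrayT(elem: Ty)

  enum Expr:
    case IntLit(v: Int)
    case CharLit(c: Char)
    case StrLit(s: String)
    case ArrayLit(elems: List[Expr])
    case Add(x: Expr, y: Expr)

  import Ty.*, Expr.*

  def typeOf(e: Expr): Either[String, Ty] = e match
    case IntLit(_)         => Right(IntT)
    case CharLit(_)        => Right(CharT)
    case StrLit(_)         => Right(StringT)
    case ArrayLit(e0 :: _) => typeOf(e0).map(ArrayT(_)) // assume homogeneous
    case ArrayLit(Nil)     => Left("cannot infer type of empty array literal")
    case Add(x, y) =>
      for
        tx <- typeOf(x)
        ty <- typeOf(y)
        r  <- if tx == IntT && ty == IntT then Right(IntT)
              else Left(s"cannot add $tx and $ty")
      yield r

  // Type weakening: a char[] may be used wherever a string is expected.
  def satisfies(actual: Ty, expected: Ty): Boolean =
    actual == expected || (actual == ArrayT(CharT) && expected == StringT)
```

Running `typeOf` on the earlier 1 + "hello" example produces a Left with a type error, which is precisely the semantic nonsense the parser could not catch.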
If a program passes this stage without throwing any semantic errors, we finally have a structurally and semantically sound program. The frontend's job is officially complete.
Backend
todo