Compiling Pascal with LLVM: Part 2

Parsing

Now to the fun part. We'll be using a recursive parser, just like in Crafting interpreters, so you can easily skip this part if you're familiar with the technique. I made a slight modification to the parsing of binary operators though.

I wrote the parser for Pascal's grammar from memory (a very long-term memory), and apparently ended up with a less restrictive grammar. I might fix that in the future though.

We'll start with a simple class with several helper methods:

from tokenize import TokenInfo
from more_itertools import peekable
from ..tokenizer import TokenType


# just for convenience
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = peekable(tokens)

    def consume(self, *types: TokenType, string: str | None = None) -> TokenInfo:
        if not self.matches(*types, string=string):
            raise ParseError(self.peek(), types, string)
        return next(self.tokens)

    def consumed(self, *types: TokenType, string: str | None = None) -> bool:
        success = self.matches(*types, string=string)
        if success:
            self.consume()
        return success

    def peek(self) -> TokenInfo:
        if not self.tokens:
            raise ParseError
        return self.tokens.peek()

    def matches(self, *types: TokenType, string: str | None = None) -> bool:
        if not self.tokens:
            return False
        token = self.peek()
        if types and token.type not in types:
            return False
        if string is not None and token.string.lower() != string.lower():
            return False
        return True

basically all they do is consume the next token while checking some constraints.

Next, we'll build a collection of nodes, which the parser will have to generate.

Expressions

Expressions are the parts of code that produce a value. For example, these

1 + 1
f(1, 2, 3)
1 + f(2) + x.count * array[0]

are expressions, but this isn't

if x = 1 then
begin
end;

Technically, not all calls produce values, because Pascal makes a distinction between functions and procedures, but here we'll treat procedures as functions that return void, and let the type system worry about them.

Literals

The simplest expression is a literal like 1 or 'abc'.

from dataclasses import dataclass
from typing import Any

# we want each node to be unique, so that two nodes with same fields still won't be equal  
unique = dataclass(eq=False)


@unique
class Const:
    value: Any
    type: str

next we'll add a _primary method to the Parser. Which will return this node:

# TokenType was defined in the previous post
from .tokenizer import TokenType


class Parser:
    # ...

    # expression is just a convenience method for now
    def _expression(self):
        return self._primary()

    def _primary(self):
        match self.peek().type:
            case TokenType.NUMBER:
                body = self.consume().string
                if '.' not in body:
                    return Const(int(body), 'integer')
                return Const(float(body), 'real')

            case TokenType.STRING:
                value = self.consume().string
                if not value.startswith("'"):
                    raise ParseError('Strings must start and end with apostrophes')
                value = eval(value).encode() + b'\00'
                return Const(value, 'string')

            case _:
                raise ParseError(self.peek())

Here we use null-terminated strings, which is not very Pascal'ish. However, this will simplify things for us when we'll be working with C functions.

and now you can spin up the whole thing:

text = '1'
parser = Parser(tokenize(text))
print(parser._expression())

this should give you something like Const(value=1, type='integer'), and if you try and call it with

text = 'x + y'

it would raise a ParseError. So far so good!

Names

While we're at it, let's add another simple node - name access. We'll emit it each time someone references a variable or function by its name:

@unique
class Name:
    name: str

this will be another case in the _primary function:

case TokenType.NAME:
    return Name(self.consume().string)

Tails

Next we'll parse something more interesting:

f(1, 2, x, y)
student.name
array[index]
myPointer^

So, function calls, field access, array indexing and pointers dereferencing. As usual let's add some nodes first:

@unique
class GetItem:
    target: Any
    args: tuple[Any]


@unique
class GetField:
    target: Any
    name: str


@unique
class Dereference:
    target: Any


@unique
class Call:
    target: Any
    args: tuple[Any]

and a new method _tail. Let's start with pointers:

class Parser:
    # ...

    def _expression(self):
        return self._tail()

    def _tail(self):
        target = self._primary()
        while self.matches(TokenType.LSQB, TokenType.DOT, TokenType.CIRCUMFLEX, TokenType.LPAR):
            match self.consume().type:
                case TokenType.CIRCUMFLEX:
                    target = Dereference(target)

                # ... other cases here

        return target

So, what's going on here?

First we parse the target - a constant or a variable name. This is exactly what _primary returns
Next, we check whether it has a tail - a dereferencing operator, a field access etc
Finally, we wrap the target in a new node

Also, because we use a while loop here, we'll be able to parse stuff like

f(1, 2)[3].names[0]^

Now let's add this logic. Field access is also as simple as

case TokenType.DOT:
    name = self.consume(TokenType.NAME).string
    target = GetField(target, name)

Here we already consumed the dot in the match self.consume... part, so we consume the second token, which must be a name, and finally we wrap the target in the GetField node.

Finally, two more interesting tails:

# array indexing
case TokenType.LSQB:
    args = [self._expression()]
    while self.consumed(TokenType.COMMA):
        args.append(self._expression())
    self.consume(TokenType.RSQB)
    target = GetItem(target, tuple(args))

# function call
case TokenType.LPAR:
    args = []
    while not self.matches(TokenType.RPAR):
        if args:
            self.consume(TokenType.COMMA)
        args.append(self._expression())
    self.consume(TokenType.RPAR)
    target = Call(target, tuple(args))

In both cases we parse a sequence of arguments separated by commas, and consume the matching bracket at the end.

Unary operators

In case of tails we were able to use a while loop, because we first needed to parse the target, and then an indefinite number of tails. In case of unary operators, though, the target comes at the very end, which means that we first need to consume all the unary operators, push them into a stack, then consume the target, and wrap it while unwinding the stack. This sounds like recursion to me:

def _unary(self):
    if self.peek().string.lower() in ('@', 'not', '-', '+'):
        return Unary(self.consume().string.lower(), self._unary())
    return self._tail()

and let's not forget the node:

@unique
class Unary:
    op: str
    value: Any

Binary operators

Binary operators are particularly interesting, because with them the notion of priority or precedence arises very naturally. If you're not a lisper, then probably you wouldn't like writing

1 + 2 * 3 - 4 / 5 + 6

as

(((1 + (2 * 3)) - (4 / 5)) + 6)

this is where operators precedence kicks in.

By the way, you should definitely check out SICP, it's a timeless classic!

We'll parse that in the following way. First, let's group the operators by precedence:

PRIORITIES = {
    '*': 1,
    '/': 1,
    'div': 1,
    'mod': 1,
    'and': 1,
    '+': 2,
    '-': 2,
    'or': 2,
    '>': 3,
    '>=': 3,
    '<=': 3,
    '<': 3,
    '=': 4,
    '<>': 4,
    'in': 4,
}
MAX_PRIORITY = max(PRIORITIES.values())

next we start from the operations with the lowest precedence (max priority), and try to build its arguments from terms of higher precedence:

def _binary(self, priority):
    if priority <= 0:
        return self._unary()

    left = self._binary(priority - 1)
    while self.peek().string in PRIORITIES:
        op = self.peek().string
        current = PRIORITIES.get(op)
        # only consume the operation with the same priority
        if current != priority:
            break

        self.consume()
        right = self._binary(current - 1)
        left = Binary(op, left, right)

    return left

Basically here we're saying, that

1 * 2 + 3 * 4

is, first of all, a + with its arguments 1 * 2 and 3 * 4. This is why both left and right are assigned to operators with priority - 1. At the end we just fallback to _unary when we run out of operators.

Finally, as always, the node:

@unique
class Binary:
    op: str
    left: Any
    right: Any

and also let's update the _expression:

def _expression(self):
    return self._binary(MAX_PRIORITY)

Parenthesis

Yes, we just got rid of parentheses with the help of precedence. Now let's add them back in!

Parentheses have a very high priority, higher than any binary or unary operation or even a function call. So, _primary is the best place for them:

def _primary(self):
    match self.peek().type:
        case TokenType.LPAR:
            self.consume()
            value = self._expression()
            self.consume(TokenType.RPAR)
            return value

        # ... previous code

We don't need a separate node for this branch, because the information about precedence kind of emerges in the syntax tree by itself. However, sometimes it's a good idea to keep this information more explicit, e.g. when you're using your parser to write a code formatter.

This took a while, but we're finally done with parsing expressions. Now you can play with the parser and generate really complex structures. Note how parentheses influence (sometimes) the structure of the syntax tree.

Statements

Until now, we wrote the parser starting with the deepest nodes. This is useful when you want to be able to spin up the parser as you progress, and watch how it gets smarter. We'll do the same thing with statements, but in reverse.

Program

We'll start with simple programs that don't contain variables or functions:

@unique
class Program:
    body: tuple[Any]

Don't worry, we'll update this node later.

def _program(self):
    # program name
    self.consume(TokenType.NAME, string='program')
    self.consume(TokenType.NAME)
    if self.consumed(TokenType.LPAR):
        self.consume(TokenType.NAME)
        self.consume(TokenType.RPAR)

    self.consume(TokenType.SEMI)

    # the body, including the `end`
    statements = self._body()

    # final dot
    self.consume(TokenType.DOT)
    # no more tokens should remain
    if self.tokens:
        raise ParseError(self.tokens.peek())

    return Program(statements)

The only interesting part is the name parsing. It turns out, that Pascal allows the programmer to specify the names of input and output files used, something like

program myName(output_file);

I didn't know that until I started googling for larger samples of Pascal code just to test my parser.

The body

The body is just a sequence of statements between begin and end:

from jboc import composed

# ...

@composed(tuple)
def _body(self):
    self.consume(TokenType.NAME, string='begin')
    while not self.matches(TokenType.NAME, string='end'):
        if self.matches(TokenType.NAME, string='begin'):
            yield from self._body()
        else:
            yield self._statement()
        # the semicolon is optional in the last statement
        if not self.matches(TokenType.NAME, string='end'):
            self.consume(TokenType.SEMI)
    self.consume(TokenType.NAME, string='end')

Note that there's a special case we're dealing with: begin-end blocks can be nested. However, unlike most of modern languages, Pascal doesn't allow variables definition inside blocks, so this nesting doesn't have any semantic implications, that's why we're just unpacking it.

Here @composed(tuple) simply gathers all we've yeild-ed from the function into a tuple. This is handy in many situations.

The statement

For now, the only statement we know of is an expression statement, like a function call, or a useless arithmetic operation that isn't stored anywhere:

def _statement(self):
    value = self._expression()
    value = ExpressionStatement(value)
    return value

with a simple node

@unique
class ExpressionStatement:
    value: Any

Aaaaand we're finally ready to parse entire (but simple) programs like this one:

program itWorks;
begin
    1 + 2;
    writeln(3, 4, 5);
    writeln(6 + 7, 8 * 9)
end.

Pretty cool! The rest is just about adding more content. Let's dig quickly through the basics.

If

Because begin-end is only required for multiple expressions, we'll be using this util function a lot:

def _flexible_body(self):
    if self.matches(TokenType.NAME, string='begin'):
        return self._body()
    return self._statement(),

Now let's parse some statements:

def _if(self):
    self.consume(TokenType.NAME, string='if')
    condition = self._expression()
    self.consume(TokenType.NAME, string='then')
    left = self._flexible_body()
    if self.consumed(TokenType.NAME, string='else'):
        right = self._flexible_body()
    else:
        right = ()
    return If(condition, left, right)

Yep, basic if-then-else stuff.

@unique
class If:
    condition: Any
    then_: tuple[Any]
    else_: tuple[Any]

For and while

This is more or less the same

def _for(self):
    self.consume(TokenType.NAME, string='for')
    name = Name(self.consume(TokenType.NAME).string)
    self.consume(TokenType.COLONEQUAL)
    start = self._expression()
    self.consume(TokenType.NAME, string='to')
    stop = self._expression()
    self.consume(TokenType.NAME, string='do')
    body = self._flexible_body()
    return For(name, start, stop, body)

def _while(self):
    self.consume(TokenType.NAME, string='while')
    condition = self._expression()
    self.consume(TokenType.NAME, string='do')
    body = self._flexible_body()
    return While(condition, body)

and the nodes

@unique
class For:
    name: Name
    start: Any
    stop: Any
    body: tuple[Any]

@unique
class While:
    condition: Any
    body: tuple[Any]

Assignments

With new structures in place, it's time to update the _statement. We'll also add assignment syntax while we're at it:

def _statement(self):
    if self.matches(TokenType.NAME, string='if'):
        return self._if()
    if self.matches(TokenType.NAME, string='for'):
        return self._for()
    if self.matches(TokenType.NAME, string='while'):
        return self._while()

    value = self._expression()
    if isinstance(value, (Name, GetItem, GetField, Dereference)) and self.consumed(TokenType.COLONEQUAL):
        value = Assignment(value, self._expression())
    else:
        value = ExpressionStatement(value)
    return value

Here what's going on. The last part is all about the difference between

student.grade;

and

student.grade := 10;

Yes, that's how we assign stuff to variables in Pascal - with the notorious walrus operator

We can't make a difference beforehand, because the expression can have a super long tail:

students[0].math.grades[1] := 10;

So we'll just start parsing it like a simple expression, then check the next token. If it's := - we've got an assignment, otherwise - it's a simple expression statement.

And, of course, assignment deserves its own node:

@unique
class Assignment:
    target: Name | GetItem | GetField | Dereference
    value: Any

Here we only allow assignment for certain expressions, because, say, 1 + 1 := 3 doesn't make a lot of sense

Variables

Variable definition in Pascal can only happen in a special block before the body, to complicate things, we are allowed to use several such blocks:

var x1, x2: integer;
    y: real;
var z: string;

so we'll have to do more checks than usual:

@composed(tuple)
def _variables(self):
    while self.consumed(TokenType.NAME, string='var'):
        while self.peek().string.lower() not in ('var', 'function', 'procedure', 'begin'):
            yield self._definition()

Here a definition is a set of names that share a type:

@unique
class Definitions:
    names: tuple[Name]
    type: str

and we parse it like so

def _definition(self):
    names = [Name(self.consume(TokenType.NAME).string)]
    while self.consumed(TokenType.COMMA):
        names.append(Name(self.consume(TokenType.NAME).string))
    self.consume(TokenType.COLON)
    kind = self._type()
    self.consume(TokenType.SEMI)
    return Definitions(tuple(names), kind)

_type for now is just

def _type(self):
    if self.peek().string.lower() not in ('real', 'integer', 'string'):
        raise ParseError(self.peek())
    return self.consume().string.lower()

We'll seriously upgrade it in the next post, though.

Note that I wrote kind = self._type() instead of type = self._type(). This is because type is a builtin name in Python. And builtins shadowing, in most situations, is considered as bad code.

Functions

We're almost there, I promise! Functions are pretty similar to the main program itself: they have the same blocks + arguments and a return type (which can be void):

@unique
class ArgDefinition:
    name: Name
    type: str

@unique
class Function:
    name: Name
    args: tuple[ArgDefinition]
    variables: tuple[Definitions]
    body: tuple[Any]
    return_type: str

Why am I using here Name instead of str as the function and arg names? That's because, unlike you, I know the future! In the next post we'll make heavy use of these names wrapped in Name.

All we need to parse a function is just

def _function(self):
    name, args, ret = self._prototype()
    variables = self._variables()
    body = self._body()
    self.consume(TokenType.SEMI)
    return Function(name, args, variables, body, ret)

Kinda anticlimactic, right? All the interesting part happens inside the prototype:

def _prototype(self):
    is_func = self.consumed(TokenType.NAME, string='function')
    if not is_func:
        self.consume(TokenType.NAME, string='procedure')

    name = Name(self.consume(TokenType.NAME).string)

    args = []
    if self.consumed(TokenType.LPAR):
        while not self.matches(TokenType.RPAR):
            mutable = self.consumed(TokenType.NAME, string='var')
            group = [Name(self.consume(TokenType.NAME).string)]
            while self.consumed(TokenType.COMMA):
                group.append(Name(self.consume(TokenType.NAME).string))
            self.consume(TokenType.COLON)
            kind = self._type()
            if mutable:
                kind = f'reference({kind})'
            args.extend(ArgDefinition(x, kind) for x in group)
            self.consumed(TokenType.COMMA, TokenType.SEMI)

        self.consume(TokenType.RPAR)

    if is_func:
        self.consume(TokenType.COLON)
        ret = self._type()
    else:
        ret = 'void'
    self.consume(TokenType.SEMI)
    return name, tuple(args), ret

Most of this should be pretty easy by now, except maybe the part with kind = f'reference({kind})'. What's going on here?

Pascal allows us to define mutable arguments in functions:

procedure inc(var x: integer);
begin
    x := x + 1;
end;

This function doesn't return anything, it just changes its argument inplace. All this thanks to that little var. For now I added this ugly crutch kind = f'reference({kind})', in the next post we'll fix it

Final program

With all the parts in place this is what the program will really look like:

@unique
class Program:
    variables: tuple[Definitions]
    functions: tuple[Function]
    body: tuple[Any]

and we'll parse it as simply as:

def _program(self):
    self.consume(TokenType.NAME, string='program')
    self.consume(TokenType.NAME)
    if self.consumed(TokenType.LPAR):
        self.consume(TokenType.NAME)
        self.consume(TokenType.RPAR)

    self.consume(TokenType.SEMI)

    variables = self._variables()
    functions = []
    while self.peek().string.lower() in ('function', 'procedure'):
        functions.append(self._function())

    statements = self._body()

    self.consume(TokenType.DOT)
    if self.tokens:
        raise ParseError(self.tokens.peek())

    return Program(variables, tuple(functions), statements)

This wasn't as sparse as the previous post, was it? Now, with all the bits in place, let's get to something even cooler: next time we'll build an entire type system, with compile-time errors and stuff!

Compiling Pascal with LLVM: Part 2

Parsing

Expressions

Literals

Names

Tails

Unary operators

Binary operators

Parenthesis

Statements

Program

The body

The statement

If

For and while

Assignments

Variables

Functions

Final program

Comments