Compiling Pascal with LLVM: Part 1
I've always wanted to learn LLVM, but I never felt there was a useful problem I could solve with it in my line of work. Eventually I decided to just have some fun and build something dumb and not useful at all. Yes, we're gonna compile Pascal! A language I last used something like 15 years ago.
This series of posts is heavily inspired by the brilliant book "Crafting Interpreters" by Bob Nystrom, as well as the official LLVM tutorial. If you're into parsers or compilers, you should definitely check them out!
This is the first post in a series of four.
And here you can view the final result!
Why Pascal?
There are two main reasons:
- Pascal is in a kind of sweet spot: it has a pretty simple grammar, so writing a parser would be fairly easy, but it has a lot of constructs not covered in the LLVM tutorial, like references, pointers, record types, function overloading, static typing and so on
- back in school my friend wrote a full-blown roguelike in Pascal, and it would be really cool to be able to compile it by myself. So yes, nostalgia plays a role in it, duh.
What you'll need
Everything is written in Python 3.11 with the llvmlite package.
You can find the (almost) full implementation here. It lacks some minor stuff, like subrange types, but at this point adding them is more about implementing a small interface than inventing something new.
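If you want to double-check your setup, here's a minimal smoke test (not part of the compiler itself, and the module and function names are arbitrary) that builds a trivial IR module with llvmlite and prints the textual LLVM IR:

from llvmlite import ir

# build an empty module with a `main` that just returns 0
module = ir.Module(name='smoke_test')
fn = ir.Function(module, ir.FunctionType(ir.IntType(32), []), name='main')
builder = ir.IRBuilder(fn.append_basic_block('entry'))
builder.ret(ir.Constant(ir.IntType(32), 0))

print(module)  # dumps the generated LLVM IR as text

If this prints something that looks like LLVM IR, you're good to go.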
Tokenization
The simplest part is tokenization. It's usually also the most boring one, at least for me, so I decided to piggyback on Python's built-in tokenize module. We'll just need to fix a few token types, and we're good to go:
import token as _token
import tokenize as tknz
from contextlib import suppress
from more_itertools import peekable
# a nice namespace for token types
TokenType = _token
FIX_EXACT = ';', '(', ')', ',', ':', ':=', '[', ']', '^', '@', '.'
def tokenize(text):
    def generator():
        for x in text.splitlines():
            # `generate_tokens` treats too many blank lines as "end of stream", so we'll patch that
            x = x.strip() or '//empty'
            yield x

    tokens = peekable(tknz.generate_tokens(generator().__next__))
    while tokens:
        token: tknz.TokenInfo = next(tokens)
        # ... patch stuff here
Here I'm patching blank lines to work around generate_tokens treating too many of them as the end of the stream, and I'm wrapping the stream of tokens in peekable. Even though I could simply gather all the tokens into a list beforehand, I like the idea that we'll need at most one token of lookahead to scan (and later parse) the whole grammar.
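In case you haven't used it, peekable is a thin wrapper that lets us look at the next item without consuming it. A tiny illustration with made-up tokens:

from more_itertools import peekable

stream = peekable(['program', 'Hello', ';'])
assert stream.peek() == 'program'  # look ahead without consuming
assert next(stream) == 'program'   # the item is still there for `next`
assert bool(stream)                # truthy while there are items left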
Now let's fix the results.
Exact token types
By default the tokens from FIX_EXACT are scanned as OP, but we'll need more granular control over them during parsing:
if token.string in FIX_EXACT:
    token = token._replace(type=_token.EXACT_TOKEN_TYPES[token.string])
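For reference, EXACT_TOKEN_TYPES is a plain dict in the standard token module that maps operator strings to their dedicated token types (the := entry exists since Python 3.8). A couple of spot checks, just as an illustration:

import token

# ';' has its own SEMI type, ':=' is COLONEQUAL, '.' is DOT, and so on
assert token.EXACT_TOKEN_TYPES[';'] == token.SEMI
assert token.EXACT_TOKEN_TYPES[':='] == token.COLONEQUAL
assert token.EXACT_TOKEN_TYPES['.'] == token.DOT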
Comments
Single-line comments begin with //:
# consume the comment
if token.string == '//':
    start = token.start[0]
    with suppress(StopIteration):
        while tokens.peek().start[0] == start:
            next(tokens)
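This works because Python already scans // as a single operator token (it's the floor-division operator), so comparing token.string against '//' is enough. A quick check with the stock tokenizer, purely as an illustration:

import io
import tokenize as tknz

toks = [t.string for t in tknz.generate_tokens(io.StringIO('x := 1 // a comment').readline)]
# the '//' arrives as one token, followed by the words of the "comment"
print(toks)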
Multi-line comments are wrapped in curly braces:
# and the multiline comment
elif token.string == '{':
    nesting = 1
    try:
        while nesting > 0:
            token = next(tokens)
            while token.string != '}':
                if token.string == '{':
                    nesting += 1
                token = next(tokens)
            nesting -= 1
    except StopIteration:
        raise SyntaxError('Unmatched "{"') from None
Here I added support for nested comments. I'm not sure if Pascal originally supported them, but some implementations do, and it's pretty straightforward anyway.
In both cases we just consume tokens until we reach the end of the comment.
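To convince yourself the nesting bookkeeping is right, here's the same loop rerun over a plain list of strings, detached from the real token stream (the helper name and input are made up; the list mimics what follows an opening brace):

def consume_comment(parts):
    it = iter(parts)
    nesting = 1  # we've already seen the opening '{'
    while nesting > 0:
        part = next(it)            # StopIteration here means an unmatched '{'
        while part != '}':
            if part == '{':
                nesting += 1
            part = next(it)
        nesting -= 1
    return list(it)                # whatever comes after the comment

# a nested comment followed by real code
print(consume_comment(['note', '{', 'nested', '}', 'done', '}', 'x', ':=', '1']))
# -> ['x', ':=', '1']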
Various garbage
Pascal doesn't care about indentation, so:
elif token.type in (
    TokenType.INDENT, TokenType.DEDENT, TokenType.ENCODING, TokenType.ENDMARKER, TokenType.NEWLINE
):
    # do nothing
    pass
The inequality operator
The strangest part is the inequality operator, which in Pascal is spelled <>:
# fix the `<>` operator
elif token.string == '<' and tokens and tokens.peek().string == '>':
    # consume the second half
    next(tokens)
    yield token._replace(string='<>')
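The reason for the peek: on Python 3.11 (which this series uses), the stock tokenizer has no <> operator, so it shows up as two separate OP tokens. An illustration:

import io
import tokenize as tknz

toks = [t.string for t in tknz.generate_tokens(io.StringIO('a <> b').readline)]
# the interesting prefix is ['a', '<', '>', 'b'], exactly what the branch above glues back together
print(toks)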
Ranges
Finally, in Pascal you can declare ranges of numbers, such as 1..10. Python would tokenize that as 1. (the float 1.0) and .10 (the float 0.1). Another patch to the rescue:
# unpack floats
elif token.type == TokenType.NUMBER and (token.string.startswith('.') or token.string.endswith('.')):
    body = token.string
    split = (
        token._replace(string='.', type=TokenType.DOT),
        token._replace(string=body.strip('.')),
    )
    if body.startswith('.'):
        yield from split
    else:
        yield from split[::-1]
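You can see the mangling for yourself by asking the stock tokenizer what it makes of a range literal (illustration only):

import io
import tokenize as tknz

nums = [t.string for t in tknz.generate_tokens(io.StringIO('1..10').readline)
        if t.type == tknz.NUMBER]
print(nums)  # ['1.', '.10'] -- two bogus floats instead of 1, .., 10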
The rest
And that's it: all other tokens are passed through unchanged:
else:
    yield token
And here you can view the full code for tokenization.
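If you want to poke at it, here's a rough usage sketch, assuming the snippets above are assembled into the tokenize generator from the beginning of the section; the Pascal program itself is just a made-up example:

source = '''
program Hello;
{ a comment, with { nesting } }
var x: integer;
begin
    x := 1; // set x
    writeln(x)
end.
'''

for tok in tokenize(source):
    print(TokenType.tok_name[tok.type], repr(tok.string))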
In the next post we'll deal with parsing. That one won't be a 4-minute read.