Blog

Welcome to my blog! Here I mostly write about things that I build in my spare time.

Learning a new language with LLMs

I love learning new languages, both programming and natural ones. It feels a bit magical when you realise how words in different languages are related and why they sound the way they do.

It's been a few months since I started learning European Portuguese. I really like it so far, especially how it sounds. Currently, I'm at a point where I can confidently read and understand almost any text and, more or less, understand what native speakers are telling me. My main problem now is my vocabulary: when I want to say or write something, it takes forever to recall some words.

It's pretty funny, because sometimes I just freeze trying to remember a word, and the person I'm talking to assumes I didn't understand anything and quickly switches to English. Yes, they say that practice is the best way to learn a new language, but how long will it take if the first steps are so hard?

Recently I stumbled across the TinyStories paper. It's a great read, check it out! In brief, they generated ~2.7 million stories using the typical vocabulary of a 3-4 year old. They then trained several super small language models (I'm talking 4-6 orders of magnitude smaller than GPT-4 or Llama 2) on these stories and showed that models with around 30M parameters are not only fluent in this restricted subset of English, but also show signs of basic reasoning about the text's content.

Hey, if a model with just 30M parameters can do it, then I can too!

Then it hit me. This is a great starting point for growing my vocabulary. I'll translate random texts from TinyStories into European Portuguese and then fix the mistakes I made. Skipping ahead, this works even better than I expected! I can literally feel myself memorising new words and constructs. Of course, this effect will saturate over time, but for now, this is the fastest I've ever learned a new language!

I ended up writing a tool that simplifies most of the boilerplate. Its frontend is written in Angular, the backend is FastAPI and, of course, I use ChatGPT to fix my translations.

As always, here you can find the full code. And here is a live version which can be used in two modes:

  1. Static - just a static page with a small subset of the TinyStories (~3k). It will generate a prompt that you'll have to manually paste into the chat with ChatGPT. It's not 100% automatic, but it's free, and I tried to simplify the process as much as I could.
  2. Automatic - you have access to all ~2.7 million stories and all the checks are made through API calls to ChatGPT, but you'll need an API token for that, and most importantly, you'll need to trust me that I won't steal it! Sadly, OpenAI's API doesn't support CORS, so I have to route all the requests through a self-hosted proxy server.
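
As a rough illustration, the prompt for static mode could be put together like this. It is a minimal sketch, not the code from the repository: the function name `build_check_prompt` and the prompt wording are assumptions made up for this example.

```python
def build_check_prompt(
    story: str,
    translation: str,
    target_language: str = "European Portuguese",
) -> str:
    """Build a prompt asking a chat model to correct a translation.

    Hypothetical helper: the wording below is illustrative only.
    """
    return (
        f"I am learning {target_language}. Below is an English text and my "
        f"translation of it into {target_language}.\n"
        "List each mistake in my translation, explain it briefly, "
        "and then give a corrected version.\n\n"
        f"English text:\n{story.strip()}\n\n"
        f"My translation:\n{translation.strip()}\n"
    )


# Example: the resulting string is what you would paste into the chat.
prompt = build_check_prompt(
    "The little dog was happy.",
    "O cão pequeno estava feliz.",
)
print(prompt)
```

Keeping the story and the translation in clearly labelled sections makes it easier for the model to point at specific mistakes.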

You can use the string "free" as the token, in which case it will be replaced with my personal token server-side. It's limited to 1 request per 10 minutes because I don't want to go broke, but I hope that's enough. Enjoy!
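
The token swap and the rate limit could look roughly like the following. This is a plain-Python sketch under several assumptions, not the actual proxy code: `FreeTierLimiter`, `resolve_token`, the per-client keying, and the in-memory dictionary are all hypothetical. Only the "free" placeholder and the 10-minute window come from the description above.

```python
import time

FREE_TOKEN = "free"        # placeholder string users can send instead of a real token
WINDOW_SECONDS = 10 * 60   # one request per 10 minutes


class FreeTierLimiter:
    """Allow at most one request per window for each client (hypothetical)."""

    def __init__(self, window: float = WINDOW_SECONDS, clock=time.monotonic):
        self._window = window
        self._clock = clock  # injectable so the limiter can be tested without waiting
        self._last_seen: dict[str, float] = {}

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        last = self._last_seen.get(client_id)
        if last is not None and now - last < self._window:
            return False  # still inside the window since the last accepted request
        self._last_seen[client_id] = now
        return True


def resolve_token(
    token: str, client_id: str, server_token: str, limiter: FreeTierLimiter
) -> str:
    """Return the token to forward upstream, swapping in the server-side one for "free"."""
    if token != FREE_TOKEN:
        return token  # the user supplied their own token, pass it through untouched
    if not limiter.allow(client_id):
        raise PermissionError("free tier: one request per 10 minutes")
    return server_token
```

An in-memory dictionary is the simplest option, but it resets on restart and does not work across multiple server processes, so a real deployment might prefer a shared store.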

Compiling Pascal with LLVM: Part 1

I've always wanted to learn LLVM, but I never felt there were any useful problems I could solve with it in my line of work. Eventually I decided to just have some fun and make something dumb and not useful at all. Yes, we're gonna compile Pascal! A language that I last used something like 15 years ago.

This series of posts is heavily inspired by the brilliant book "Crafting Interpreters" by Bob Nystrom as well as the official LLVM tutorial. If you're into parsers or compilers, you should definitely check them out!

This is a series of four posts:

  1. Tokenization
  2. Parsing
  3. Typing
  4. Compilation
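
To give a feel for the first stage, here is a minimal regex-based tokenizer for a tiny slice of Pascal. It is a sketch under assumptions, not the implementation from the series: the token kinds, the keyword set, and the `tokenize` function are hypothetical, and real Pascal needs much more (strings, comments, real numbers, and so on).

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# A deliberately tiny keyword set, just for the example.
KEYWORDS = {"program", "var", "begin", "end", "integer"}

# Alternatives are tried left to right, so ":=" is listed before ":" and "=".
TOKEN_RE = re.compile(
    r"""
    (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<OP>:=|[+\-*/=<>();:,.])
  | (?P<SKIP>\s+)
  | (?P<ERROR>.)
    """,
    re.VERBOSE,
)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens from the source text, skipping whitespace."""
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(
                f"unexpected character {text!r} at offset {match.start()}"
            )
        if kind == "IDENT" and text.lower() in KEYWORDS:
            kind = "KEYWORD"  # Pascal keywords are case-insensitive
        yield Token(kind, text)


tokens = list(tokenize("begin x := 42; end."))
for token in tokens:
    print(token.kind, token.text)
```

The catch-all `ERROR` group means an unknown character raises an error instead of being silently skipped by `finditer`.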

And here you can view the final result!

Why Pascal?

There are two main reasons:

  • Pascal is in a kind of sweet spot: it has a pretty simple grammar, so writing a parser is fairly easy, but it has a lot of constructs not covered in the LLVM tutorial, like references, pointers, record types, function overloading, static typing and so on
  • back in school my friend wrote a full-blown roguelike in Pascal, and it would be really cool to be able to compile it by myself. So yes, nostalgia plays a role in it, duh.

What you'll need

Everything is written in Python 3.11 with the llvmlite package. You can find the (almost) full implementation here. It lacks some minor stuff, like subrange types, but at this point adding them is more about implementing a small interface than inventing something new.

Feel free to open an issue or PR if you want to contribute in any way!