Learning a new language with LLMs
I love learning new languages, both programming and natural ones. It feels a bit magical when you realise how words in different languages are related and why they sound the way they do.
It's been a few months since I started learning European Portuguese. I really like it so far, especially how it sounds. I'm now at a point where I can confidently read and understand almost any text and, more or less, understand what native speakers are telling me. My main problem is vocabulary: when I want to say or write something, it takes forever to recall some words.
It's pretty funny, because sometimes I just freeze trying to remember a word, and the person I'm talking to assumes I didn't understand anything and quickly switches to English. Yes, they say that practice is the best way to learn a new language, but how long will it take if the first steps are so hard?
Recently I stumbled across the TinyStories paper. It's a great read, check it out! In brief, the authors generated ~2.7 million stories using the typical vocabulary of a 3-4 year old. They then trained several super small language models (I'm talking 4-6 orders of magnitude smaller than GPT-4 or Llama 2) on these stories and showed that models with around 30M parameters are not only fluent in this restricted subset of English, but also show signs of basic reasoning about the text's content.
Hey, if a model with just 30M parameters can do it, then I can too!
Then it hit me: this is a great starting point for growing my vocabulary. I translate random texts from TinyStories into European Portuguese and then fix the mistakes I made. Skipping ahead, this works even better than I expected! I can literally feel myself memorising new words and constructs. Of course, this effect will saturate over time, but for now, this is the fastest I've ever learned a new language!
I ended up writing a tool that takes care of most of the boilerplate. Its frontend is written in Angular, the backend uses FastAPI and, of course, I use ChatGPT to check and fix my translations.
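To give an idea of what "asking ChatGPT to fix a translation" can look like, here is a minimal sketch of a prompt builder. It is not the tool's actual code: the function name `build_correction_prompt` and the prompt wording are made up for illustration.

```python
def build_correction_prompt(
    story: str,
    translation: str,
    target_language: str = "European Portuguese",
) -> str:
    """Assemble a prompt asking a chat model to review a learner's translation.

    Hypothetical helper: the wording below is an assumption, not the
    prompt used by the real tool.
    """
    return (
        f"I am learning {target_language}. Below is a short English story "
        "and my translation of it.\n"
        "List every mistake in my translation, explain each one briefly, "
        "and then give a corrected version.\n\n"
        f"English original:\n{story.strip()}\n\n"
        f"My translation:\n{translation.strip()}\n"
    )


if __name__ == "__main__":
    # Print an example prompt that could be pasted into a chat window.
    print(
        build_correction_prompt(
            "Lily has a red ball. She plays with it in the park.",
            "A Lily tem uma bola vermelha. Ela joga com ela no parque.",
        )
    )
```

The output is plain text, so it can either be pasted into a chat by hand or sent through an API call.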
As always, here you can find the full code. And here is a live version which can be used in two modes:
- Static - just a static page with a small subset of TinyStories (~3k stories). It will generate a prompt that you'll have to manually paste into the chat with ChatGPT. It's not 100% automatic, but it's free, and I tried to simplify the process as much as I could.
- Automatic - you have access to all ~2.7 million stories and all the checks are made through API calls to ChatGPT, but you'll need an API token for that and, most importantly, you'll need to trust me that I won't steal it! Sadly, OpenAI's API doesn't support CORS, so I have to route all the requests through a self-hosted proxy server.
You can use the string "free" as the token, in which case it will be replaced with my personal token server-side.
It's limited to 1 request per 10 minutes because I don't want to go broke, but I hope that's enough.
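For the curious, the "free" token logic can be sketched roughly like this. This is an illustration under assumptions, not the proxy's real code: the class name `FreeTokenGate`, the per-client tracking (keyed by something like an IP address), and the in-memory dictionary are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional

FREE_TOKEN = "free"            # placeholder string sent by the client
FREE_WINDOW_SECONDS = 10 * 60  # 1 free request per 10 minutes


class FreeTokenGate:
    """Swap the "free" placeholder for a server-side token, with a rate limit.

    Assumption: the limit is tracked per client id and kept in memory,
    so it resets whenever the process restarts.
    """

    def __init__(
        self,
        server_token: str,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._server_token = server_token
        self._clock = clock
        self._last_use: Dict[str, float] = {}

    def resolve(self, client_id: str, token: str) -> Optional[str]:
        """Return the token to forward upstream, or None if rate limited."""
        if token != FREE_TOKEN:
            # The user supplied their own token: pass it through untouched.
            return token
        now = self._clock()
        last = self._last_use.get(client_id)
        if last is not None and now - last < FREE_WINDOW_SECONDS:
            return None  # still inside the 10-minute window
        self._last_use[client_id] = now
        return self._server_token
```

A proxy endpoint would call `resolve` before forwarding a request and answer with something like HTTP 429 when it returns `None`. The clock is injectable only to make the window easy to test.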
Enjoy!