
Regina = King - Man + Femme: Revisiting My 2016 Master's Project in the Age of LLMs

A retrospective on implementing the Trans-gram paper for cross-lingual word embeddings in 2016, and how the landscape of NLP has evolved from geometric alignment to trillion-parameter LLMs.

NLP
Machine Learning
Word Embeddings
Trans-gram
LLM
Cross-lingual
Word2Vec
Sentiment Analysis

I was recently revisiting the code for my Master's project from 2016, which centered around a paper by Coulmance et al. titled "Trans-gram, Fast Cross-lingual Word-embeddings".

It was 2016. I was a master's student and this paper was my world.

Back then, the NLP landscape was a different planet. "Attention" was something you paid, not a mechanism. BERT didn't exist. And the idea that a computer could understand twenty-one languages simultaneously required clever geometry, not just throwing a trillion parameters at the problem.

Looking back at my implementation of Trans-gram nine years later, I'm struck by how much we've gained in performance, and by how much "interpretability" we've traded away to get there.

A note on authorship: While I used a writing assistant to help structure and polish this post, every technical detail, insight and memory reflects my own experience. I meticulously guided the content to ensure it accurately represents the work I did and the lessons I learned.

The Internship Mission: Zero-Shot Sentiment

My supervisor, a data scientist tackling global social media data, presented me with a specific, thorny problem. He needed to analyze sentiment in brand mentions across dozens of languages (French, German, Spanish, and more), but we only had reliable, labeled training data for English.

He handed me the Trans-gram paper with a challenge: "Could we train a sentiment model on English and deploy it on Spanish without a single labeled Spanish example?"

The hypothesis was elegant in its simplicity. If we could mathematically align the vector spaces of different languages, a classifier learning that the vector for "excellent" (English) predicts positive sentiment would automatically trigger for "excelente" (Spanish), simply because they would occupy the exact same coordinate in the high-dimensional space.

The 2016 Context: The Era of Vector Arithmetic

To understand why this was cutting-edge, recall the NLP landscape of 2016: Word2Vec dominated, and the idea of aligning languages without word-level supervision was still novel. The equation $\text{King} - \text{Man} + \text{Woman} = \text{Queen}$ wasn't just a meme; it was proof that meaning could be mapped to a universal coordinate system.

The holy grail was cross-lingual alignment. If we could map English words to a vector space and French words to a different vector space, could we rotate them so they overlapped?

Most methods at the time required expensive "word alignments" (knowing exactly which English word mapped to which French word). Then my supervisor pointed me to Trans-gram.

The Elegant Efficiency of Trans-gram

The paper proposed something beautifully simple (and computationally cheap). Instead of needing precise word-to-word dictionaries, it used sentence alignments.

The logic, based on the Skip-gram model, relied on a bold but effective assumption: The meaning of a word is uniformly distributed across the translated sentence.

If I wanted to train an embedding for the English word "cat" using a French translation, I didn't need to know it mapped to "chat". I just told the model: "The English word 'cat' should predict every single word in the corresponding French sentence 'Le chat est sur le tapis'."

Over millions of updates, the noise cancelled out. "Cat" would co-occur frequently with "chat" and rarely with "tapis" and the vectors would naturally align.
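
To make that training signal concrete, here is a minimal sketch of the idea in Python. This is a toy reconstruction, not my original code: the tiny vocabularies, the `transgram_update` helper, and the hyperparameters are all illustrative, and the real model also keeps the monolingual Skip-gram losses and trains in both directions.

```python
import numpy as np

# Toy setup: tiny vocabularies and 40-dimensional vectors, as in the paper.
# Vocabularies, corpus and helper names are illustrative, not the original code.
rng = np.random.default_rng(0)
DIM = 40
en_vocab = {"the": 0, "cat": 1, "is": 2, "on": 3, "mat": 4}
fr_vocab = {"le": 0, "chat": 1, "est": 2, "sur": 3, "tapis": 4}
en_vecs = rng.normal(scale=0.1, size=(len(en_vocab), DIM))  # English word vectors
fr_ctx = rng.normal(scale=0.1, size=(len(fr_vocab), DIM))   # French context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transgram_update(en_word, fr_sentence, lr=0.05, negatives=2):
    """One Trans-gram-style step: the English word must predict every word
    of the aligned French sentence (positives), plus a few random French
    words as negatives (standard negative sampling)."""
    w = en_vocab[en_word]
    for fr_word in fr_sentence:
        targets = [(fr_vocab[fr_word], 1.0)]
        targets += [(int(rng.integers(len(fr_vocab))), 0.0) for _ in range(negatives)]
        for c, label in targets:
            score = sigmoid(en_vecs[w] @ fr_ctx[c])
            grad = score - label
            ctx_step = grad * en_vecs[w]
            word_step = grad * fr_ctx[c]
            fr_ctx[c] -= lr * ctx_step
            en_vecs[w] -= lr * word_step

# "cat" is pushed towards every word of "Le chat est sur le tapis"; over many
# sentences the frequent pair (cat, chat) dominates and the vectors align.
for _ in range(200):
    transgram_update("cat", ["le", "chat", "est", "sur", "le", "tapis"])
```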

What I Built (And How Fast It Was)

Implementing this for my master's project was an exercise in efficiency. The paper boasted that it could align 21 languages (using 40-dimensional vectors) with English as a pivot in just 2.5 hours on a standard 6-core CPU.

I replicated this "Star Topology," aligning 20 other Europarl languages (French, German, Czech, Finnish, etc.) to English simultaneously.

The results were tangible: for the first time, I could inspect the dimensions of cross-lingual vectors and see the math work in practice. The paper prominently featured this cross-lingual identity in its header:

$$\text{vector}(\text{rey})_\text{es} - \text{vector}(\text{Mann})_\text{de} = \text{vector}(\text{regina})_\text{it} - \text{vector}(\text{femme})_\text{fr}$$

To verify this in my console, I rearranged the terms to solve for the missing Italian word:

$$\text{vector}(\text{rey})_\text{es} - \text{vector}(\text{Mann})_\text{de} + \text{vector}(\text{femme})_\text{fr}$$

And watched the nearest neighbor search spit out: Regina (Italian for Queen). We were doing algebra across four different languages at once. This worked because Trans-gram's loss function (Equation 2 in the paper) ensured that semantically similar words across languages, like "excellent" and "excelente", converged to similar vectors, even without explicit supervision.
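
Here is roughly what that console check looked like, reconstructed rather than copied from the original script: the `load_vectors` loader, the `transgram_{lang}.vec` file names, and the nearest-neighbor helper are assumptions, standing in for word2vec-style text files of the aligned 40-dimensional vectors.

```python
import numpy as np

def load_vectors(path):
    """Load a word2vec-style text file: one 'word v1 v2 ... v40' line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def nearest_neighbors(query, vocab_vectors, k=5):
    """Return the k words whose vectors are closest to `query` by cosine similarity."""
    words = list(vocab_vectors)
    mat = np.stack([vocab_vectors[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = mat @ q
    return [(words[i], float(sims[i])) for i in np.argsort(-sims)[:k]]

# One dictionary of aligned vectors per language (file names are placeholders).
embeddings = {lang: load_vectors(f"transgram_{lang}.vec")
              for lang in ("es", "de", "fr", "it")}

# rey (es) - Mann (de) + femme (fr), searched in the Italian vocabulary:
query = (embeddings["es"]["rey"]
         - embeddings["de"]["Mann"]
         + embeddings["fr"]["femme"])
print(nearest_neighbors(query, embeddings["it"]))  # "regina" should rank near the top
```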

The Magic of the Pivot: Feature Transfer

This alignment was crucial for my supervisor's sentiment analysis goal. The most fascinating part of the paper (Section 6.2) was the transfer of linguistic features.

English is a morphologically simple language; we just say "eat." French and Italian are complex; they have specific conjugations for "we eat" (mangeons / mangiamo) or "to eat" (manger / mangiare).

Since I was using English as the pivot, you would expect that nuance to be lost. If French maps to English and Italian maps to English, the "link" is the simple English word "eat."

But that's not what happened.

The Trans-gram model aligned conjugations like mangeons and mangiamo despite English lacking those distinctions. This suggested the optimization preserved latent linguistic structure, a critical insight for my sentiment analysis task, where emotional nuance often hinges on verb forms.

For my project, this was the key takeaway: zero-shot sentiment analysis was a task the Trans-gram authors didn't explore, but one their method made possible. If the model could align complex grammar through a simple pivot, it could certainly align the subtle gradients of emotion needed for accurate classification.

Once we had these aligned embeddings, we put them to the test with our sentiment analysis model. We swapped the English vectors for the aligned Spanish and French ones and ran the classifier on the foreign-language data. The English-trained model correctly predicted sentiment in Spanish and French, purely because the underlying geometry of "good" and "bad" had been aligned during embedding.
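
A minimal sketch of that swap, under a few assumptions: `embeddings` holds per-language dictionaries of aligned vectors (as in the sketch above, here including English), scikit-learn's logistic regression stands in for whatever classifier we actually used, and the tiny `en_labeled` / `es_sentences` lists are placeholders for the real labeled English data and unlabeled Spanish data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# embeddings: per-language {word: 40-d vector} dicts of Trans-gram-aligned
# vectors, loaded as in the previous sketch (here including "en" and "es").

def sentence_vector(tokens, vecs, dim=40):
    """Average the aligned word vectors of a tokenized sentence; words missing
    from the vocabulary are simply skipped."""
    found = [vecs[t] for t in tokens if t in vecs]
    return np.mean(found, axis=0) if found else np.zeros(dim)

# Placeholder data just to make the sketch concrete; the real project used
# labeled English social-media data and unlabeled Spanish mentions.
en_labeled = [(["great", "service"], 1), (["terrible", "movie"], 0)]
es_sentences = [["servicio", "excelente"], ["película", "terrible"]]

# Train on English sentence vectors only.
X_train = np.stack([sentence_vector(toks, embeddings["en"]) for toks, _ in en_labeled])
y_train = np.array([label for _, label in en_labeled])
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Zero-shot: embed Spanish text with the *aligned* Spanish vectors and reuse
# the English-trained classifier unchanged.
X_es = np.stack([sentence_vector(toks, embeddings["es"]) for toks in es_sentences])
predictions = clf.predict(X_es)
```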

2016 vs. 2025: The Death of Explicit Alignment

Revisiting this code today highlighted three massive shifts in how we approach NLP.

1. From Geometry to Probability

In 2016, cross-lingual NLP was a geometry problem. We were trying to rotate distinct vector spaces until they clicked together. Today: It's a probability problem. LLMs don't "align" languages explicitly. They consume so much data that the alignment is an emergent property of the model's vast internal state. We don't calculate cosine similarities anymore; we just prompt: "Translate this to Swahili."

2. From Word-Level to Sub-Word Soup

My Trans-gram implementation had a vocabulary list; `train_fr` was a specific entry. Today: Tokenization (BPE, SentencePiece) has killed the "word." We now embed fragments. Static word vectors feel quaint in an era of dynamic, context-aware embeddings, but the core idea of distributed representations endures. Sub-word tokenization also solves the "Out of Vocabulary" problem we battled constantly in 2016, especially with social media data. Back then, casual language, typos, or invented slang like "yasss" or "loool" became unknown `[UNK]` tokens that broke the model. Today's sub-word units handle this messy reality effortlessly, but they also make the model much harder to interpret. You can't ask GPT-4 for the vector of "cat"; you get a sequence of context-dependent states.
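
A small illustration of the contrast, with an invented toy vocabulary on the 2016 side and GPT-2's BPE tokenizer (via Hugging Face transformers) on the modern side; the exact sub-word splits depend on the tokenizer's learned merges.

```python
from transformers import AutoTokenizer

# 2016-style fixed word vocabulary (toy example): anything unseen collapses
# to a single [UNK] token whose vector carries no information.
word_vocab = {"the", "service", "was", "good"}

def word_level(tokens):
    return [t if t in word_vocab else "[UNK]" for t in tokens]

print(word_level("yasss the service was loool good".split()))
# -> ['[UNK]', 'the', 'service', 'was', '[UNK]', 'good']

# Modern BPE (GPT-2's tokenizer): no unknowns, just sub-word fragments whose
# exact splits depend on the learned merge table.
bpe = AutoTokenizer.from_pretrained("gpt2")
print(bpe.tokenize("yasss the service was loool good"))
```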

3. Efficiency vs. Scale

My master's project ran on a laptop and finished before lunch. Today: Training a state-of-the-art multilingual model requires a cluster of H100s and enough electricity to power a small town. We have traded the elegance of 40-dimensional efficiency for the brute-force capability of trillion-parameter scale.

Conclusion

There is a specific joy in the "Small Data" era of 2016 that I miss. There was a tactile feeling to training the Trans-gram model, watching the loss curve dip and knowing exactly why the vectors were moving.

Modern LLMs are objectively better. They handle nuance, context, and syntax in ways my simple Skip-gram model never could. But Trans-gram proved that you don't always need massive compute to find meaning. Sometimes, you just need a clever assumption about how languages connect.

As I dive into the world of Large Language Models and tackle these new NLP challenges, I appreciate the logic of the Trans-gram model for what it was: a bridge. It bridged languages using simple math, and it bridged my understanding from student to engineer.

Most importantly, it reinforced a lesson ingrained in me during my preparatory classes and math studies: we should always aim for simplicity, as it is the hallmark of a truly intelligent solution. That drive for simplicity was always anchored in real business impact, since the goal was to improve sentiment analysis. Trans-gram didn't need a trillion parameters to be smart; it just needed the right geometric assumption.

$$\text{Good}_\text{en} - \text{Bye}_\text{en} + \text{Au Revoir}_\text{fr} = \text{See you later.}$$