Compression is Generalisation, Generalisation is Intelligence - Unsupervised Learning in Large Language Models

The predictive power of compression in large language models

Nov 22, 2024

This article explores what the Transformer Model architecture has enabled — the magic of unsupervised learning, which gives capabilities like reasoning and generalization bordering on human-level understanding to large LLMs.

Ilya Sutskever, the CTO and the man behind the GPT fame, arguably, one of the smartest men when it comes to LLMs, talks about Unsupervised Learning as the most important thing in Machine Learning, essentially the Holy Grail.

At OpenAI .. the hope was that if you have a neural network that can predict the next word, it’ll solve unsupervised learning. So back before the GPTs, unsupervised learning was considered to be the Holy Grail of machine learning.…

What he means here is not about self-supervised learning, that is the idea of using the next word as the implicit label in training, instead, he alludes to something far more profound and transformative, hinting at the deeper potential of unsupervised learning to unlock new frontiers in AI. He continues on how it was finally solved using the Transformer Neural Network.

But our neural networks were not up for the task. We were using recurrent neural networks. When the transformer came out, literally as soon as the paper came out, literally the next day, it was clear to me, to us, that transformers addressed the limitations of recurrent neural networks, of learning long-term dependencies. …And that’s what led to eventually GPT-3 and essentially where we are today. -Ilya Sutskever

The excitement, optimism, and wide-ranging applications surrounding Generative AI stem from the remarkable capabilities of Large Language Models (LLMs). These models, trained on trillions of word tokens, have evolved to do much more than predict the next word. They have learned to model the world itself, enabling them to generalize across various tasks. This remarkable ability is a product of unsupervised learning.

In this article, I aim to explore this phenomenon through the lens of compression theory and Kolmogorov complexity. While alternative interpretations exist, such as Claude Shannon’s Information and Coding Theory (focusing on entropy and information content) or the PAC-Bayes framework (connecting generalization to probability density functions via Kullback-Leibler divergence), these approaches are mathematically dense and less intuitive. Compression theory, on the other hand, provides a more intuitive and relatable framework for understanding how these models achieve their remarkable results.

Beyond Statistical Probability Distribution Calculators

Large Language Models (LLMs) are often regarded as statistical systems that compute probability distributions over their vocabulary, predicting the next token based on context using their learned attention mechanisms.

This is how they work, and this fact is so commonplace that most people and many experts still dismiss all capabilities of LLMs as just a glorified probability density function picker. This was especially true of the older and smaller models like GPT-2. But as the scale of the models started to increase exponentially along with the training data, emergent properties of generalization and reasoning happened which cannot be explained away without unsupervised learning.

The ability of these later and larger models GPT3/4, Gemini, and LLAMA2/3, to generate creative, reasoned, and contextually rich responses goes beyond mere calculations of probability density functions. These abilities stem from the emergent properties of their training process — properties that we might liken to human-like reasoning and understanding. There are multiple papers regarding this also from AI researchers. Still, the subject is usually discussed emprically and evidence is also highly subjective that it is still debated.

It is true that LLMs, such as GPTs and LLaMAs, operate by predicting the next token using probabilities derived from their auto-regressive architecture. But focusing solely on this mechanism by which and only by which they can generate output — akin to how vocal cords generate human speech — misses the essence of their capabilities. Understanding the mechanics of vocal cords does not fully explain the meaning or creativity of the words spoken. Similarly, the real intrigue lies not in the probabilities themselves but in how the model determines, from an infinite range of possibilities, the most contextually appropriate probabilities.

To explore this, we aim to investigate the technical underpinnings of these emergent properties, even if they are often described with inherently ambiguous terms like “reasoning” and “understanding.”

Supervised learning

Supervised Learning is simple: it is learning from labelled data. The best example of one of the largest labelled datasets is ImagNet. The huge amount of labelled image data collected as part of the ImageNet project helped AlexNet introduced in 2012 (Ilya Sutskever, Alex Krizhevsky and Geoffrey Hinton) revolutionised computer vision and started the Deep Neural Network revolution.

Supervised Learning works well. This is the bedrock of almost all machine learning used in production today.
“All the impressive achievements of deep learning amount to just curve fitting..”. The Truing Award winner Judea Pearl critique of AI/ML system. He is the one associated with formalising Causal reasoning.

Given enough labelled data, AI/ML systems will learn to approximate any complex multivariate functions. They are excellent universal function approximators. They become very good feature extractors of the data, associating features to the target label.

Limitations of Supervised Learning

The problem with Supervised Learning is that labelling huge amounts of datasets for training models needs a lot of time and effort.

The larger issue is that this approach doesn’t align with how humans learn, and it’s impractical — if not impossible — to label every possible intent, intention, or generalization.

The limitations of supervised learning become evident when considering an analogy: a system trained on labelled images of dogs and cats, even using the most advanced Convolutional Neural Network, will struggle to generalize and associate both categories under the broader concept of “animal,” even if explicitly labelled. This highlights the inherent challenge of capturing higher-level abstractions through supervised methods alone.

Unsupervised Learning in ML

Continuing from the understanding and usage of Supervised learning, many inadvertently associate unsupervised learning with Clustering via algorithms like K-means clustering or PCM. However, in the case of Neural Networks, there is universally a cost or loss function to optimise via computing the difference between the expected value and generated value. So there always needs to be an expected or target value. Algorithms like Clustering, Principal Component Analysis etc are classic ML algorithms and have nothing to do with Neural Network Learning.

Self-supervised Learning

Self-supervised learning (SSL) is a type of machine learning where the model learns from unlabeled data by generating its own labels from the data itself.

Assume that the Neural Network has to predict the preceding sentence in this passage. It can be given the first word “Those” and ask it to predict anything next. Say it predicted “monks”, but the expected is “with”. So using this concept the loss function can be computed with expected labels, without explicitly having to label the data, backpropagated and weights of attention layers as well as the final feed-forward layers adjusted and learned.

Self-supervised Learning unlocks Training Data scale

What this truly unlocks is the whole fifteen trillion token or more of the World Wide Web, including code in Github, blogs, Quora, articles, and anything without a paywall are legitimate “training sets” — https://www.educatingsilicon.com/2024/05/09/how-much-llm-training-data-is-there-in-the-limit/

Scale is Important

There is a big difference with scale. Both the scale of the Network (number of trainable parameters) and the scale of training data.

source:https://research.google/blog/pathways-language-model-palm-scaling-to-540-billion-parameters-for-breakthrough-performance/

As the scale of the model increases, the performance improves across tasks while also unlocking new capabilities

A tiny GPT can be trained on a few words from a book. It will then be able to predict similar or exact as per the training data the next word, given the preceding word. This is self-supervised learning but not Unsupervised Learning. That is the model is unable to understand the structure or knowledge embedded in the data. For an example see this colab where I train gpt2 on a medical textbook and it overfits the data and predicts the next words, but cannot understand or answer questions -

Colab — GPT2 Overfitting- Training a small model

It is as limited as the computer vision model trained on Cat and Dog pictures, in its limitation to Generalise the concept of say Animal or the world model in which those images are.

The key to understanding Unsupervised learning is this intuition that the model or the Network learns much more than predicting the next word; it learns a world model and understands the meanings in the words.

It becomes much more than a Stochastic Parrot or a Statistical Probability Density Function picker like these papers imply — On the Dangers of Stochastic Parrots: Can Language Models Be Too Big -Bender et al [1], GPT-3: Its Nature, Scope, Limits, and Consequences — Floridi, Chiriatti[2]

In the paper Playing Games with AIs: The Limits of GPT-3 and Similar Large Language Models -Sobieszek, Price [3], there is a slightly more nuanced and deeper take that enables us to glimpse a property of unsupervised learning.

A refresher about a few terms and probabilities in LLMs.

Tokens, Vocabulary and Probabilities in LLMs

In GPT-3 and similar language models, vocabulary refers to the set of all tokens that the model can recognize and use to generate or process text. GPT-3 has a vocabulary of 50,257 tokens. It is much larger in bigger models.

Tokens are the basic building blocks of text that the model processes. Tokens can represent: A single word (e.g., “apple”); a part of a word (e.g., “ap”, “ple”); a single character (e.g., “a”, “b”); special symbols (e.g., punctuation or control tokens like <|endoftext|>)

When GPT-3 generates text. At each step in the sequence, the model predicts the next token. The prediction is based on probabilities over all 50,257 tokens (its vocabulary). The token with the highest probability is chosen (or sampled, depending on settings like temperature).

For example: Suppose the text so far is “I love”.

GPT-3 calculates the probabilities for the 50,257 tokens to determine the next token. It might decide: “you” (most probable), “,” (second most probable), “Python” (less probable).

The probability is not uniformly distributed (i.e., not 1/50,257 for each token). Instead, GPT-3 uses its trained knowledge to assign higher probabilities to tokens that make sense contextually. For example:

Tokens like “you,” “Python,” or “,” might have higher probabilities, as they are common continuations of “I love.”

Tokens like “qzx” (if it exists) or random characters would have much lower probabilities because they are unlikely to fit the context.

Here is the colab code for the above graph that you can run with different texts. Colab: LLM Next word Probability Plotting that will output the probabilities.

A slightly different version where we can see it generating one token after the next and using the generated tokens to calculate the next word probability. Colab: Next word Prediction

Decoded Context: I love New York. I love the
Max Probability Token: ' the' (ID: 262 word 4)
Context: I love New York. I love the
Next Token Prediction: 'the'

            the | ########## 0.2021
            New | ###### 0.1305
             it | ## 0.0544
             my | # 0.0330
          being | # 0.0257
           this | # 0.0257
             to | # 0.0227
            all | # 0.0227
           that |  0.0188
         living |  0.0156
Decoded Context: I love New York. I love the city
Max Probability Token: ' city' (ID: 1748 word 5)
Context: I love New York. I love the city
Next Token Prediction: 'city'

           city | ######### 0.1903
         people | ## 0.0580
            way | # 0.0213
           fact | # 0.0213
          place |  0.0156
            New |  0.0147
          music |  0.0138
        country |  0.0101
          great |  0.0079
        culture |  0.0079

Note: The output of the last layers is fed to a softmax and probabilities of all the vocabulary tokens add up to 1. In real network there are parameters like Temperatures and Top-p and Top-k that are used as a mechanism to shape ,limit or pick from the final probability layer. See here an explanation of temperature https://community.wolfram.com/groups/-/m/t/2958851

To calculate the total number of possible combinations of three tokens that can be generated following the initial input “I” using GPT-3’s vocabulary of 50,257 tokens, we need to consider the following:

After “I,” there are 50,257 possible tokens for the next word. and so on for each word

The total number of possible combinations is 50,257×50,257×50,257. This means there are approximately 127 trillion possible combinations of three tokens following the initial token “I”, though the probable combinations will be much lower.

Using this understanding lets us go to Sobieszek, Price paper

When GPT continues a text, it picks one of these probable tokens, appends the text with it, and then recalculates the probability of continuations for the new prior text elongated by that token. This is the autoregressive quality of GPT-3 … Two things should be noted here: (a) many possible strategies of traversing this tree of possibilities exist (e.g. always picking the most probable token), ..b) The exponential increase in the number of possible sentences that stems from this process is enormous (given GPT-3’s vocabulary of 50,257, the number of different possible outputs becomes greater than the number of atoms of the universe at a length of just 18 tokens).

..in order to produce coherent continuations, GPT-3 must’ have saved into the weights of its connections a massive amount of information.

However, considering the exponential growth of the number of conditional probabilities that must be stored, this information must first be somehow compressed, and such compression usually cannot be lossless

What we’ll claim is lost during this compression and the generalizations made during training are the source of both the limitations and supposed intelligence of such models

From Playing Games with AIs: The Limits of GPT-3 and Similar Large Language Models Sobieszek, Price

Compression as a means of understanding Unsupervised Learning.

“prediction leads to compression, compression leads to generalisation, and generalisation leads to computer intelligence” [3]

There are many concepts related to compression. One way to conceptualize this is through Kolmogorov Complexity, which measures the minimum length of a program required to describe a dataset. It is best explained by an example:

Consider a dataset with 10,000 repeated instances of the number 1:

x=”111111…1111" (10,000 ones)

High Compressibility: A program print("1"*10000) is much shorter than the data itself. Kolmogorov Complexity K(x) is low here

Low Compressibility: For a truly random string of length 10,000 (e.g., sensor noise), there’s no shorter representation. Kolmogorov Complexity K(x) is approximately the size of xxx.

This means that if you can Compress data, the algorithm should know the structure of the data. Example ZIP. A file with non-repeatable data is harder to compress than one with repeatable data.

A text with a Jabberwocky poem can be compressed at only 40% but a file with “a” repeated can be compressed 90%.

alex@pop-os:~$ ls -lh jabberwocky.* 
-rw-rw-r-- 1 alex alex 941 Nov 18 20:08 jabberwocky.txt
-rw-rw-r-- 1 alex alex 657 Nov 18 20:07 jabberwocky.zip
alex@pop-os:~$ ls -lh aa.* 
-rw-rw-r-- 1 alex alex 899 Nov 18 20:28 aa.txt
-rw-rw-r-- 1 alex alex 210 Nov 18 20:28 aa.zip

It may surprise you that the most efficient text compressor is based on the Transformer model and LLM by Fabrice Bellard which used a language model for very high compression. Here is the paper https://bellard.org/nncp/nncp.pdf and https://bellard.org/nncp/nncp_v2.pdf and NNCP leading in text compression leaderboard https://www.mattmahoney.net/dc/text.html#1072, though the exact mechanism is not open-sourced and hence may not be relevant in this context.

Another example is human vision. We take in so much information/data through our eyes and senses. But our brain can compress it enough to generalise on the essentials.

Approximating Kolmogorov Complexity in an LLM

Let’s take the LLAMA 2 model. This is trained on 2 trillion tokens. Assuming an average of 4 bytes per token, the size comes to 8 trillion bytes or about 8000 GB.

Using the size of the models, a 13 Billion parameter LLAMA2 model is about 26 GB assuming that the code is just a few KBs and focussing only on weights (Model Size in Bits=Number of Parameters×Bits per Parameter /FP16).

That is 8000GB to 26 GB compression. Though Kolmogorov Complexity is not computable we can assume that it is very high here and hence that the model has the generalization capabilities.

And even then it can emit Jabberwocky almost line by line with some hallucinations towards the end. You can try this at https://api.together.xyz/playground/chat/meta-llama/Llama-2-13b-chat-hf

complete rest of Did gyre and gimble in the wabe
----
Sure! Here is the complete poem "Jabberwocky" by Lewis Carroll, with the lines you requested filled in:

" 'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

'Beware the Jabberwock, my son!
The frumious Bandersnatch
He left the dewlap globeless,
And the truncheon upright.

He spoke in words of sorrow,
'Though she be but a whisker.'"

Note that most of these models are trained for only one epoch on the entire data set and loss is not close to zero, as this one epoch requires itself very high resources and more than one epoch is not feasible as of now.

LLAMA2 Training loss https://arxiv.org/pdf/2307.09288)

There would be a lot of hallucinations regarding facts. It could be that Jabbewocky is a popular poem and would have appeared multiple times in the training set and hence this better result. See also Banishing LLM Hallucinations Requires Rethinking Generalization Johnny Li Saksham https://arxiv.org/html/2406.17642v1

A larger model like ChatGPT 4o can reproduce without much mistake the original text of many passages given a good starting prompt https://www.gutenberg.org/files/3618/3618-h/3618-h.htm

ChatGPT 4o model recalls without hallicunation the starting of Arms and Man play given a good prompt.

It seems that without some form of compression, the system cannot store this much information in its weights alone. We may need to check the possible information content that can be encoded in N weights to the training data to be more rigorous here but that is no easy task as non-compression-based mechanisms like in [5] are hard to calculate or estimate practically.

LLMs and Reasoning

There are many logical and reasoning questions online and it is difficult to pick a prompt and check if the model has reasoning or if it is just repeating from a similar training dataset for that prompt.

The rigorous way would be to check the score on benchmarks like MLMU (Multitask Reasoning) https://www.vellum.ai/llm-leaderboard where it is clear that LLMs have substantial reasoning capabilities.

Still, for the sake of an example, I went with something simple and absurd hoping that the model (LLAMA2 13b Chat) has not seen this yet just for illustration.

my Cow has jumped over the barn,said Thomas. Can Thomas be trusted
----
Thomas is a person who is making a statement, but the statement itself is impossible. Cows cannot jump over barns, as they are large animals that do not have the ability to fly or jump over tall structures. Therefore, Thomas cannot be trusted to make accurate statements about his cow's abilities.

Assuming that the model has not seen this type of prompt in Training, as there are infinitely different ways to make unique text, still, it has pieced together the context “that Thomas states his cow has jumped over a barn” and from the instruction prompt -” Can Thomas be trusted”, it has not statistically just picked out a probable answer like Yes/No Thomas can/cannot be trusted, but it has reasoned if it is “possible” for a “large animal like a cow” to jump or fly and hence deduced that the statement is “incorrect/unlikely” and that “Thomas statement cannot be trusted”.

Even a smaller model like LLAMA2 13B has rudimentary world model

Surely there are a lot of elements of Instruction fine tuning and alignment with human expected answers via Reinforcement learning in play here which could have influenced the answer, but the next-word prediction training has learned unexpected connections between concepts and this is the magic of Unsupervised learning.

This is the sort of reasoning or understanding that makes LLMs truly useful.

To argue from an example invites counter-examples. Can we extend that compression needs deeper understanding and this can be a basis for reasoning, which is arranging initial token order, so that it automatically answers the question?

Initial empirical proof of Unsupervised Learning GP2 and GPT3

The paper Language Models are Unsupervised Multitask Learners [6] introduced GPT-2 (a 1.5 Billion parameter model ) to the world based on Transformer architecture. It did something very extraordinary. It proved empirically that a sufficiently large LLM, when trained with a sufficiently large dataset (CommomCrawl / WebText- cleaned up internet data) starts to generalise so well that it is able to successfully execute many tasks better or at par on the level of State of the Art supervised systems at that time.

“GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.” https://openai.com/index/better-language-models/

This was the first practical demonstration of the capabilities of unsupervised learning, emprically verified with tests. The ranking in the tests can seen here — https://paperswithcode.com/paper/language-models-are-unsupervised-multitask

GPT-3 is a 175 Billion parameter model (Language Models are Few-Shot Learners[7] and it was trained almost exactly like GPT2 but on more tokens/ tain datasets and it improved considerably on the GPT2 benchmarks for NLP tasks- https://paperswithcode.com/paper/language-models-are-few-shot-learners

Explaining Unsupervised Learning through Emergence

Other than compression, another way of examining the property of unsupervised learning is through the concept of Emergence. This is not universally accepted as of now as there are counterarguments here. But this is the case with most things in AI and LLM now.

The 2022 Google/DeepMind research paper Emergent Abilities of Large Language Models [8]describes this phenomenon

Emergence is when quantitative changes in a system result in qualitative changes in behavior.
An ability is emergent if it is not present in smaller models but is present in larger models. [8]

It notes that the score in the benchmarks for various NLP tasks increases spontaneously beyond a model scale, though it offers no explanation as to the behaviour of emergence. The paper and others partly suggest that phase transitions in the internal representations or mechanisms could underlie emergent abilities. While metrics like cross-entropy loss show steady improvements during training, these phase transitions manifest as sudden, qualitative improvements in task performance, often without clear changes in the loss curve itself. These explanations are not intuitive.

Emergent Abilities of Large Language Models Jason Wei,Yi Tay https://arxiv.org/pdf/2206.07682

Till we get more theoretical underpinnings to the powerful abilities of unsupervised learning, lets us explore more of its potential via subjective and empirical means.

References

[1] On the Dangers of Stochastic Parrots: Can Language Models Be Too Big -Bender et al https://dl.acm.org/doi/pdf/10.1145/3442188.3445922

[2] GPT-3: Its Nature, Scope, Limits, and Consequences — Floridi, Chiriatti https://dl.acm.org/doi/10.1007/s11023-020-09548-1

[3] Playing Games with AIs: The Limits of GPT-3 and Similar Large Language Models Sobieszek, Price https://link.springer.com/article/10.1007/s11023-022-09602-0

[4] Ilya Sutskever An Observation on Generalization

[5] Computing the Information Content of Trained Neural Networks — Jeremy Bernstein, Yisong Yue

[6] Language Models are Unsupervised Multitask Learners — Alec Radford Jeffrey Wu https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (GPT2)

[7] Language Models are Few-Shot Learners https://arxiv.org/pdf/2005.14165 [GPT-3]

[8] Emergent Abilities of Large Language Models Jason Wei,Yi Tay https://arxiv.org/pdf/2206.07682

[9] Banishing LLM Hallucinations Requires Rethinking Generalization Johnny Li Saksham https://arxiv.org/html/2406.17642v1

Alex’s Substack

Discussion about this post

Ready for more?