
From Atom to Galaxy: MIT Explores the Geometry of Concepts in LLMs

The research conducted by Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark at the Massachusetts Institute of Technology (MIT) on large language models has led to a new understanding of the structure of concepts within the models themselves. Among the most promising innovations, sparse autoencoders (SAEs) have proven capable of generating points in activation space that can be interpreted as high-level concepts. This article explores the structure of these points, defined as the "concept universe," articulated on three spatial scales: the atom, the brain, and the galaxy. These three scales represent different levels of abstraction and complexity, each providing a unique perspective on the representation and organization of concepts in language models.


Geometry of Concepts in LLMs: Atomic Scale

The geometry of concepts in LLMs, analyzed at the smallest scale, can be visualized as a series of "crystals" whose faces form regular geometric shapes such as parallelograms or trapezoids. These crystals represent semantic relationships between words or concepts, a fundamental aspect of understanding how language models work. A classic example of this structure is the relationship between the words "man," "woman," "king," and "queen." The difference between "man" and "king" corresponds to a similar difference between "woman" and "queen," forming a parallelogram in semantic space. This geometric arrangement shows that language models can capture relationships such as the transition from an ordinary person to a royal figure.
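As a minimal sketch of this intuition, the parallelogram can be checked directly on vectors: if the analogy holds, the "man → woman" and "king → queen" difference vectors point in nearly the same direction. The snippet below uses a placeholder dictionary of random vectors standing in for real word embeddings or SAE feature directions; it is an illustration, not the authors' code.

```python
import numpy as np

# Hypothetical lookup: word -> vector. Random stand-ins for real word
# embeddings or SAE feature directions; not the authors' data.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=256) for w in ["man", "woman", "king", "queen"]}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# If the four words form a parallelogram, the two difference vectors
# should point in almost the same direction (cosine close to 1 for real data).
gender_a = embeddings["woman"] - embeddings["man"]
gender_b = embeddings["queen"] - embeddings["king"]
print("parallelogram score:", cosine(gender_a, gender_b))
```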


However, these geometric structures are not always evident, because confounding factors such as word length or other surface features can obscure the deeper relationships. To improve the quality of these representations, a technique known as linear discriminant analysis (LDA) has been used. This technique projects the data into a space where these distractor directions are suppressed, making the semantic connections more visible.
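A hedged sketch of this idea, using scikit-learn's LinearDiscriminantAnalysis, is shown below. The vectors and labels are toy placeholders (the label marks the semantic attribute of interest, while surface features such as word length are treated as within-class noise); the paper's exact projection pipeline may differ.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: 200 vectors in 64 dimensions, each labeled with the semantic
# attribute of interest (e.g. 0 = male terms, 1 = female terms).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 64))      # stand-in for real word/feature vectors
y = rng.integers(0, 2, size=200)    # stand-in for the semantic labels

# LDA keeps the directions that separate the semantic classes and treats
# everything else (word length, frequency, ...) as within-class noise.
lda = LinearDiscriminantAnalysis(n_components=1)
X_projected = lda.fit_transform(X, y)
print(X_projected.shape)            # (200, 1): the "purified" concept direction
```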


A concrete example of the application of LDA can be seen in the relationship between countries and capitals. Consider "Austria" and "Vienna," and "Switzerland" and "Bern." When the data is analyzed by eliminating irrelevant components, such as length or other features unrelated to meaning, a clear parallel emerges between these pairs. The vector connecting "Austria" to "Vienna" can be seen as a map describing the concept of "country capital," and this same vector also connects "Switzerland" to "Bern."


To identify these structures, the differences between all pairs of points in semantic space are calculated. These difference vectors are then grouped into sets that correspond to specific conceptual transformations. For example, by analyzing a set of words like "man," "woman," "boy," "girl," the difference vectors between "man" and "woman" or between "boy" and "girl" show a common pattern: the concept of gender. This pattern becomes more evident after eliminating distractions such as word length, making the underlying geometric structure clearer.
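The procedure can be sketched as follows: compute all pairwise difference vectors, normalize them, and cluster them by direction. The data and the number of clusters below are placeholders, and this is only an illustrative reconstruction of the approach described, not the authors' implementation.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

# Placeholder point cloud: each row is a word/feature vector.
rng = np.random.default_rng(2)
vectors = rng.normal(size=(40, 128))

# Difference vector for every pair of points.
pairs = list(combinations(range(len(vectors)), 2))
diffs = np.array([vectors[j] - vectors[i] for i, j in pairs])

# Normalize so that clustering groups pairs by the *direction* of the
# transformation (e.g. "country -> capital"), not by its magnitude.
diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)

# Each cluster of difference vectors is a candidate conceptual transformation.
labels = KMeans(n_clusters=8, n_init=10).fit_predict(diffs)
print(np.bincount(labels))  # how many pairs each candidate transformation covers
```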


The use of these difference vectors makes it possible to represent more complex relationships, such as those between entities and attributes. For example, the relationship between "sun" and "light" can be interpreted as a cause-and-effect relationship, and the same type of relationship can be observed between "fire" and "heat." Once irrelevant components are removed, these connections become more evident and consistent.

In summary, the analysis of vector differences and their projection into purer spaces allows exploration of how language models represent concepts and relationships. This approach not only clarifies deep semantic structures but also paves the way for a more detailed understanding of how language models process and organize information.


Brain Scale: Functional Modules

At an intermediate scale of analysis, it has been observed that the space of sparse autoencoder (SAE) features organizes itself similarly to the functional structure of the human brain. This means that points representing certain features are grouped into distinct regions of space, forming what can be compared to "functional lobes." For instance, features related to mathematics and programming are found in a specific area, separate from the one that gathers linguistic features such as text comprehension or dialogue.

To better understand, one can imagine the activation space as a large map. On this map, data is represented as points, and points that share similar functions tend to cluster, just like cities specialized in certain sectors: some areas of the map represent "mathematics cities," while others are dedicated to "languages." This analogy to the biological brain is based on the fact that brain lobes are also organized for specific tasks, such as the frontal lobe for reasoning or the occipital lobe for vision.


To identify these regions or "lobes," an affinity matrix was constructed. This tool helps understand which features activate simultaneously in the analyzed data, much like observing which cities on a map have more trade between them. Subsequently, with a method called spectral clustering, which groups points based on their connections, it was possible to subdivide the space into distinct regions. For example, one region proved active when the model processed documents containing computer code or equations, while another region activated during the analysis of written texts such as articles or chats.
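A simplified sketch of this pipeline is shown below: a binary document-by-feature activation matrix is turned into a co-occurrence affinity matrix, which spectral clustering then partitions into candidate lobes. The data, the normalization, and the number of clusters are all illustrative assumptions, not the paper's exact choices.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Placeholder binary matrix: activations[d, f] = 1 if SAE feature f fires on document d.
rng = np.random.default_rng(3)
activations = (rng.random((1000, 300)) < 0.05).astype(float)

# Co-occurrence counts: how often two features fire on the same document.
cooccurrence = activations.T @ activations                          # (n_features, n_features)
affinity = cooccurrence / np.maximum(cooccurrence.diagonal(), 1.0)  # crude normalization
affinity = 0.5 * (affinity + affinity.T)                            # spectral clustering needs symmetry

# Partition the features into candidate "lobes" based on who fires together.
model = SpectralClustering(n_clusters=3, affinity="precomputed", random_state=0)
labels = model.fit_predict(affinity)
print(np.bincount(labels))  # number of features assigned to each lobe
```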


To verify that this subdivision was indeed meaningful and not random, two methods were used. The first, called "adjusted mutual information," measures how well the grouping into clusters actually reflects functional structure: imagine a puzzle, where this metric checks whether the pieces fit together according to their natural positions. The second method used logistic regression, a statistical technique that tried to predict, from a feature's position on the map, which lobe it belongs to. Both methods confirmed that the lobes are not randomly arranged but follow a precise logic.
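Both checks can be sketched with standard scikit-learn tools, as below. The arrays are stand-ins for the quantities described above (a geometric partition, a functional partition, and the feature positions); in this toy case the two partitions coincide, so the scores come out trivially high.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.model_selection import cross_val_score

# Stand-ins for the quantities discussed above:
#   functional_lobes - cluster of each feature from co-occurrence statistics
#   geometric_lobes  - cluster of each feature from its position in space
#   positions        - coordinates of each feature point
rng = np.random.default_rng(4)
n_features = 300
functional_lobes = rng.integers(0, 3, size=n_features)
geometric_lobes = functional_lobes.copy()        # identical in this toy case
positions = rng.normal(size=(n_features, 16))

# 1) Adjusted mutual information: do the two partitions agree beyond chance?
print("AMI:", adjusted_mutual_info_score(geometric_lobes, functional_lobes))

# 2) Logistic regression: can a feature's position alone predict its functional lobe?
scores = cross_val_score(LogisticRegression(max_iter=1000), positions, functional_lobes, cv=5)
print("mean cross-validated accuracy:", scores.mean())
```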


To better explore the relationships between features, several co-occurrence measures were used, such as the simple matching coefficient and Jaccard similarity. These methods calculate, for example, how often two features activate together compared to how often they could activate in general. Another measure, the Dice coefficient, was useful for detecting relationships between rare features, while the Phi coefficient proved the most effective at identifying strong connections. Returning to the map analogy, these tools help determine how likely it is that two cities have frequent or significant trade between them.
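For concreteness, the sketch below computes these coefficients from the 2x2 contingency table of two binary activation vectors. The formulas are the standard definitions of these measures; the example vectors are invented.

```python
import numpy as np

def cooccurrence_scores(a, b):
    """Similarity of two binary activation vectors (1 = the feature fired on that document)."""
    n11 = int(np.sum((a == 1) & (b == 1)))   # both features fire
    n10 = int(np.sum((a == 1) & (b == 0)))
    n01 = int(np.sum((a == 0) & (b == 1)))
    n00 = int(np.sum((a == 0) & (b == 0)))
    n = n11 + n10 + n01 + n00

    simple_matching = (n11 + n00) / n
    jaccard = n11 / max(n11 + n10 + n01, 1)
    dice = 2 * n11 / max(2 * n11 + n10 + n01, 1)
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    phi = (n11 * n00 - n10 * n01) / denom if denom > 0 else 0.0
    return simple_matching, jaccard, dice, phi

# Invented example: two features that usually fire on the same documents.
a = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 0])
b = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
print(cooccurrence_scores(a, b))
```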


A practical example of the importance of this organization is given by the clustering of features related to programming. When these features are concentrated in a single "lobe," the model can more easily process specific tasks such as interpreting computer code. Similarly, the lobes dedicated to natural language simplify the processing of texts or conversations, making the model more efficient and accurate.


This spatial subdivision not only improves the model's performance but also makes it more interpretable. Knowing, for instance, that a particular region activates only with input related to mathematics allows a better understanding of how the model organizes and processes information. Like a well-planned city, where each neighborhood has its function, this organization makes the system more comprehensible and orderly, facilitating the study of its internal dynamics.


Galaxy Scale: Large-Scale Structure

At a larger scale of analysis, the point cloud of sparse autoencoder features shows a distribution that can be compared to that of galaxies in the universe. This means that the points representing the information processed by the model are not distributed uniformly (isotropically) but follow an underlying order, with areas of higher density and others that are emptier. It is like observing the night sky: stars and galaxies are not randomly distributed but cluster into complex structures. Similarly, in the activation space, information is organized into clusters and patterns.


A principal component analysis (PCA) helps study this organization. PCA is a method that reduces data complexity by finding the main directions that explain most of the variation. In the context of the model, it was observed that some directions represent a much larger amount of information compared to others. This behavior follows a "power law," similar to natural phenomena where a few variables dominate the system, such as the distribution of wealth or the size of craters on the Moon. In the intermediate layers of the model, this effect is particularly evident, suggesting that the system is compressing information to represent only the essential aspects.
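A minimal sketch of this analysis: fit PCA to a layer's activation point cloud, take the eigenvalue spectrum, and estimate the power-law exponent as the slope of a linear fit in log-log coordinates. The activations below are random placeholders, so the fitted exponent only demonstrates the method, not the paper's values.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder point cloud standing in for one layer's activation/SAE vectors.
rng = np.random.default_rng(5)
activations = rng.normal(size=(5000, 512))

# Eigenvalue spectrum of the point cloud.
eigenvalues = PCA().fit(activations).explained_variance_

# A power law lambda_k ~ k^(-alpha) is a straight line in log-log coordinates,
# so the slope of a linear fit estimates the exponent alpha.
k = np.arange(1, len(eigenvalues) + 1)
slope, _ = np.polyfit(np.log(k), np.log(eigenvalues), deg=1)
print("estimated power-law exponent:", -slope)
```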


One can think of the intermediate layers of the model as a "bottleneck": information passes through a narrowing where it is condensed, to then be expanded again in subsequent layers. A practical analogy is a digital photo: a very large image is compressed to take up less space without losing important details, and then decompressed when needed, maintaining the necessary quality. This compression allows the model to represent complex information more compactly, focusing on the most relevant aspects and leaving out superfluous details.


The structure of the data in activation space has been described as a "fractal cucumber." This image may seem curious, but it is useful for understanding the distribution: the points are not random but follow a pattern that resembles an elongated shape, similar to a cucumber, which becomes more complex as finer details are observed, as with fractal figures. This suggests that the model organizes information in hierarchical levels, concentrating the most important features in a few main directions.


An everyday example of this hierarchy is summarizing a long article. In the initial layers, the model might gather many different pieces of information, such as words, sentences, and details. In the intermediate layers, the system filters this data, reducing it to a few key concepts, such as the main theme or the central message. In the final layers, this condensed information is reworked to produce a complete response, similar to a summary that restores the context but remains focused on the essential points.

Further analysis based on clustering entropy has shown that in the intermediate layers, the information is more concentrated compared to the initial and final ones. This indicates that the model organizes information more densely and compactly at this stage, as if it were squeezing the juice out of an orange to extract only the most useful part. This process improves the model's efficiency, allowing it to process information more quickly and accurately.
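As a rough illustration of the idea, one can compare layers using the differential entropy of a Gaussian fitted to each point cloud, a simple proxy for how concentrated a representation is; the paper's clustering-entropy estimator may differ, and everything in this sketch is placeholder data.

```python
import numpy as np

def gaussian_entropy(points):
    """Differential entropy (in nats) of a Gaussian fitted to a point cloud.

    A rough proxy for how concentrated the representation is:
    lower entropy = information packed into a tighter region of space.
    """
    d = points.shape[1]
    cov = np.cov(points, rowvar=False) + 1e-6 * np.eye(d)  # regularize for stability
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# Placeholder clouds: a broad "early layer" vs. a tighter "intermediate layer".
rng = np.random.default_rng(6)
early = rng.normal(scale=1.0, size=(2000, 64))
middle = rng.normal(scale=0.3, size=(2000, 64))
print(gaussian_entropy(early), ">", gaussian_entropy(middle))
```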


Finally, the analysis of the distribution of principal components and the presence of a power law highlight that the model emphasizes some information more than others. For example, during the processing of a complex question, the model might ignore less relevant details (such as synonyms or redundant phrases) to focus on the words and phrases that give the most clues about the meaning. This allows the system to generate more coherent and pertinent responses, just like a person who, reading a text, tries to grasp the main message while ignoring the less important information.


Evolution of Concept Geometry in LLMs

The structure of the conceptual universe of SAEs reveals fascinating patterns on three distinct levels: geometric crystals at the atomic scale, functional lobes at the brain scale, and large-scale distributions following a power law at the galactic scale. These findings offer a unique perspective on the representation of concepts within language models, paving the way for a deeper understanding of their abstraction and semantic representation capabilities.


The next step will be to explore how these structures can be used to improve the performance of language models, making them more interpretable and capable of capturing complex semantic relationships, while simultaneously reducing the need for human supervision. Understanding how these structures emerge could also enable the development of new training techniques that exploit functional modularity and information hierarchy to create more robust and efficient models.


Moreover, the use of quantitative metrics to evaluate the consistency between geometric structure and functionality could become a key element in developing new neural network architectures. For instance, measuring the effectiveness of clustering methods through metrics such as adjusted mutual information and the predictive ability via logistic regression can provide new tools to optimize the organization of features in language models. The use of techniques like linear discriminant analysis and spectral clustering could further refine information representation, enhancing the precision with which semantic relationships are captured.


The use of sparse autoencoders and the analysis of concept geometry have the potential to greatly improve our understanding of AI models, making them increasingly sophisticated tools for tackling complex problems. The implications of this research go beyond merely understanding existing models: they could lead to the development of new neural network architectures inspired by the emerging structures observed, capable of more effectively solving the challenges posed by natural language understanding and generation.

The future of research on language models might therefore see greater emphasis on interpretability and transparency, using these discoveries to create systems that are not only performant but also comprehensible and reliable. This would open the way to a new generation of AI models that can be used in critical sectors such as medicine, law, and education, where deep understanding and decision reliability are essential.


Conclusions

The geometry of concepts within Sparse Autoencoders is not just an investigation into the structure of language models but a window into a new logic of artificial thought. This logic does not operate in a symbolic or deterministic manner, as we were used to imagining, but builds emergent meanings on a geometric, modular, and dynamic basis. This perspective challenges our way of thinking about both human and artificial intelligence and offers new strategic directions for businesses that want to transform their relationship with complexity.


The disruptive intuition is that AI models seem to imitate not only human cognitive functions but also universal patterns of nature, such as crystallization or galaxy formation. If intelligence is not an algorithm but an organized geometric pattern, then businesses must begin to consider their structure not as a linear organization but as a complex "activation space" where each node represents a concept, a function, or a relationship. This raises a radical question: what if companies could design their own "concept geometry" to foster innovation, resilience, and continuous learning? The atomic-brain-galactic hierarchy could inspire a business model that transcends the traditional vertical and horizontal hierarchical division towards a modular and fractal organization.


At the atomic scale, the linearity of relationships in SAE models suggests that even in complex systems, it is possible to isolate key transformations that govern the entire system. For businesses, this means finding the critical vectors that link operational concepts such as product, market, culture, and strategy.


In a business context, this could translate into identifying replicable "conceptual templates"—like processes that work in different markets or strategies that scale across teams with distinct objectives. However, the research emphasizes that superficial noise often masks these deep relationships. Similarly, many companies are slaves to superficial metrics or cultural preconceptions that prevent them from seeing the fundamental patterns of success.


At the brain scale, the modular organization of functions opens the door to a bold idea: what if businesses stopped organizing themselves into departments and started organizing into "functional lobes"? These lobes would not be static but dynamic, evolving based on needs and the co-occurrence of skills. For example, an "innovation lobe" could temporarily emerge to handle a complex challenge, involving skills from R&D, marketing, and operations, only to dissolve and reorganize elsewhere. This vision challenges traditional corporate silos and suggests that true competitive strength lies in the ability to constantly reorganize connections in response to external and internal challenges.


At the galactic scale, the idea of a bottleneck in intermediate layers is enlightening. The compression and expansion of information is not just a technical issue in language models but a paradigm for dealing with uncertainty and ambiguity in decision-making processes. Companies facing complex data must learn to "compress" raw information into critical insights—reducing redundancy—to then expand it into concrete operational strategies. However, this process cannot happen without losing some of the "noise" that masks the most important relationships. Here, the power law comes into play: some information is immensely more significant than others. In a world that produces more and more data, the ability to identify a few main strategic directions becomes the difference between survival and failure.


Finally, the comparison with galaxies leads to an even more radical reflection. If the structure of language models follows patterns of natural organization, this implies that intelligence is not strictly an artificial or human phenomenon but an emergent process that obeys universal laws. For businesses, this means that the most resilient organizational structures are not those rigidly designed from the top down but those that emerge from distributed and adaptive dynamics. The power law in models could translate, in an organizational context, into a strategic distribution of resources: a few key areas will receive most of the energy, while others, marginal ones, will be optimized to ensure flexibility.


This vision poses an ethical and strategic dilemma. Companies that adopt a geometric and fractal logic for their organization will likely gain a competitive advantage but also risk exacerbating inequality dynamics by concentrating decision-making power in a few critical nodes. On the other hand, an organization that uses this understanding to design more balanced networks, with equitable resource distribution, could not only be fairer but also more robust in the long term.


Ultimately, research on the concept universe of language models invites us to rethink the very meaning of organization, knowledge, and adaptability. Future companies will not just be machines for producing economic value but complex cognitive systems that learn, evolve, and interact with the environment according to universal geometric principles. Embracing this vision is not just a strategic choice: it is a step towards a new era of understanding and co-evolution with the complexity of the world around us.


 
