From Tokens to Spaces: Rethinking Semantics Across Language and Vision

Comparison between Visual and Linguistic Features

Visual models certainly encode conceptual structure within their sparse autoencoder (SAE) representations, but interpreting these concepts is far less straightforward than in language. In language, semantics are largely anchored by the tokens themselves, whereas visual semantics resist such a direct mapping. Visual features can be clustered and aligned, but the function that links a representation to its meaning is harder to establish than in the linguistic domain.
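
To make "SAE representations" concrete, here is a minimal sketch of a sparse autoencoder trained on vision-model activations, assuming PyTorch; the dimensions and the patch_activations tensor are hypothetical stand-ins for real ViT patch activations, not any particular published setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an overcomplete dictionary with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # Feature activations: non-negative and, with the L1 penalty, ideally sparse.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f

# Hypothetical usage: patch activations from a vision transformer, shape (num_patches, d_model).
d_model, d_hidden = 768, 8192
sae = SparseAutoencoder(d_model, d_hidden)
patch_activations = torch.randn(196, d_model)  # stand-in for real activations

x_hat, features = sae(patch_activations)
recon_loss = (x_hat - patch_activations).pow(2).mean()
sparsity_loss = features.abs().mean()
loss = recon_loss + 1e-3 * sparsity_loss
```

The open question in the text is precisely what the columns of the decoder (the learned "concepts") mean for images, since no token conveniently labels them.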

In particular, linguistic representations can often be chained together in a chain-of-thought style, linking features to form logical circuits. Visual representations, however, emphasize continuity and spatial relations more than explicit logic. Perhaps following the trajectory of visual forms resembles the human mode of imaginative thinking, where spatial reasoning supports higher cognition. In this sense, visual encodings could be essential tools for helping humans reason about spatial and dynamic structures.

If language encodings capture relationships between explicit symbols and their conceptual anchors, visual encodings may capture concepts of space itself and the objects inhabiting it. To capture this, we must deepen our understanding of the causal dynamics that unfold in space and time. Using language is like building on strong anchor points that directly evoke concepts, while using visual representations requires interpreting ambiguous shapes, their movements, transformations, and relations.

Ultimately, models will move toward integrating images and text, just as humans combine logical information with spatiotemporal understanding. The challenge lies in how such vast streams of information can be efficiently encoded, stored, and interrelated. I firmly believe that AI's modes of thought will elevate human cognition. Humans learn continuously from birth through imitation, and to support that process, we need ever more advanced models to imitate. Lessons and wisdom take shape in surprising forms that move us. I am only searching for those yet-unrevealed forms of wonder.


Additional Ideas

The difficulty of interpreting visual encodings stems from their continuous, compositional, and dynamic nature. Unlike discrete tokens in language, visual inputs vary smoothly under rotation, scale, and illumination. The same object may appear in countless forms, making direct semantic assignment elusive. Moreover, visual scenes are rarely composed of isolated elements—they blend objects, relations, and backgrounds in ways that resist clean symbolic partitioning. This makes the semantic “anchors” of vision inherently less stable than those of language.

One promising direction is object-centric learning, where models attempt to decompose a scene into object slots that serve as candidates for concepts. Each slot can then be aligned with linguistic tokens or grounded in functional predictions. Yet static object recognition is not enough: many visual semantics emerge from dynamics and causality. Predicting how balls collide, how liquids flow, or how agents interact in space teaches models not only what objects are but also how they relate and influence each other. Such causal grounding is indispensable for deeper semantic understanding.
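
As a concrete illustration of the slot idea, the following is a minimal sketch of a slot-attention update in the spirit of Locatello et al. (2020), assuming PyTorch; the dimensions, iteration count, and the features tensor are illustrative assumptions rather than a faithful reproduction of any specific implementation.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Simplified slot attention: slots compete for image patches over a few iterations."""
    def __init__(self, num_slots: int, dim: int, iters: int = 3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, inputs):                      # inputs: (B, N, dim) patch features
        B = inputs.shape[0]
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over slots, so slots compete for each patch.
            attn = torch.softmax(torch.einsum("bnd,bkd->bnk", k, q) * self.scale, dim=-1)
            attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-8)
            updates = torch.einsum("bnk,bnd->bkd", attn, v)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).reshape(B, self.num_slots, -1)
        return slots  # each slot is a candidate object-level concept

features = torch.randn(2, 196, 64)                  # stand-in for ViT patch features
slots = SlotAttention(num_slots=5, dim=64)(features)
```

Each returned slot can then, in principle, be matched against a linguistic token or fed into a dynamics model that predicts how the corresponding object moves and interacts.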

Language and vision also differ in their compositional structures. In language, chain-of-thought reasoning can be viewed as a form of symbolic circuitry, where features activate along logical pathways. Vision, by contrast, behaves more like a continuum of concepts on a manifold: shapes, colors, and trajectories flow into one another, producing meaning not through explicit logic but through gradual transformations. This makes visual reasoning closer to imagination, where humans intuitively “see” possibilities in space rather than deduce them step by step.

From a methodological perspective, models that combine contrastive learning (e.g., CLIP) with scene graphs or graph neural networks can begin to bridge this gap. Contrastive objectives align ambiguous visual slots with strong linguistic anchors, while graph-structured reasoning explicitly encodes relationships among objects. At the same time, causal interventions—such as altering colors, positions, or dynamics in synthetic data—provide a way to test whether a model's visual features correspond to meaningful concepts.
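
A minimal sketch of the contrastive piece, assuming PyTorch: a symmetric InfoNCE loss that pulls pooled slot embeddings toward the text embeddings of matching captions. The mean pooling, dimensions, and randomly generated tensors are hypothetical placeholders for real encoder outputs, not the CLIP training recipe itself.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image/text pairs lie on the diagonal of the logits."""
    visual_emb = F.normalize(visual_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = visual_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Hypothetical usage: pool slot representations into one visual embedding per image,
# then pull it toward the caption embedding describing the same scene.
B, num_slots, dim = 8, 5, 64
slot_emb = torch.randn(B, num_slots, dim)    # e.g. output of a slot-attention module
text_emb = torch.randn(B, dim)               # e.g. output of a text encoder
visual_emb = slot_emb.mean(dim=1)            # simple pooling over slots
loss = clip_style_contrastive_loss(visual_emb, text_emb)
```

Causal interventions would then consist of rerunning such a pipeline on synthetic scenes where only one factor (color, position, velocity) is changed, and checking which slot or feature responds.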

The larger trajectory is toward multimodal cognition, where models learn to integrate symbolic and spatial representations. Language provides discrete, explicit handles on meaning; vision provides continuity, embodiment, and dynamic grounding. Together, they approximate the full richness of human cognition, which relies on both logic and imagination. If these representational strategies are properly unified, AI may not only mimic but also augment human reasoning, offering us new conceptual anchors, new ways to imagine, and new forms of wisdom yet unseen.