Towards Improved Sentence Representations using Token Graphs

ICLR 2026
1University of Bonn   2Lamarr Institute for Machine Learning and Artificial Intelligence
3University of Cambridge   4Ben-Gurion University of the Negev
GLOT Architecture Teaser

Abstract: Obtaining a single-vector representation from the token-level outputs of a large language model (LLM) is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state-of-the-art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters, and it reduces training time by over 100x compared with parameter-efficient fine-tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs.

Methodology: Graph Learning Over Tokens

Instead of discarding multi-hop linguistic dependencies, GLOT explicitly models token interactions through a three-step framework.

Step 1: Token Graph Construction

We treat the hidden states \( \mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_L] \) from a frozen LLM as nodes. Edges are formed based on cosine similarity. To isolate the strongest semantic dependencies, we prune weak connections that fall below a threshold \( \tau \), creating a sparse latent token graph.
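The graph construction in Step 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold value and the choice to drop self-loops are assumptions made here for clarity.

```python
import numpy as np

def build_token_graph(X, tau=0.5):
    """Sketch of Step 1: build a sparse token-similarity graph.

    X:   (L, d) array of token hidden states from a frozen LLM.
    tau: similarity threshold (illustrative value); edges whose cosine
         similarity falls below tau are pruned.
    Returns a binary (L, L) adjacency matrix.
    """
    # Normalize rows so that X_n @ X_n.T gives pairwise cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-8, None)
    S = Xn @ Xn.T
    # Keep only the strongest semantic dependencies; drop self-loops.
    A = (S >= tau).astype(float)
    np.fill_diagonal(A, 0.0)
    return A
```

Because cosine similarity is symmetric, the resulting adjacency matrix is symmetric as well, so the latent token graph is undirected.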

Step 2: Refinement with Token-GNN

Tokens exchange information with their semantic neighbors via a \( K \)-layer message-passing Graph Neural Network. By aggregating neighborhood context \( \mathbf{a}_i^{(\ell)} \), GLOT captures complex dependencies that standard independent pooling cannot see, resulting in structurally refined tokens \( \mathbf{U} \).
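A minimal sketch of the message-passing refinement in Step 2 is below. The paper's Token-GNN uses learned transformations; this toy version replaces them with a fixed mean-aggregation update and a hypothetical mixing rate `alpha`, purely to illustrate how neighborhood context \( \mathbf{a}_i^{(\ell)} \) is folded into each token over \( K \) layers.

```python
import numpy as np

def refine_tokens(X, A, K=2, alpha=0.5):
    """Sketch of Step 2: K rounds of message passing over the token graph.

    X:     (L, d) token hidden states (nodes).
    A:     (L, L) binary adjacency matrix from Step 1.
    K:     number of message-passing layers.
    alpha: mixing rate between a token and its neighborhood context
           (an assumption of this sketch, not a paper hyperparameter).
    """
    # Add self-loops so isolated tokens keep their own representation.
    A_hat = A + np.eye(A.shape[0])
    deg = A_hat.sum(axis=1, keepdims=True)
    U = X.copy()
    for _ in range(K):
        agg = (A_hat @ U) / deg              # neighborhood context a_i
        U = (1 - alpha) * U + alpha * agg    # mix context into each token
    return U
```

Each layer pulls connected tokens toward a shared representation, so after \( K \) layers a token's embedding reflects its \( K \)-hop semantic neighborhood rather than the token alone.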

Step 3: Readout Layer

Finally, we compute a learnable importance weight \( \pi_i \) for each refined token. A weighted sum collapses the graph into a single, robust sentence representation \( \mathbf{z} \). Because relational learning occurs before aggregation, GLOT can learn which tokens carry the core meaning and down-weight the rest.
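The readout in Step 3 can be sketched as a softmax-weighted sum. The scoring vector `w` stands in for the layer's learned parameters (here it would be trained end-to-end; we treat it as given), and the softmax parameterization of \( \pi_i \) is an assumption of this sketch.

```python
import numpy as np

def readout(U, w):
    """Sketch of Step 3: importance-weighted readout.

    U: (L, d) refined token representations from the Token-GNN.
    w: (d,) scoring vector (stand-in for the learnable readout parameters).
    Returns the sentence vector z and the importance weights pi.
    """
    scores = U @ w                     # one scalar score per token
    scores = scores - scores.max()     # shift for numerical stability
    pi = np.exp(scores) / np.exp(scores).sum()  # importance weights pi_i
    z = pi @ U                         # weighted sum -> sentence vector z
    return z, pi
```

Tokens whose refined representations align with `w` dominate the sum, which is what lets the readout suppress distractor tokens in the signal-dilution stress test.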

Key Results

🛡️ Robust to Signal Dilution

On a diagnostic task where 90% of tokens are random distractors, standard baselines (Mean, Max, AdaPool) collapse to random chance. GLOT maintains >97% classification accuracy.

Chart showing GLOT maintaining over 97% accuracy at 90% signal dilution compared to baselines collapsing.

📈 Superior General Language Understanding

GLOT consistently outperforms mean, max, [CLS], and learnable pooling (AdaPool) across both encoder-only (BERT, RoBERTa) and decoder-only (Llama, Mistral) models.

Chart showing GLOT maintaining superior performance on GLUE compared to standard pooling methods.

⚡ High Performance, Minimal Cost

Compared to baseline fine-tuning techniques, GLOT achieves competitive performance while drastically reducing both memory footprint and runtime.

| Method           | Trainable Parameters | Memory   | Runtime |
|------------------|----------------------|----------|---------|
| Full Fine-Tuning | 7.11 B               | 32.59 GB | 1318 ms |
| LoRA (r = 64)    | 167.8 M              | 33.50 GB | 1454 ms |
| GLOT             | 8.9 M                | 0.42 GB  | 13.4 ms |

BibTeX

If you find our work useful, please cite our paper:

@inproceedings{mantri2026towards,
title={Towards Improved Sentence Representations using Token Graphs},
author={Krishna Sri Ipsit Mantri and Carola-Bibiane Sch{\"o}nlieb and Zorah L{\"a}hner and Moshe Eliasof},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=stMX9KBhUI}
}