Recent Progress on Information Retrieval
Reading across a selection of papers from ICLR 2026 (plus one arXiv preprint), a single argument emerges: 2026 information retrieval work is converging on a shared strategy of pushing cost off the query and online path. Whether the lever is data (Revela trains with far less of it), query-time compute (LightRetriever shrinks the query encoder), index footprint (MILCO and CSRv2 attack representation cost from opposite ends), or model size (the Transformer-SSM hybrid prunes attention to a retrieval-critical core), the recurring move is to relocate expense to where it is cheapest to absorb (offline indexing, fewer labels, or a smaller serving model) rather than to make retrieval itself fundamentally smarter.
The papers below navigate this shared goal from different angles. The table groups them by theme and names the tradeoff each one accepts in exchange for its efficiency gains.
| Theme | Papers | Where cost is moved | Tradeoff accepted |
|---|---|---|---|
| Cheaper supervision | Revela | From labeled pairs to an unsupervised language-modeling signal | Caps out at unsupervised quality; no annotated relevance signal |
| Cheaper query-time compute | LightRetriever | From the online query encoder to offline document encoding | Roughly 5% quality gap; out-of-domain behavior less characterized |
| Cheaper sparse representations | CSRv2, MILCO | CSRv2 to ultra-sparse dimensions; MILCO to a shared English lexical space | CSRv2 degrades at the lowest sparsity; MILCO inherits English-pivot bias |
| Cheaper indexing units | SPLARE, DISCo | SPLARE to SAE latents (also yielding language-agnostic representations); DISCo to submodular subset coverage | SPLARE depends on pretrained SAE quality; DISCo adds formulation complexity |
| Cheaper serving model | Transformer-SSM hybrid | From full attention to a retrieval-critical head subset plus SSMs | Validated only at small scale and short context |
Dense Retrieval with Revela
Revela: Dense Retriever Learning via Language Modeling. Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Tongshuang Wu, Iryna Gurevych, Heinz Koeppl. ICLR 2026.
Revela frames dense retriever training as a language modeling objective, conditioning next-token prediction on cross-document context via in-batch attention weighted by retriever similarity scores. It achieves unsupervised state-of-the-art on BEIR with roughly 1000× less training data and 10× less compute than supervised baselines, and on CoIR and BRIGHT it surpasses larger supervised models without using any annotated or synthetic query-document pairs.
Motivation. Conventional dense retriever training relies on human-annotated query-document relevance pairs, which are expensive to obtain and difficult to scale. Revela eliminates this dependency by deriving a supervisory signal entirely from language modeling.
Core idea. Given a batch of text passages, a language model (LM) performs next-token prediction on each passage. The key modification is that the LM is augmented with cross-document attention: when processing passage $d_i$, it may attend to the hidden representations of the other passages in the batch. A retriever model controls the attention weights over these cross-document interactions. If attending to passage $d_j$ improves the LM's predictive accuracy on passage $d_i$, the gradient signal encourages the retriever to assign higher similarity to the pair $(d_i, d_j)$. In this way, the retriever learns to identify semantically related passages without any explicit relevance labels.
Architecture. Revela comprises two components, both initialized from the same pretrained LLaMA checkpoint and fine-tuned with separate LoRA adapters:
- Retriever (encoder): a causal LLaMA model that encodes each passage into a single vector via last-token pooling. It produces pairwise similarity scores over the batch.
- Language model (reference): a modified LLaMA model augmented with cross-document attention layers, used for next-token prediction.
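Last-token pooling, as used by the retriever component, can be sketched in a few lines. This is a minimal illustration in plain Python, assuming right-padded sequences and a 0/1 attention mask; the helper name `last_token_pool` and the toy inputs are hypothetical and not taken from the Revela code.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token of each sequence.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of 0/1 lists (1 = real token), right-padded.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1  # index of the final real token under right padding
        pooled.append(states[last])
    return pooled


# Two toy sequences of length 3 with 2-dimensional hidden states.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]],  # third position is padding
    [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]],
]
mask = [[1, 1, 0], [1, 1, 1]]
vectors = last_token_pool(hidden, mask)
```

The last position is the natural choice for a causal encoder because it is the only token whose representation has attended to the entire passage.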
Training procedure
Consider a batch of $N$ passages $d_1, \dots, d_N$. Each training step consists of three forward passes. The pseudocode below is an illustrative reconstruction of Revela's in-batch-attention mechanism, spelled out with more architectural detail (the explicit three-pass structure and KV-cache reuse) than the paper states verbatim; treat it as a faithful interpretation rather than the exact implementation.
Pass 1: Pairwise similarity computation.
The retriever encodes each passage into a vector representation, and pairwise cosine similarities are computed:
for i = 1 to N:
    h_i = RetrieverEncode(d_i)    // last-token pooling
    h_i = h_i / ||h_i||           // L2 normalization
for i = 1 to N:
    for j = 1 to N:
        S[i, j] = h_i · h_j       // cosine similarity matrix, N × N
// Compute attention weights with self-similarity masked out
for i = 1 to N:
    S[i, i] = −∞
W = row-wise softmax(S / τ)       // τ is a temperature hyperparameter
Here $W$ is the attention weight matrix, where $W_{ij}$ indicates how much passage $d_j$ should inform the LM's prediction on passage $d_i$. At initialization the retriever produces near-uniform weights ($W_{ij} \approx 1/(N-1)$ for $j \neq i$), reflecting the absence of learned similarity structure.
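The Pass 1 computation can be checked on toy vectors. The sketch below is a plain-Python illustration under stated assumptions: `attention_weights` is a hypothetical helper name, the embeddings are made up, and `tau` plays the role of the temperature. With the diagonal masked out, an untrained retriever that embeds every passage identically spreads its weight evenly over the other N − 1 passages.

```python
import math


def attention_weights(embeddings, tau=1.0):
    """Row-wise softmax over cosine similarities with the diagonal masked out."""
    unit = []
    for h in embeddings:
        norm = math.sqrt(sum(x * x for x in h))
        unit.append([x / norm for x in h])

    n = len(unit)
    weights = []
    for i in range(n):
        # Scores against every other passage; self-similarity is excluded.
        scores = {
            j: sum(a * b for a, b in zip(unit[i], unit[j])) / tau
            for j in range(n)
            if j != i
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exps.values())
        weights.append([exps[j] / total if j != i else 0.0 for j in range(n)])
    return weights


# Identical embeddings mimic a retriever with no similarity structure.
W_uniform = attention_weights([[1.0, 0.0]] * 4)

# Passages 0 and 1 point the same way; passage 2 is orthogonal to both.
W_peaked = attention_weights([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]], tau=0.1)
```

In the second call the low temperature concentrates almost all of row 0 on passage 1, which is the behavior training is meant to produce for genuinely related passages.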
Pass 2: Baseline language modeling (without cross-document attention).
The LM processes each passage independently under the standard causal objective:
for i = 1 to N:
    outputs_i = LM(d_i, use_cache=True)
    KV_cache[i] = outputs_i.key_values    // save K, V matrices at each layer
    L_baseline[i] = outputs_i.loss        // standard next-token prediction loss for passage i
The key-value (KV) cache for each passage is retained for Pass 3. These cached representations encode the semantic content of each passage as processed by the LM.
Pass 3: Language modeling with cross-document attention.
The LM processes each passage a second time, now augmented with cross-document attention that leverages the KV cache from Pass 2:
for i = 1 to N:
    L_cross[i] = LM(
        d_i,
        cross_attn_weights = W[i],    // row i of the attention weight matrix
        cross_attn_kv = KV_cache      // KV cache from all passages
    ).loss
Within the modified LM, each transformer layer performs two attention operations.
for each layer l = 0, 1, ..., L − 1:
    Q, K, V = project(hidden_states)
    // (a) Standard causal self-attention (within-document)
    s = Attention(Q, K, V, mask = causal)
    // (b) Cross-document attention (between-document)
    for each pair (i, j) where i ≠ j:
        b[i][j] = Attention(
            Q = Q[i],
            K,V = KV_cache[j][l],    // layer-l keys and values of passage j, saved in Pass 2
            mask = full              // no causal constraint across documents
        )
    // (c) Weighted aggregation via retriever similarity
    for each passage i:
        b[i] = Σ_{j≠i} W[i,j] · b[i][j]
    // (d) Combine self-attention and cross-document attention
    output = s + b
    // (e) Residual connection and feed-forward
    hidden_states = LayerNorm(hidden_states + output)
    hidden_states = hidden_states + FFN(hidden_states)
Two design choices are worth noting.
- Causal vs. full masking. Self-attention within a passage uses a causal mask (each token attends only to preceding tokens), whereas cross-document attention uses a full (bidirectional) mask. This is valid because there is no autoregressive dependency between distinct documents.
- Hidden state divergence from Pass 2. Although the self-attention operation is structurally identical to Pass 2, the hidden states differ because earlier layers have already incorporated cross-document information, producing different query, key, and value projections.
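The two masking regimes and the weighted aggregation can be made concrete with a single-head toy layer. The sketch below is plain Python and purely illustrative: `attend` and `layer_attention` are hypothetical names, projections, multiple heads, normalization and the feed-forward block are omitted, and the numbers are invented.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(Q, K, V, causal):
    """Scaled dot-product attention; Q, K, V are lists of equal-width vectors."""
    scale = math.sqrt(len(Q[0]))
    out = []
    for t, q in enumerate(Q):
        visible = t + 1 if causal else len(K)  # causal: tokens 0..t only
        probs = softmax(
            [sum(a * b for a, b in zip(q, K[s])) / scale for s in range(visible)]
        )
        out.append(
            [sum(p * V[s][c] for s, p in enumerate(probs)) for c in range(len(V[0]))]
        )
    return out


def layer_attention(i, qkv, kv_cache, W):
    """Self-attention for passage i plus retriever-weighted cross-document attention.

    qkv: (Q, K, V) of passage i at this layer.
    kv_cache: per-passage (K, V) pairs saved at this layer in the baseline pass.
    W: attention weight matrix from the retriever, with W[i][i] == 0.
    """
    Q, K, V = qkv
    out = attend(Q, K, V, causal=True)  # within-document, causal mask
    for j, (K_j, V_j) in enumerate(kv_cache):
        if j == i:
            continue
        b_ij = attend(Q, K_j, V_j, causal=False)  # between-document, full mask
        for t, row in enumerate(b_ij):
            for c, value in enumerate(row):
                out[t][c] += W[i][j] * value  # weighted sum added to self-attention
    return out


# Passage 0 has two tokens; passage 1 has one token whose value vector is [5, 7].
qkv_0 = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
cache = [(qkv_0[1], qkv_0[2]), ([[0.5, 0.5]], [[5.0, 7.0]])]

alone = layer_attention(0, qkv_0, cache, W=[[0.0, 0.0], [0.0, 0.0]])
mixed = layer_attention(0, qkv_0, cache, W=[[0.0, 1.0], [1.0, 0.0]])
```

With all weights at zero the layer reduces to ordinary causal self-attention; with full weight on passage 1, every token of passage 0 additionally receives that passage's value vector, regardless of token position, because the cross-document mask is not causal.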
Gradient flow to the retriever.
The training objective is the cross-entropy loss from Pass 3. Since the attention weights $W$ are computed from the retriever's output in Pass 1 and used differentiably in Pass 3, gradients flow back through $W$ to the retriever parameters. Concretely:
- If attending to passage $d_j$ reduces the Pass 3 loss $L_{\text{cross}}$ for passage $d_i$, the gradient increases $W_{ij}$, encouraging the retriever to assign higher similarity to the pair.
- If attending to an irrelevant passage introduces noise and increases the loss, the gradient decreases the corresponding weight.
Over the course of training, the retriever learns to distinguish semantically related passages from unrelated ones, guided solely by the language modeling signal.
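The direction of these updates follows from the chain rule through the row-wise softmax of Pass 1. The short derivation below is a sketch under the pseudocode's assumptions ($W = \mathrm{softmax}(S/\tau)$ with the diagonal masked); the symbol $g_{ij}$ is introduced here for convenience and denotes the sensitivity of the Pass 3 loss on passage $d_i$ to the weight $W_{ij}$.

```latex
g_{ij} \;:=\; \frac{\partial L_{\text{cross}}}{\partial W_{ij}},
\qquad
\frac{\partial W_{ij}}{\partial S_{ik}} \;=\; \frac{1}{\tau}\, W_{ij}\,\bigl(\delta_{jk} - W_{ik}\bigr)
\quad (j, k \neq i),

\frac{\partial L_{\text{cross}}}{\partial S_{ik}}
\;=\; \sum_{j \neq i} g_{ij}\, \frac{\partial W_{ij}}{\partial S_{ik}}
\;=\; \frac{1}{\tau}\, W_{ik} \Bigl( g_{ik} \;-\; \sum_{j \neq i} W_{ij}\, g_{ij} \Bigr).
```

Gradient descent therefore raises the similarity $S_{ik}$ exactly when $g_{ik}$ lies below the $W$-weighted average of the sensitivities in row $i$, that is, when attending to $d_k$ helps the prediction on $d_i$ more than the other passages in the batch do on average, and lowers it otherwise.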
Inference
After training, the LM component is discarded; only the retriever encoder is retained.
Offline indexing: Each document in the corpus is encoded into a dense vector and stored in an approximate nearest neighbor index (e.g., FAISS).
Online retrieval: A query is encoded using the same retriever, and the top-$k$ nearest documents are retrieved via vector similarity search. This inference pipeline is identical to that of other dense retrievers such as DPR or E5.
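The two inference stages can be sketched with an exact, brute-force search standing in for the approximate nearest neighbor index. This is a minimal plain-Python illustration: `build_index` and `search` are hypothetical helpers, and the 2-dimensional vectors are invented stand-ins for retriever outputs. A production system would swap the linear scan for an ANN library.

```python
import heapq
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(doc_vectors):
    """Offline step: store unit-normalized document vectors."""
    return [normalize(v) for v in doc_vectors]


def search(index, query_vector, k):
    """Online step: exact top-k by cosine similarity (an ANN index approximates this)."""
    q = normalize(query_vector)
    scored = (
        (sum(a * b for a, b in zip(q, d)), doc_id) for doc_id, d in enumerate(index)
    )
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hypothetical 2-d embeddings standing in for retriever outputs.
index = build_index([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]])
top2 = search(index, [0.9, 0.1], k=2)
```

Because the document vectors are computed once offline, the only model call on the online path is the single query encoding.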
Additional Reading
SPLARE: Learning Retrieval Models with Sparse Autoencoders. ICLR 2026. Uses SAE latent features instead of vocabulary-space projections as indexing units for learned sparse retrieval. SPLARE-7B achieves strong results on MMTEB multilingual and English retrieval tasks, and the representations are inherently language-agnostic. The main concern is the dependency on the availability and quality of pretrained SAEs; the 7B model size may also limit deployment in latency-sensitive settings.
DISCo: A Dense Subset Index for Collective Query Coverage. ICLR 2026. Formulates retrieval as submodular coverage: instead of ranking documents independently, DISCo finds small subsets whose combined representations collectively cover the query semantics, targeting multi-hop QA and text-to-SQL. Achieves favorable coverage-vs-latency tradeoffs via sublinear retrieval. The submodular formulation adds complexity over standard top-$k$ retrieval, and the practical gains over strong re-ranking baselines remain unclear.
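To see why coverage differs from independent ranking, consider the textbook greedy heuristic for coverage objectives. The sketch below is not DISCo's index or scoring function; it is a generic plain-Python illustration in which the function `greedy_cover`, the discrete "aspect" sets, and the toy documents are all hypothetical.

```python
def greedy_cover(query_aspects, doc_aspects, budget):
    """Pick up to `budget` documents, each time taking the largest marginal coverage gain."""
    uncovered = set(query_aspects)
    chosen = []
    for _ in range(budget):
        best = max(
            (d for d in doc_aspects if d not in chosen),
            key=lambda d: len(uncovered & doc_aspects[d]),
            default=None,
        )
        if best is None or not uncovered & doc_aspects[best]:
            break  # nothing left adds coverage
        chosen.append(best)
        uncovered -= doc_aspects[best]
    return chosen, uncovered


# Hypothetical multi-hop query with three aspects and three candidate documents.
query = {"author", "book", "award"}
docs = {
    "d1": {"author", "book"},
    "d2": {"author", "book"},  # redundant with d1
    "d3": {"award"},
}
subset, missing = greedy_cover(query, docs, budget=2)
```

Ranking each document on its own would place `d1` and `d2` first, since each matches two aspects, and leave the award aspect uncovered; the greedy subset instead pairs `d1` with `d3`.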
LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference. ICLR 2026. Decouples query and document encoding: a lightweight query encoder achieves over 1000× speedup in query encoding while a full LLM handles offline document encoding, retaining ~95% retrieval quality with a 10× end-to-end throughput gain. The asymmetric design is compelling, though the 5% quality gap may compound in downstream pipelines, and effectiveness on out-of-domain queries with the lightweight encoder is not well characterized.
MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector. ICLR 2026. Maps multilingual queries and documents into a shared English lexical space via a connector module. Outperforms BGE-M3 and Qwen3-Embed with 3× lower retrieval latency and 10× smaller index size. A LexEcho head preserves source-language entities that would otherwise be lost in English projection. The reliance on English as a pivot language may introduce biases for low-resource languages with limited English alignment data.
CSRv2: Unlocking Ultra-Sparse Embeddings. ICLR 2026. Pushes sparse embeddings to extreme sparsity (a very small number $k$ of active dimensions) via progressive $k$-annealing and supervised contrastive objectives. Achieves a 14% accuracy gain in the ultra-sparse regime and up to 300× compute/memory savings over dense embeddings. The ultra-sparse regime is impressive for efficiency, but performance still degrades noticeably at the lowest sparsity levels, and the method has not been validated on retrieval-specific benchmarks beyond classification and matching tasks.
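The efficiency argument for ultra-sparse embeddings is easiest to see in code. The sketch below shows generic top-k sparsification and a sparse dot product in plain Python; it does not reproduce CSRv2's training procedure, and the helper names `sparsify_top_k` and `sparse_dot` as well as the example vectors are hypothetical.

```python
def sparsify_top_k(vector, k):
    """Keep the k largest-magnitude entries as an {index: value} map; drop the rest."""
    keep = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)[:k]
    return {i: vector[i] for i in sorted(keep)}


def sparse_dot(a, b):
    """Dot product touching only indices active in both sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    return sum(value * b[i] for i, value in a.items() if i in b)


dense_a = [0.1, -2.0, 0.0, 0.5, 3.0, 0.2]
dense_b = [0.0, 1.5, 0.3, 0.1, 2.0, -0.4]
sparse_a = sparsify_top_k(dense_a, k=2)
sparse_b = sparsify_top_k(dense_b, k=2)
score = sparse_dot(sparse_a, sparse_b)
```

Storage and scoring cost scale with the number of active dimensions rather than the embedding width, which is where the savings come from; the accuracy cost of discarding the remaining dimensions is what the training method has to recover.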
Retrieval-Aware Distillation for Transformer-SSM Hybrids. arXiv, February 2026. Identifies retrieval-critical attention heads (~2% of all heads) and replaces the rest with SSM modules, recovering over 95% of teacher performance with a 5× to 6× memory reduction. A promising direction for efficient long-context models. Key limitations: validated only at 1B to 1.5B scale, evaluated up to 4K context length, and head selection relies on a synthetic probe task whose transferability to complex reasoning is untested.