Recent Progress on Information Retrieval

Zeyu Yang
PhD student at Rice University

Reading across a selection of papers from ICLR 2026 (plus one arXiv preprint), a single argument emerges: 2026 information retrieval work is converging on a shared strategy of pushing cost off the query and online path. Whether the lever is data (Revela trains with far less of it), query-time compute (LightRetriever shrinks the query encoder), index footprint (MILCO and CSRv2 attack representation cost from opposite ends), or model size (the Transformer-SSM hybrid prunes attention to a retrieval-critical core), the recurring move is to relocate expense to where it is cheapest to absorb (offline indexing, fewer labels, or a smaller serving model) rather than to make retrieval itself fundamentally smarter.

The papers below navigate this shared goal from different angles. The table groups them by theme and names the tradeoff each one accepts in exchange for its efficiency gains.

| Theme | Papers | Where cost is moved | Tradeoff accepted |
| --- | --- | --- | --- |
| Cheaper supervision | Revela | From labeled pairs to an unsupervised language-modeling signal | Caps out at unsupervised quality; no annotated relevance signal |
| Cheaper query-time compute | LightRetriever | From the online query encoder to offline document encoding | Roughly 5% quality gap; out-of-domain behavior less characterized |
| Cheaper sparse representations | CSRv2, MILCO | CSRv2 to ultra-sparse dimensions; MILCO to a shared English lexical space | CSRv2 degrades at the most extreme sparsity; MILCO inherits English-pivot bias |
| Cheaper indexing units | SPLARE, DISCo | SPLARE to SAE latents (also yielding language-agnostic representations); DISCo to submodular subset coverage | SPLARE depends on pretrained SAE quality; DISCo adds formulation complexity |
| Cheaper serving model | Transformer-SSM hybrid | From full attention to a retrieval-critical head subset plus SSMs | Validated only at small scale and short context |

Dense Retrieval with Revela

Revela: Dense Retriever Learning via Language Modeling Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Tongshuang Wu, Iryna Gurevych, Heinz Koeppl. ICLR 2026.

Revela frames dense retriever training as a language modeling objective, conditioning next-token prediction on cross-document context via in-batch attention weighted by retriever similarity scores. It achieves unsupervised state-of-the-art on BEIR with roughly 1000× less training data and 10× less compute than supervised baselines, and on CoIR and BRIGHT it surpasses larger supervised models without using any annotated or synthetic query-document pairs.

Motivation. Conventional dense retriever training relies on human-annotated query-document relevance pairs, which are expensive to obtain and difficult to scale. Revela eliminates this dependency by deriving a supervisory signal entirely from language modeling.

Core idea. Given a batch of text passages, a language model (LM) performs next-token prediction on each passage. The key modification is that the LM is augmented with cross-document attention: when processing passage $A$, it may attend to the hidden representations of other passages $B, C, \ldots$ in the batch. A retriever model controls the attention weights over these cross-document interactions. If attending to passage $B$ improves the LM's predictive accuracy on passage $A$, the gradient signal encourages the retriever to assign higher similarity to the $(A, B)$ pair. In this way, the retriever learns to identify semantically related passages without any explicit relevance labels.

Architecture. Revela comprises two components, both initialized from the same pretrained LLaMA checkpoint and fine-tuned with separate LoRA adapters:

  1. Retriever (encoder): a causal LLaMA model that encodes each passage into a single vector via last-token pooling. It produces pairwise similarity scores over the batch.
  2. Language model (reference): a modified LLaMA model augmented with cross-document attention layers, used for next-token prediction.

Training procedure

Consider a batch of $N$ passages $\{d_1, d_2, \ldots, d_N\}$. Each training step consists of three forward passes. The pseudocode below is an illustrative reconstruction of Revela's in-batch-attention mechanism, spelled out with more architectural detail (the explicit three-pass structure and KV-cache reuse) than the paper states verbatim; treat it as a faithful interpretation rather than the exact implementation.

Pass 1: Pairwise similarity computation.

The retriever encodes each passage into a vector representation, and pairwise cosine similarities are computed:

for i = 1 to N:
    h_i = RetrieverEncode(d_i)   // last-token pooling
    h_i = h_i / ||h_i||          // L2 normalization

for i = 1 to N, j = 1 to N:
    S[i, j] = h_i · h_j          // cosine similarity matrix, N × N

// Compute attention weights with self-similarity masked out
for i = 1 to N:
    S[i, i] = −∞
W = row-wise softmax(S / τ)      // τ is a temperature hyperparameter

Here $W \in \mathbb{R}^{N \times N}$ is the attention weight matrix, where $W_{ij}$ indicates how much passage $j$ should inform the LM's prediction on passage $i$. At initialization the retriever produces near-uniform weights ($W_{ij} \approx \frac{1}{N-1}$ for $j \neq i$), reflecting the absence of learned similarity structure.
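The masked softmax in Pass 1 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Revela's implementation: the function names `l2_normalize` and `attention_weights` are hypothetical, the temperature default `tau=0.05` is an arbitrary placeholder, and the embeddings are passed in directly instead of coming from a retriever.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def attention_weights(embeddings, tau=0.05):
    """Row-wise softmax over cosine similarities with the diagonal masked out.

    tau is an assumed temperature value, not one taken from the paper.
    Requires at least two passages so every row has a finite logit.
    """
    h = [l2_normalize(v) for v in embeddings]
    n = len(h)
    weights = []
    for i in range(n):
        # Self-similarity is set to -inf so passage i never attends to itself.
        logits = [
            sum(a * b for a, b in zip(h[i], h[j])) / tau if j != i else -math.inf
            for j in range(n)
        ]
        m = max(logits)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights
```

Feeding in identical embeddings reproduces the uniform case described above: every off-diagonal entry of a row equals $\frac{1}{N-1}$ and the diagonal is zero.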

Pass 2: Baseline language modeling (without cross-document attention).

The LM processes each passage independently under the standard causal objective:

for i = 1 to N:
    outputs_i = LM(d_i, use_cache=True)
    KV_cache[i] = outputs_i.key_values   // save K, V matrices at each layer
    L_baseline[i] = outputs_i.loss       // standard next-token prediction loss

The key-value (KV) cache for each passage is retained for Pass 3. These cached representations encode the semantic content of each passage as processed by the LM.

Pass 3: Language modeling with cross-document attention.

The LM processes each passage a second time, now augmented with cross-document attention that leverages the KV cache from Pass 2:

for i = 1 to N:
    L_cross[i] = LM(
        d_i,
        cross_attn_weights = W[i],   // row i of the attention weight matrix W
        cross_attn_kv = KV_cache     // KV cache from all passages
    ).loss

Within the modified LM, each transformer layer performs two attention operations.

for each layer l = 0, 1, ..., L − 1:
    Q, K, V = project(hidden_states)

    // (a) Standard causal self-attention (within-document)
    s = Attention(Q, K, V, mask = causal)

    // (b) Cross-document attention (between-document)
    for each pair (i, j) where i ≠ j:
        b[i][j] = Attention(
            Q = Q[i],
            K, V = KV_cache[j][l],   // layer-l keys and values of passage j from Pass 2
            mask = full              // no causal constraint across documents
        )

    // (c) Weighted aggregation via retriever similarity
    for each passage i:
        b_agg[i] = Σ_{j≠i} W[i,j] · b[i][j]

    // (d) Combine self-attention and cross-document attention
    output = s + b_agg

    // (e) Residual connection and feed-forward
    hidden_states = LayerNorm(hidden_states + output)
    hidden_states = hidden_states + FFN(hidden_states)

Two design choices are worth noting.

  • Causal vs. full masking. Self-attention within a passage uses a causal mask (each token attends only to preceding tokens), whereas cross-document attention uses a full (bidirectional) mask. This is valid because there is no autoregressive dependency between distinct documents.
  • Hidden state divergence from Pass 2. Although the self-attention operation is structurally identical to Pass 2, the hidden states differ because earlier layers have already incorporated cross-document information, producing different query, key, and value projections.
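Steps (b) and (c) of the layer can be made concrete with a small single-head sketch in plain Python. It is an illustration under assumptions rather than the paper's code: `attend` and `cross_document_output` are hypothetical names, the cache is modeled as a list of `(keys, values)` pairs for one layer, and multi-head projections, batching, and positional encodings are omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over all given positions."""
    scale = math.sqrt(len(q))
    probs = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]


def cross_document_output(queries_i, kv_cache, w_row, i):
    """Attend from passage i to every other passage, then mix the results by W[i, j].

    queries_i: one query vector per token of passage i
    kv_cache:  list of (keys, values) per passage for a single layer (assumed layout)
    w_row:     row i of the attention weight matrix W
    """
    value_dim = len(kv_cache[0][1][0])
    out = []
    for q in queries_i:
        mixed = [0.0] * value_dim
        for j, (keys, values) in enumerate(kv_cache):
            if j == i:
                continue  # within-document context is handled by causal self-attention
            # Full mask: the query sees every token of passage j.
            b_ij = attend(q, keys, values)
            mixed = [m + w_row[j] * b for m, b in zip(mixed, b_ij)]
        out.append(mixed)
    return out
```

Because the mixing weights enter as plain multipliers on each `b_ij`, the output is differentiable in `w_row`, which is what lets the language modeling loss reach the retriever.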

Gradient flow to the retriever.

The training objective is the cross-entropy loss $\mathcal{L}_{\text{cross}}$ from Pass 3. Since the attention weights $W$ are computed from the retriever's output in Pass 1 and used differentiably in Pass 3, gradients flow back through $W$ to the retriever parameters. Concretely:

  • If attending to passage $d_j$ reduces $\mathcal{L}_{\text{cross}}$ for passage $d_i$, the gradient increases $W_{ij}$, encouraging the retriever to assign higher similarity to the $(d_i, d_j)$ pair.
  • If attending to an irrelevant passage introduces noise and increases the loss, the gradient decreases the corresponding weight.

Over the course of training, the retriever learns to distinguish semantically related passages from unrelated ones, guided solely by the language modeling signal.
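The direction of this gradient can be checked numerically on a toy problem. The sketch below is a deliberate simplification and not Revela's loss: it assumes the loss for one passage is a softmax-weighted average of fixed per-neighbor losses, whereas in the actual model each neighbor contributes through attention inside the LM. The names `toy_loss` and `grad_wrt_sims` are hypothetical.

```python
import math


def softmax(xs, tau):
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def toy_loss(sims, neighbor_losses, tau=1.0):
    """Loss for one passage when neighbors are mixed with softmax(sims / tau).

    neighbor_losses[j] stands in for the LM loss obtained by attending only to
    neighbor j; treating it as a constant is the simplifying assumption.
    """
    w = softmax(sims, tau)
    return sum(wj * cj for wj, cj in zip(w, neighbor_losses))


def grad_wrt_sims(sims, neighbor_losses, tau=1.0, eps=1e-6):
    """Central finite differences of the toy loss with respect to each similarity."""
    grads = []
    for j in range(len(sims)):
        up = list(sims)
        down = list(sims)
        up[j] += eps
        down[j] -= eps
        diff = toy_loss(up, neighbor_losses, tau) - toy_loss(down, neighbor_losses, tau)
        grads.append(diff / (2 * eps))
    return grads


# Three neighbors with equal similarity: the first helps most, the last hurts most.
grads = grad_wrt_sims([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
```

Under this toy model the gradient is negative for the neighbor with below-average loss and positive for the one with above-average loss, so a gradient descent step raises the similarity of the helpful neighbor and lowers that of the harmful one, matching the two cases listed above.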

Inference

After training, the LM component is discarded; only the retriever encoder is retained.

Offline indexing: Each document in the corpus is encoded into a dense vector and stored in an approximate nearest neighbor index (e.g., FAISS).

Online retrieval: A query is encoded using the same retriever, and the top-$k$ nearest documents are retrieved via vector similarity search. This inference pipeline is identical to that of other dense retrievers such as DPR or E5.
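The two inference stages can be sketched with an exact brute-force search standing in for the approximate nearest neighbor index. This is only an illustration: `build_index` and `search` are hypothetical helpers, the vectors are supplied by hand instead of coming from the trained retriever, and a production system would replace the linear scan with an ANN library.

```python
import heapq
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(doc_vectors):
    """Offline step: store unit-normalized document vectors."""
    return [normalize(v) for v in doc_vectors]


def search(index, query_vector, k):
    """Online step: return (doc_id, score) pairs for the k highest cosine similarities."""
    q = normalize(query_vector)
    scored = (
        (sum(a * b for a, b in zip(q, d)), doc_id)
        for doc_id, d in enumerate(index)
    )
    # heapq.nlargest returns results sorted from highest to lowest score.
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
hits = search(index, [2.0, 0.1], k=2)
```

Here `hits` lists document 0 first and document 2 second, since the query points almost exactly along the first axis.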

Additional Reading

SPLARE: Learning Retrieval Models with Sparse Autoencoders. ICLR 2026. Uses SAE latent features instead of vocabulary-space projections as indexing units for learned sparse retrieval. SPLARE-7B achieves strong results on MMTEB multilingual and English retrieval tasks, and the representations are inherently language-agnostic. The main concern is the dependency on the availability and quality of pretrained SAEs; the 7B model size may also limit deployment in latency-sensitive settings.

DISCo: A Dense Subset Index for Collective Query Coverage. ICLR 2026. Formulates retrieval as submodular coverage: instead of ranking documents independently, DISCo finds small subsets whose combined representations collectively cover the query semantics, targeting multi-hop QA and text-to-SQL. Achieves favorable coverage-vs-latency tradeoffs via sublinear retrieval. The submodular formulation adds complexity over standard top-$k$ retrieval, and the practical gains over strong re-ranking baselines remain unclear.

LightRetriever: A LLM-based Text Retrieval Architecture with Extremely Faster Query Inference. ICLR 2026. Decouples query and document encoding: a lightweight query encoder achieves over 1000× speedup in query encoding while a full LLM handles offline document encoding, retaining ~95% retrieval quality with 10× end-to-end throughput gain. The asymmetric design is compelling, though the 5% quality gap may compound in downstream pipelines, and effectiveness on out-of-domain queries with the lightweight encoder is not well characterized.

MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector. ICLR 2026. Maps multilingual queries and documents into a shared English lexical space via a connector module. Outperforms BGE-M3 and Qwen3-Embed with 3× lower retrieval latency and 10× smaller index size. A LexEcho head preserves source-language entities that would otherwise be lost in English projection. The reliance on English as a pivot language may introduce biases for low-resource languages with limited English alignment data.

CSRv2: Unlocking Ultra-Sparse Embeddings. ICLR 2026. Pushes sparse embeddings to extreme sparsity ($k = 2$ to $4$ active dimensions) via progressive $k$-annealing and supervised contrastive objectives. Achieves a 14% accuracy gain at $k = 2$ and up to 300× compute/memory savings over dense embeddings. The ultra-sparse regime is impressive for efficiency, but performance still degrades noticeably at the most extreme sparsity levels (the smallest $k$), and the method has not been validated on retrieval-specific benchmarks beyond classification and matching tasks.
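To make "$k$ active dimensions" concrete, the sketch below shows generic top-$k$ sparsification and a dot product that touches only shared active dimensions. It is not CSRv2's method: selecting by largest magnitude is an assumption made for illustration, the names `top_k_sparsify` and `sparse_dot` are hypothetical, and the paper's training procedure (annealing and contrastive objectives) is not represented.

```python
def top_k_sparsify(vector, k):
    """Keep the k largest-magnitude entries as {dimension: value}; drop the rest."""
    order = sorted(range(len(vector)), key=lambda d: abs(vector[d]), reverse=True)
    return {d: vector[d] for d in sorted(order[:k])}


def sparse_dot(a, b):
    """Dot product over dimensions that are active in both sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(value * b[d] for d, value in a.items() if d in b)


doc = top_k_sparsify([0.1, -0.9, 0.3, 0.7], k=2)
score = sparse_dot(doc, {3: 2.0, 5: 1.0})
```

With $k = 2$ each embedding stores two index-value pairs regardless of the original width, and a similarity computation costs at most two multiplications, which is the source of the efficiency gains in this regime.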

Retrieval-Aware Distillation for Transformer-SSM Hybrids. arXiv, February 2026. Identifies retrieval-critical attention heads (~2% of all heads) and replaces the rest with SSM modules, recovering over 95% of teacher performance with 5× to 6× memory reduction. A promising direction for efficient long-context models. Key limitations: validated only at 1B to 1.5B scale, evaluated up to 4K context length, and head selection relies on a synthetic probe task whose transferability to complex reasoning is untested.