
What Perplexity's Search Architecture Reveals About the Future of the Internet

Zeyu Yang
PhD student at Rice University

Perplexity published a technical report on their production search infrastructure, which now handles 200 million queries daily. The most interesting part is not the benchmark numbers (they win). It is the thesis baked into every design decision: search built for AI models is a fundamentally different system than search built for humans. This distinction has deep implications for what the internet becomes next.

Solving BrowseComp: Three Paths to Building Better Search Agents

Zeyu Yang
PhD student at Rice University

BrowseComp is one of the hardest benchmarks for LLM-based search agents. It requires deep, multi-hop web research where the agent must plan, search, read, and synthesize across dozens of interactions. Three recent papers attack this problem from fundamentally different angles: context management, data quality, and verification. Together they paint a clear picture of what it takes to build a frontier search agent today.

Close Read: LeWorldModel, a JEPA That Trains From Pixels Without the Tricks

Zeyu Yang
PhD student at Rice University

LeWorldModel (LeWM) claims to be the first Joint-Embedding Predictive Architecture that trains stably end to end from raw pixels using only two loss terms: next-embedding prediction plus a single regularizer that forces the latent distribution to be an isotropic Gaussian. The claim mostly holds, and the reason it holds is the cleanest idea in the paper: replace the usual pile of anti-collapse heuristics (stop-gradient, EMA, frozen foundation encoders, seven-term VICReg objectives) with one distribution-matching penalty borrowed from LeJEPA. The headline "one hyperparameter" is real for the loss, but it quietly leans on architectural and quadrature choices that are themselves tuned. This is a close read of the paper from the first equation to the last.
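
To make the two-term structure concrete, here is a minimal sketch in plain Python. It is not the paper's objective: the regularizer below is a simple moment-matching stand-in (batch mean toward zero, batch covariance toward the identity) rather than the distribution-matching penalty borrowed from LeJEPA, and matching two moments does not by itself make a distribution Gaussian. The function names, the weight `lam`, and the choice to regularize the encoder embeddings are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def mse(a: List[Vector], b: List[Vector]) -> float:
    """Mean squared error between two equally shaped batches of embeddings."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    return total / (len(a) * len(a[0]))


def gaussian_moment_penalty(z: List[Vector]) -> float:
    """Stand-in regularizer (assumption, not the LeJEPA penalty):
    squared norm of the batch mean plus squared Frobenius distance
    between the batch covariance and the identity matrix."""
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    penalty = sum(m * m for m in mean)
    for i in range(d):
        for j in range(d):
            cov = sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in z) / n
            target = 1.0 if i == j else 0.0
            penalty += (cov - target) ** 2
    return penalty


def two_term_loss(pred_next: List[Vector], true_next: List[Vector],
                  embeddings: List[Vector], lam: float = 0.1) -> float:
    """Next-embedding prediction error plus one weighted regularizer."""
    return mse(pred_next, true_next) + lam * gaussian_moment_penalty(embeddings)


# A collapsed encoder maps every input to the same point. Prediction is
# then trivially perfect, so only the regularizer can penalize it.
collapsed = [[0.0, 0.0]] * 4
# A batch with zero mean and identity covariance incurs no penalty.
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

collapsed_loss = two_term_loss(collapsed, collapsed, collapsed)
spread_penalty = gaussian_moment_penalty(spread)
```

The collapsed batch has zero prediction error yet a nonzero total loss, which is the role the single regularizer plays in place of stop-gradient, EMA, and similar heuristics.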

Scalable Synthetic Data Generation with LLMs

Zeyu Yang
PhD student at Rice University

Human data annotation is expensive, and projections suggest the stock of high-quality public text could be exhausted within this decade (Villalobos et al., 2024). Synthetic data lets you precisely target capability gaps. The question is not whether to use synthetic data, but how to do it without collapsing into repetitive, low-diversity slop. Here is a survey of the key paradigms and the best open-source tooling to put them into practice.


Taming the Wikipedia Category Graph: SOTA in Compression and Construction

Zeyu Yang
PhD student at Rice University

Why bother? Wikipedia's raw Category: system is not actually a usable taxonomy: it is a cyclic graph that mixes is-a, part-of, and topical relations, conflates instances with classes, and has no consistent typing, so feeding it directly to a knowledge graph, an entity-typing system, or a hierarchical retriever produces garbage. Construction methods turn this soup into a clean DAG that you can actually reason over (x is-a-kind-of y), which is what powers entity typing in YAGO/CaLiGraph, hierarchical RAG, and ontology-grounded LLM agents. Compression matters because even the cleaned graph plus full article membership has billions of edges, and you want it to fit in memory next to your retriever or to be encoded into low-dimensional embeddings for fast subsumption queries.
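
As a rough illustration of why cycles matter, the sketch below runs a depth-first search over a toy child-to-parent category map, drops every edge that closes a cycle, and then answers an ancestor query on the resulting DAG. This is a generic back-edge removal, not the method of any system named above; the toy graph, `break_cycles`, and `ancestors` are hypothetical. Which edge gets dropped depends on traversal order, so this repairs structure only and does nothing about mixed relation types or instance/class confusion.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # child category -> list of parent categories


def break_cycles(graph: Graph) -> Tuple[Graph, List[Tuple[str, str]]]:
    """Return an acyclic copy of `graph` and the list of dropped edges."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color: Dict[str, int] = {node: WHITE for node in graph}
    for parents in graph.values():
        for parent in parents:
            color.setdefault(parent, WHITE)
    dag: Graph = {node: [] for node in color}
    dropped: List[Tuple[str, str]] = []

    def visit(node: str) -> None:
        color[node] = GRAY
        for parent in graph.get(node, []):
            if color[parent] == GRAY:
                # The parent is already on the current path: this edge
                # closes a cycle, so leave it out of the DAG.
                dropped.append((node, parent))
                continue
            dag[node].append(parent)
            if color[parent] == WHITE:
                visit(parent)
        color[node] = BLACK

    for node in list(color):  # insertion order, so leaves listed first win
        if color[node] == WHITE:
            visit(node)
    return dag, dropped


def ancestors(dag: Graph, node: str) -> Set[str]:
    """All categories reachable from `node` by following parent edges."""
    seen: Set[str] = set()
    stack = list(dag[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dag[current])
    return seen


# Toy graph with one bogus edge (Animals -> Dogs) that creates a cycle.
toy: Graph = {
    "Poodles": ["Dogs"],
    "Dogs": ["Mammals"],
    "Mammals": ["Animals"],
    "Animals": ["Dogs"],
}
dag, dropped = break_cycles(toy)
poodle_ancestors = ancestors(dag, "Poodles")
```

The recursive traversal is fine for a toy graph; a graph the size of Wikipedia's would need an iterative DFS to avoid hitting Python's recursion limit.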

Close Read: When Does LeJEPA Learn a World Model?

Zeyu Yang
PhD student at Rice University

The claim: train a representation to pull positive pairs together while forcing its embeddings to be an isotropic Gaussian, and (in a Gaussian world with Ornstein-Uhlenbeck transitions) the only way to win is to recover the true latent variables up to a rotation. The paper proves this is an if and only if: the Gaussian latent distribution is the unique choice for which LeJEPA is linearly identifiable. My verdict: the forward theorem is clean, correct, and genuinely illuminating; the converse and the "Lean-verified" framing are weaker than they sound, because the load-bearing analysis facts are assumed rather than proven, and the central Gaussian-world assumption is exactly the one their own robotics experiment violates.

Close Read: stable-worldmodel, an Infrastructure Bet on Reproducible World-Model Research

Zeyu Yang
PhD student at Rice University

stable-worldmodel (swm) argues that the bottleneck in world-model research is no longer ideas but plumbing: every lab re-implements the same encoder, predictor, CEM planner, and data loader, and the inconsistencies between those copies make published comparisons untrustworthy. The paper's fix is a single PyTorch and Gymnasium platform built on three abstractions (World, Policy, Solver), a Lance-based data layer that loads multimodal trajectories 3 to 4 times faster than HDF5 or MP4, and a factors-of-variation system that turns any environment into a controlled out-of-distribution (OOD) test. The infrastructure claims are concrete and well-supported. The scientific headline, that current world models are brittle under mild distribution shift, is real but rests almost entirely on a single environment (Push-T). This is a close read of the paper from the data layer to the last solver.

Recent Progress on Information Retrieval

Zeyu Yang
PhD student at Rice University

Reading across a selection of papers from ICLR 2026 (plus one arXiv preprint), a single argument emerges: 2026 information retrieval work is converging on a shared strategy of pushing cost off the query and online path. Whether the lever is data (Revela trains with far less of it), query-time compute (LightRetriever shrinks the query encoder), index footprint (MILCO and CSRv2 attack representation cost from opposite ends), or model size (the Transformer-SSM hybrid prunes attention to a retrieval-critical core), the recurring move is to relocate expense to where it is cheapest to absorb (offline indexing, fewer labels, or a smaller serving model) rather than to make retrieval itself fundamentally smarter.

The papers below navigate this shared goal from different angles. The table groups them by theme and names the tradeoff each one accepts in exchange for its efficiency gains.

| Theme | Papers | Where cost is moved | Tradeoff accepted |
| --- | --- | --- | --- |
| Cheaper supervision | Revela | From labeled pairs to an unsupervised language-modeling signal | Caps out at unsupervised quality; no annotated relevance signal |
| Cheaper query-time compute | LightRetriever | From the online query encoder to offline document encoding | Roughly 5% quality gap; out-of-domain behavior less characterized |
| Cheaper sparse representations | CSRv2, MILCO | CSRv2 to ultra-sparse dimensions; MILCO to a shared English lexical space | CSRv2 degrades at the lowest sparsity; MILCO inherits English-pivot bias |
| Cheaper indexing units | SPLARE, DISCo | SPLARE to SAE latents (also yielding language-agnostic representations); DISCo to submodular subset coverage | SPLARE depends on pretrained SAE quality; DISCo adds formulation complexity |
| Cheaper serving model | Transformer-SSM hybrid | From full attention to a retrieval-critical head subset plus SSMs | Validated only at small scale and short context |

What Is Research Taste?

Zeyu Yang
PhD student at Rice University

Research is not about following a rigid plan. It is about exploring, finding signals, and turning those signals into ideas worth sharing. What follows are the personal heuristics I have come to rely on, not universal laws. Many of them have well-known precedents (Hamming's "You and Your Research" is the obvious one); what I want to add is the reasoning behind each one and how I weigh it when making decisions, rather than treating any of them as a rule to follow blindly.