Taming the Wikipedia Category Graph: SOTA in Compression and Construction
Why bother? Wikipedia's raw Category: system is not a usable taxonomy. It is a cyclic graph that mixes is-a, part-of, and topical relations, conflates instances with classes, and has no consistent typing, so feeding it directly to a knowledge graph, an entity-typing system, or a hierarchical retriever yields wrong types and spurious ancestors. Construction methods turn this soup into a clean DAG that you can actually reason over (x is-a-kind-of y), which is what powers entity typing in YAGO and CaLiGraph, hierarchical RAG, and ontology-grounded LLM agents. Compression matters because even the cleaned graph plus full article membership has billions of edges, and you want it either to fit in memory next to your retriever or to be encoded into low-dimensional embeddings for fast subsumption queries.
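To make the "cyclic graph in, reasonable DAG out" idea concrete, here is a minimal sketch on a toy graph. It is not how YAGO or CaLiGraph build their taxonomies: it only drops DFS back edges to break cycles (a generic stand-in for real construction heuristics, which also filter out non-is-a edges), and it answers subsumption by plain reachability. The category names, the `EDGES` list, and the helpers `break_cycles` and `is_a` are all hypothetical, made up for illustration.

```python
from collections import defaultdict

# Toy child -> parent edges. Illustrative only, not real Wikipedia data.
EDGES = [
    ("Physicists", "Scientists"),
    ("Scientists", "People by occupation"),
    ("People by occupation", "People"),
    ("People", "Scientists"),  # artificial edge that closes a cycle
]

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished


def break_cycles(edges):
    """Return the edges minus DFS back edges, so the result is acyclic.

    Which edge of a cycle gets dropped depends on traversal order; real
    construction methods choose using linguistic or structural evidence.
    Recursive for brevity, so only suitable for small graphs.
    """
    graph = defaultdict(list)
    for child, parent in edges:
        graph[child].append(parent)

    color = defaultdict(int)  # every node starts WHITE
    kept = []

    def visit(node):
        color[node] = GRAY
        for parent in graph[node]:
            if color[parent] == GRAY:
                continue  # back edge: it would close a cycle, so drop it
            kept.append((node, parent))
            if color[parent] == WHITE:
                visit(parent)
        color[node] = BLACK

    for node in list(graph):
        if color[node] == WHITE:
            visit(node)
    return kept


def is_a(dag_edges, x, y):
    """Subsumption as reachability: is y an ancestor of x (or x itself)?"""
    parents = defaultdict(list)
    for child, parent in dag_edges:
        parents[child].append(parent)
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, []))
    return False
```

On the raw toy edges, `is_a(EDGES, "People", "Scientists")` holds because of the cycle; after `break_cycles` it no longer does, while `is_a(dag, "Physicists", "People")` still holds. Reachability by graph walk is the part that compression and embedding methods aim to make cheap at full scale.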