Benchmarks & Evaluation
Standard Benchmarks
Evaluating efficient architectures requires benchmarks that measure both quality and efficiency, ideally on realistic tasks at meaningful scale.
MLPerf (Mattson, 2020) provides the industry-standard benchmarks for training and inference performance, measuring wall-clock time to train to a target quality or inference throughput at target latency across different hardware and software stacks. MLPerf's strength is its emphasis on reproducibility and fair comparison across systems, though the benchmark tasks (image classification, object detection, language modeling, recommendation, speech recognition, and reinforcement learning) may not capture the full diversity of practical workloads. MLPerf has driven significant systems-level optimizations: NVIDIA's A100 and H100 submissions show 2-4x speedups per generation on the same benchmarks, attributable roughly equally to hardware improvements, software optimization (cuDNN kernels, compiler improvements), and algorithmic innovations (FlashAttention, fused operators).
HELM (Holistic Evaluation of Language Models) (Liang et al., 2023) evaluates language models across a broad set of tasks (42 scenarios covering knowledge, reasoning, summarization, etc.) with controlled compute budgets, providing the most comprehensive quality benchmark for LLMs. While HELM primarily measures quality rather than efficiency, the combination of HELM scores with efficiency metrics enables meaningful comparisons of quality-per-FLOP across architectures.
Long Range Arena (LRA) (Tay et al., 2021) was specifically designed to evaluate efficient sequence models on tasks requiring long-range dependencies. LRA includes six tasks with sequence lengths up to 16K tokens (ListOps, text classification, document retrieval, image classification from pixel sequences, pathfinder, and PathX). S4 and Mamba achieved breakthrough results on LRA, particularly on PathX (length 16K), where standard Transformers achieve random chance. However, LRA has been criticized for not correlating well with performance on practical tasks like language modeling.
RULER (Hsieh et al., 2024) provides a synthetic benchmark for evaluating long-context capabilities, testing retrieval, multi-hop reasoning, and aggregation over configurable context lengths. RULER enables controlled evaluation of how well different architectures maintain quality as context length increases -- a critical metric for efficient long-context architectures.
LMSys Chatbot Arena provides real-world quality evaluation through human preference ratings, and the serving infrastructure behind it (using vLLM) has driven practical innovations in inference efficiency. While not a formal efficiency benchmark, the economic pressure of serving millions of arena conversations has motivated significant optimization work.
MTEB (Massive Text Embedding Benchmark) evaluates embedding models across diverse tasks and is relevant for evaluating efficient architectures for retrieval and embedding applications, where inference throughput (embeddings per second) is a key metric.
Evaluation Challenges
Efficiency evaluation is complicated by several factors:
Hardware dependence: An architecture efficient on one GPU may not be on another. Sparse attention patterns that are efficient on GPUs with hardware sparse matrix support may be slower than dense attention on GPUs without it. MoE models that excel on multi-GPU systems with high-bandwidth interconnects may be impractical on single GPUs. Any efficiency claim must be qualified by the target hardware.
Theoretical vs. practical complexity: Two algorithms with the same asymptotic complexity can differ by orders of magnitude in practice. FlashAttention has the same O(n^2) theoretical complexity as standard attention but is 2-4x faster due to IO-aware design. Constant factors, memory access patterns, and hardware utilization dominate at practical scales.
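FlashAttention's advantage can be made concrete with its published IO analysis: standard attention performs Θ(n·d + n²) HBM reads/writes because it materializes the n×n score matrix, while the tiled, IO-aware version performs Θ(n²·d²/M) accesses for on-chip SRAM of size M. The sketch below plugs in illustrative values (the SRAM size of 100,000 elements is an assumption, not a measured hardware figure):

```python
def standard_attention_hbm_accesses(n: int, d: int) -> int:
    # Standard attention materializes the n x n score matrix in HBM:
    # Theta(n*d + n^2) reads/writes.
    return n * d + n * n

def flash_attention_hbm_accesses(n: int, d: int, sram: int) -> int:
    # IO-aware tiling keeps blocks in on-chip SRAM holding `sram` elements:
    # Theta(n^2 * d^2 / sram) HBM accesses (FlashAttention IO bound).
    return n * n * d * d // sram

n, d, sram = 4096, 64, 100_000  # illustrative sizes; sram is an assumption
std = standard_attention_hbm_accesses(n, d)
flash = flash_attention_hbm_accesses(n, d, sram)
print(f"standard ~{std:,} accesses, flash ~{flash:,} accesses "
      f"(~{std / flash:.0f}x fewer)")
```

Both functions describe algorithms with identical O(n²) FLOP counts, yet the memory-traffic gap (roughly 25x in this configuration) is what produces the observed 2-4x wall-clock speedup.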
Throughput vs. latency: High-throughput batched inference (maximizing tokens/second across all requests) and low-latency single-query inference (minimizing time-to-first-token for a single request) often require different optimizations. Speculative decoding improves latency but not throughput; larger batch sizes improve throughput but increase latency. Benchmarks should specify which metric they optimize.
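The tension between the two metrics follows from a simple cost model of batched decoding: each decode step pays a fixed cost to stream the model weights (amortized across the batch) plus a per-sequence cost for KV-cache reads. The numbers below are illustrative placeholders, not measurements of any particular GPU:

```python
def decode_step_latency_ms(batch: int, weight_io_ms: float = 20.0,
                           per_seq_ms: float = 0.5) -> float:
    # Toy roofline model: weight streaming is shared by the whole batch,
    # KV-cache traffic scales with the number of sequences.
    return weight_io_ms + per_seq_ms * batch

def throughput_tok_s(batch: int) -> float:
    # One token per sequence per step -> batch tokens per step.
    return batch / decode_step_latency_ms(batch) * 1000.0

for b in (1, 8, 64):
    print(f"batch={b}: {decode_step_latency_ms(b):.1f} ms/step, "
          f"{throughput_tok_s(b):.0f} tok/s")
```

Even in this toy model, growing the batch from 1 to 64 multiplies throughput by more than 25x while also making each individual request slower, which is exactly why a benchmark must state which metric it optimizes.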
Apples-to-apples comparison: Comparing architectures requires controlling for confounding factors: model size, training data, training compute, implementation quality, and hardware. A model that appears more efficient may simply be better implemented. The most reliable comparisons train multiple architectures from scratch on the same data with the same compute budget, but this is expensive and rarely done comprehensively.
Quality-efficiency Pareto frontier: No single number captures efficiency. The meaningful comparison is between quality-efficiency Pareto frontiers -- the set of architectures that are not dominated by any other architecture in the quality-efficiency space. A model is truly more efficient only if it shifts the Pareto frontier. The Chinchilla scaling law (Hoffmann et al., 2022) established one such frontier: for a fixed compute budget, model size and training data should be scaled in roughly equal proportion, with an optimal ratio of roughly 20 training tokens per parameter. Architectures that achieve the same quality with less total compute are genuinely more efficient.
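The dominance test described above is mechanical to compute. A minimal sketch, using hypothetical (quality, compute) measurements where quality is higher-is-better and compute is lower-is-better:

```python
def pareto_frontier(points):
    """Return the (quality, cost) points not dominated by any other point.

    A point is dominated if some other point has quality >= and cost <=,
    with strict improvement in at least one of the two.
    """
    frontier = []
    for q, c in points:
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for q2, c2 in points
        )
        if not dominated:
            frontier.append((q, c))
    return sorted(frontier)

# Hypothetical (benchmark score, training FLOPs in arbitrary units):
models = [(0.62, 1.0), (0.70, 2.0), (0.68, 2.5), (0.75, 4.0), (0.70, 5.0)]
print(pareto_frontier(models))  # -> [(0.62, 1.0), (0.70, 2.0), (0.75, 4.0)]
```

Note that (0.68, 2.5) and (0.70, 5.0) drop out: each is beaten on both axes by (0.70, 2.0), so neither shifts the frontier regardless of how novel its architecture is.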
Practical Efficiency Metrics
Beyond benchmark scores, several practical metrics drive efficiency research in production settings:
- Time-to-first-token (TTFT): The latency from request submission to the first generated token. Critical for interactive applications (chatbots, code completion). TTFT is dominated by prompt processing time, making efficient prefill computation (FlashAttention, prompt caching) essential for long-context applications.
- Tokens per second per GPU: The throughput metric for serving systems. A single H100 GPU can serve approximately 2,000-5,000 tokens/second for a 7B parameter model, depending on batch size and sequence length. For larger models requiring multi-GPU inference, the tokens/second/dollar metric becomes more relevant.
- Memory per token: The KV-cache memory consumption determines the maximum batch size and context length. Standard multi-head attention caches keys and values of size O(n * h * d) per layer, where n is the sequence length, h the number of attention heads, and d the head dimension. Techniques like multi-query attention (Shazeer, 2019), grouped-query attention (Ainslie et al., 2023), and KV-cache compression reduce the number of cached KV heads, cutting this by 4-8x.
- Cost per query: The aggregate metric combining throughput, hardware cost, and utilization. For ChatGPT-scale services processing billions of queries per month, reducing cost-per-query by even 10% represents millions of dollars in savings annually.
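The memory-per-token metric above is straightforward to estimate from a model config. A sketch using a Llama-2-7B-like configuration (32 layers, 32 heads, head dimension 128, fp16); the 8-KV-head GQA variant is an assumption for illustration, not Llama-2-7B's actual layout:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    # Both keys and values (factor 2) are cached for every layer and
    # every KV head; fp16/bf16 -> 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Llama-2-7B-like config with full multi-head attention:
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
# Same model with grouped-query attention using 8 KV heads (assumption):
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
print(mha, gqa, mha // gqa)  # -> 524288 131072 4
```

At 0.5 MB per token under full multi-head attention, a single 4K-token context consumes about 2 GB of cache, which is why reducing KV heads translates directly into larger feasible batch sizes.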
These practical metrics often diverge from theoretical complexity measures: an algorithm with better asymptotic complexity may have worse practical performance due to constant factors, memory access patterns, or poor hardware utilization. This gap between theory and practice makes empirical benchmarking essential for efficiency claims.
References
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP.
- Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch (2022). Training Compute-Optimal Large Language Models. NeurIPS.
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Boris Ginsburg (2024). RULER: What's the Real Context Size of Your Long-Context Language Models?. arXiv.
- Percy Liang, Rishi Bommasani, Tony Lee (2023). Holistic Evaluation of Language Models. TMLR.
- Peter Mattson (2020). MLPerf Training Benchmark. MLSys.
- Noam Shazeer (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv.
- Yi Tay, Mostafa Dehghani, Samira Abnar (2021). Long Range Arena: A Benchmark for Efficient Transformers. ICLR.