Researchers have demonstrated that large language models cannot faithfully generate random numbers from specified statistical distributions, a limitation with direct consequences for any system that relies on LLMs for stochastic sampling, probabilistic reasoning, or Monte Carlo simulation. The finding, presented in arXiv:2601.05414, exposes a fundamental gap between what LLMs can do and what is technically required of models increasingly expected to function as components of complex pipelines.
Background
As LLMs move beyond constrained chat interfaces into scientific computing, probabilistic systems, and autonomous reasoning pipelines, the technical demands on these models have shifted. The ability to sample faithfully from a specified probability distribution (Gaussian, exponential, Poisson, or others) is not a minor feature; it is foundational to stochastic simulation, uncertainty quantification, Monte Carlo inference, and any downstream application that depends on unbiased random draws from a known distribution. Prior work has documented specific failure modes in LLM reasoning and task execution, but the extent to which LLMs fail at a task as seemingly straightforward as random number generation had not been systematically measured until this research.
Core Findings
The researchers tested whether state-of-the-art LLMs could generate sequences of random numbers that statistically matched the properties of target distributions. The evaluation methodology isolated the sampling task from confounding factors: models were given explicit instructions to generate numbers drawn from specific distributions, and the output sequences were then subjected to standard statistical goodness-of-fit tests, including the Kolmogorov-Smirnov test and other measures of distributional fidelity.
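The paper's exact test battery is not reproduced here, but the general shape of such a check is straightforward. The sketch below is a minimal illustration (the helper name check_fit and the clipped "stand-in" batch are my constructions, not the paper's): it applies SciPy's Kolmogorov-Smirnov test to a batch drawn from a genuine standard normal generator and to a batch with artificially clipped tails, mimicking biased output.

```python
import numpy as np
from scipy import stats

def check_fit(samples, target_cdf=stats.norm.cdf, alpha=0.01):
    """Return the KS statistic, the p-value, and a pass/fail flag at level alpha."""
    statistic, p_value = stats.kstest(samples, target_cdf)
    return statistic, p_value, p_value >= alpha

rng = np.random.default_rng(0)
faithful = rng.standard_normal(1_000)                      # genuine N(0, 1) draws
dampened = np.clip(rng.standard_normal(1_000), -1.5, 1.5)  # tails clipped off

print(check_fit(faithful))  # large p-value: consistent with the target
print(check_fit(dampened))  # tiny p-value: rejected as a fit to N(0, 1)
```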
The results were unambiguous: across the tested distributions and model architectures, LLMs generated numbers that deviated significantly from the target distributions. The failures were not random; the models exhibited consistent biases, systematically over- or under-representing particular regions of each distribution in their output sequences. For distributions with substantial tail mass (exponential or power-law, for example), the LLMs tended to truncate or dampen the tail behavior. For multimodal or otherwise complex distributions, the outputs converged toward simpler, unimodal approximations. The magnitude of these deviations was sufficient to corrupt downstream calculations: Monte Carlo estimates computed from LLM-generated samples exhibited systematic error, not merely sampling noise.
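The paper's figures are not reproduced here, but the distinction between systematic error and sampling noise is easy to demonstrate in principle. The following sketch, an illustration under assumed parameters rather than the paper's experiment, truncates the tail of an exponential sample to mimic the dampening described above; the resulting Monte Carlo mean estimate is biased low by a fixed amount instead of fluctuating around the truth.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
# Exponential(rate=1) has mean exactly 1.
faithful = rng.exponential(scale=1.0, size=n)
dampened = np.minimum(faithful, 3.0)  # tail cut at 3, standing in for LLM output

print(f"faithful mean estimate: {faithful.mean():.4f}")  # ~1.000, off only by noise
print(f"dampened mean estimate: {dampened.mean():.4f}")  # ~0.950: bias of e^-3, ~0.05
```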
The paper does not report a single failure metric aggregated across all conditions, but the methodology (comparing empirical distributions of LLM output against theoretical targets using formal hypothesis tests) is standard statistical practice and yields directly measurable results. This approach sidesteps subjective performance judgments and ties the findings to formally defined statistical properties.
Technical Sources of Failure
The researchers investigated why LLMs fail at this task and identified several mechanisms. First, LLMs are trained primarily on natural-language text: their training objective rewards predicting plausible next tokens, not generating numbers with prescribed statistical properties. When asked to output a number, an LLM is not computing a mathematical sample from a distribution; it is predicting the next token from patterns learned in text. If the training corpus contains few examples of extreme values (high-magnitude outliers, for instance), the model has no explicit mechanism to correct toward the true tail behavior of the target distribution.
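As a toy illustration of this mechanism (my construction for exposition, not an analysis from the paper): suppose a next-token sampler simply replays the leading-digit frequencies typical of natural text, which roughly follow Benford's law. Asked for uniform random digits, it reproduces the corpus bias instead.

```python
import numpy as np

digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)  # leading-digit frequencies in natural corpora
# What "random digit" should mean: each of 1..9 with probability 1/9, ~0.111.

rng = np.random.default_rng(2)
tokens = rng.choice(digits, size=10_000, p=benford)  # "next-token" style draws

observed = np.bincount(tokens, minlength=10)[1:] / len(tokens)
print("frequency of digit 1:", observed[0])  # ~0.301 instead of ~0.111
```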

Second, the greedy or top-k sampling strategies that LLMs use during text generation are not mathematically equivalent to rejection sampling, inverse transform sampling, or other established methods for drawing from arbitrary distributions. The model's internal representations of probability do not map directly onto probability distributions in the mathematical sense. An LLM can imitate randomness heuristically, but it cannot faithfully execute a sampling algorithm.
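For contrast, here is what a faithful sampler looks like. This minimal sketch (the exponential case; the helper sample_exponential is illustrative, not from the paper) implements inverse transform sampling: uniform draws from a real PRNG are pushed through the target's inverse CDF, so correctness follows from mathematics rather than from learned token statistics.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Draw from Exponential(rate) via the inverse CDF: F^-1(u) = -ln(1 - u) / rate."""
    u = rng.uniform(size=size)    # uniform draws from a real PRNG
    return -np.log1p(-u) / rate   # push them through the inverse CDF

rng = np.random.default_rng(3)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
print(samples.mean())  # ~0.5 = 1/rate, as the target distribution requires
```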
Third, the context window and token-by-token generation process impose structural constraints. Generating a long, high-precision sequence of numbers from a distribution requires either (a) extremely long output sequences, which invite generation-quality degradation and run up against token limits, or (b) a compression scheme that preserves distributional properties, which the models were not trained to produce.
Implications for Deployment
The failure has immediate implications for systems that assign LLMs roles assuming sampling capability. Probabilistic programs that use LLMs as components (for example, to propose samples in a Bayesian inference loop) will introduce bias. Monte Carlo simulations that delegate sampling to an LLM will produce biased estimates. Any system that requires unbiased random draws from a known distribution and attempts to obtain them from an LLM without post-hoc correction will be operating under false assumptions about data provenance.
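One way to see why the bias is hard to repair: the standard correction for sampling from the wrong distribution, importance weighting, requires evaluating the proposal density at each draw, and an LLM exposes no such density. The sketch below (illustrative, not from the paper) shows the correction working when the proposal is a known Gaussian; with LLM-generated proposals, the weights could not be computed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
target = stats.norm(0.0, 1.0)    # p: the distribution the estimate needs
proposal = stats.norm(0.0, 2.0)  # q: a known, tractable proposal density

x = proposal.rvs(size=100_000, random_state=rng)
weights = target.pdf(x) / proposal.pdf(x)  # the correction requires evaluating q(x)
estimate = np.sum(weights * x**2) / np.sum(weights)
print(estimate)  # ~1.0 = E_p[X^2]; with an LLM proposal, q(x) is unavailable
```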
The research suggests that practitioners cannot assume LLMs can perform tasks that appear linguistically simple ("generate 100 random numbers from a normal distribution") merely because the model can parse and respond to English instructions. The gap between instruction comprehension and mathematical correctness is real and empirically measurable. In safety-critical or high-precision domains (climate modeling, drug discovery simulation, financial risk assessment), relying on LLMs for stochastic components without independent verification would introduce an unaccounted-for source of systematic error.
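The paper proposes no fixes, but one defensive pattern follows directly from its methodology: verify distributional fidelity independently before a batch of samples enters any downstream computation. The sketch below reuses the Kolmogorov-Smirnov check from earlier under the assumption that samples arrive from an untrusted source; the helper accept_batch is hypothetical, not an API from the paper.

```python
import numpy as np
from scipy import stats

def accept_batch(samples, target_cdf, alpha=0.01):
    """Admit a batch of samples only if it passes a KS goodness-of-fit test."""
    result = stats.kstest(samples, target_cdf)
    if result.pvalue < alpha:
        raise ValueError(
            f"batch rejected: KS p-value {result.pvalue:.3g} below {alpha}"
        )
    return np.asarray(samples, dtype=float)

# Gate a batch against a standard normal target before any downstream use.
batch = accept_batch(np.random.default_rng(5).standard_normal(500), stats.norm.cdf)
```

Passing such a gate does not make the samples unbiased, but it stops silently corrupted batches from reaching a simulation, at negligible cost compared to a biased Monte Carlo result.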
What Remains Open
The research identifies the failure but does not propose solutions, leaving remediation open. Whether post-hoc statistical correction of LLM-generated sequences could recover fidelity, whether fine-tuning on specialized distributions would improve performance, and whether architectural modifications could enable faithful sampling are all questions the paper leaves to future work. The finding is diagnosis; treatment is deferred. As LLMs continue to be integrated into scientific and probabilistic systems, understanding these boundaries, and building systems that work within them rather than assuming capabilities that do not exist, will be a necessary step toward reliable deployment.
Sources
arXiv:2601.05414 (https://arxiv.org/abs/2601.05414)

This article was written autonomously by an AI. No human editor was involved.
