5 articles

New approaches use statistical inference, rate-distortion theory, and learned eviction to reduce the memory cost of long-context LLM inference.
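A minimal sketch of score-based KV-cache eviction, a generic heuristic rather than any of these articles' learned policies; the function name and the assumption that a per-token attention score is already tracked are illustrative only:

```python
import numpy as np

def evict_kv_cache(keys, values, attn_scores, budget):
    """Keep only the `budget` cached tokens with the highest cumulative
    attention mass and evict the rest (illustrative heuristic)."""
    # keys, values: (seq_len, head_dim); attn_scores: (seq_len,) cumulative
    # attention each cached token has received from later queries.
    keep = np.argsort(attn_scores)[-budget:]
    keep.sort()  # preserve original token order
    return keys[keep], values[keep]

# Toy usage: an 8-token cache reduced to a 4-token budget.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))
scores = rng.random(8)
K_small, V_small = evict_kv_cache(K, V, scores, budget=4)
print(K_small.shape)  # (4, 64)
```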

Researchers tackle the ways post-training quantization distorts model behavior under memory and latency constraints.

New framework optimizes total AI costs by accounting for inference-time scaling alongside training.
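A rough illustration of that accounting, with hypothetical numbers rather than figures from the article: total cost can be modeled as one-off training plus per-query inference, where inference-time scaling (e.g., sampling several candidates per query) multiplies the per-query term.

```python
def total_cost(train_cost, cost_per_query, queries, inference_scale=1.0):
    """One-off training cost plus per-query serving cost, with an
    inference-time scaling multiplier (hypothetical accounting sketch)."""
    return train_cost + cost_per_query * inference_scale * queries

# With 10x inference-time scaling, serving dominates the one-off training cost.
print(total_cost(train_cost=1e6, cost_per_query=0.002, queries=1e9))                        # 3.0e6
print(total_cost(train_cost=1e6, cost_per_query=0.002, queries=1e9, inference_scale=10.0))  # 2.1e7
```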

New quantization techniques accelerate both inference and prompt processing for local model deployment.

New quantization algorithm enables longer context windows and 3.2× memory savings for local inference.
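One way a 3.2× figure can arise, assuming it refers to bits per stored value (an assumption, not the article's derivation): compressing 16-bit values to an effective 5 bits per value, which also lets a fixed memory budget hold a proportionally longer KV cache.

```python
def memory_savings(orig_bits: float, quant_bits: float) -> float:
    """Compression ratio from reducing bits per value; ignores metadata
    such as per-group scales."""
    return orig_bits / quant_bits

# FP16 values quantized to an effective 5 bits per value (illustrative only).
print(memory_savings(16, 5))  # 3.2

# At a fixed memory budget, the cache can hold proportionally more tokens.
context_16bit = 8192
print(int(context_16bit * memory_savings(16, 5)))  # 26214 tokens in the same memory
```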