4 articles

New approaches use statistical inference, rate-distortion theory, and learned eviction to reduce the memory cost of long-context LLM inference.
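To make the eviction idea concrete, here is a minimal sketch of score-based KV-cache eviction: cached tokens get an importance score (for example, accumulated attention weight), and only the highest-scoring tokens are kept under a fixed budget. The simple "keep top-k" rule stands in for whatever learned policy the articles actually describe; all names, shapes, and parameters below are illustrative assumptions, not the articles' method.

```python
import numpy as np

def evict_kv_cache(keys, values, scores, budget):
    """Keep only the `budget` highest-scoring cached tokens.

    keys, values: (seq_len, num_heads, head_dim) cached tensors
    scores:       (seq_len,) per-token importance estimates (assumed given)
    budget:       number of tokens to retain
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the highest-scoring tokens, restored to original order
    keep = np.sort(np.argpartition(scores, -budget)[-budget:])
    return keys[keep], values[keep]

# Example: shrink a 1024-token cache to a 256-token budget
rng = np.random.default_rng(0)
seq_len, num_heads, head_dim = 1024, 8, 64
keys = rng.standard_normal((seq_len, num_heads, head_dim), dtype=np.float32)
values = rng.standard_normal((seq_len, num_heads, head_dim), dtype=np.float32)
scores = rng.random(seq_len)  # stand-in for a learned importance signal

small_k, small_v = evict_kv_cache(keys, values, scores, budget=256)
print(small_k.shape)  # (256, 8, 64) -> 4x fewer cached tokens
```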

KV-cache and tokenizer bugs squashed, making local inference viable.

The framework now supports aggressive KV-cache compression, making on-device models faster to run.

New quantization algorithm enables longer context windows and 3.2× memory savings for local inference.
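The article does not spell out where the 3.2× figure comes from, but one layout that yields roughly that number is 4-bit values with an fp16 scale and fp16 zero-point stored per group of 32 elements (16 bits of data become 4 + 1 bits per value). The sketch below works through that arithmetic; the parameters `bits=4` and `group_size=32` are assumptions for illustration, not the algorithm's actual design.

```python
def kv_bytes(num_tokens, num_layers, num_heads, head_dim,
             bits=16, group_size=None):
    """Bytes used by the K and V caches for one sequence."""
    values = 2 * num_tokens * num_layers * num_heads * head_dim  # K and V
    data_bits = values * bits
    overhead_bits = 0
    if group_size is not None:
        # Assumed layout: fp16 scale + fp16 zero-point per quantization group
        overhead_bits = (values // group_size) * (16 + 16)
    return (data_bits + overhead_bits) // 8

# Hypothetical model: 32 layers, 32 heads, head_dim 128, 8k-token context
fp16 = kv_bytes(8192, 32, 32, 128, bits=16)
int4 = kv_bytes(8192, 32, 32, 128, bits=4, group_size=32)
print(f"fp16 cache:  {fp16 / 2**30:.2f} GiB")
print(f"4-bit cache: {int4 / 2**30:.2f} GiB  ({fp16 / int4:.1f}x smaller)")
# -> 4.00 GiB vs 1.25 GiB, i.e. 3.2x smaller under these assumptions
```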