
Most people assume that the costly part of artificial intelligence is training it. Months of computing, thousands of GPUs, hundreds of millions of dollars. That's true. But there is another cost, quieter and more everyday, that few see: keeping it running.
Every time you interact with ChatGPT, Gemini, or Claude, the model needs to remember everything said so far in the conversation. That working memory is called the KV cache, and it grows with each message. In long conversations, or with long documents, that space becomes huge. Running a large model for 512 simultaneous users can consume up to 512 gigabytes of memory for the cache alone, almost four times what the model itself needs.
That translates into hardware, electricity, and a hard limit on how long a conversation can last before the system crashes or becomes prohibitively expensive.
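A back-of-the-envelope calculation shows how numbers like these arise. The model dimensions below (layers, heads, head size, context length) are illustrative assumptions, not the specs of any particular model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, users, bytes_per_value=2):
    """Keys + values: 2 tensors per layer, each [kv_heads, seq_len, head_dim].

    bytes_per_value=2 assumes 16-bit floating point storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * users * bytes_per_value

# Hypothetical mid-size model: 32 layers, 8 KV heads of dimension 128,
# an 8,192-token context, and 512 concurrent users.
total = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, users=512)
print(total // 2**30, "GiB")  # 512 GiB
```

The point is that the cache scales linearly with both context length and user count, which is why it overtakes the model weights themselves.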
What Google just changed

On March 24, Google Research published TurboQuant: an algorithm that compresses that cache by up to a factor of six without losing quality. The result was presented at ICLR 2026, the largest machine learning conference of the year.
What's notable is not just the level of compression. It works without retraining the model, without calibrating it, without specific data. It is applied directly on top of what already exists. And on standard benchmarks (text comprehension, code generation, summarization) the compressed model obtained results identical to the original model's.
Researchers use the term ‘absolute quality neutrality’. Not approximate. Identical.
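To get an intuition for the kind of compression involved, here is a generic low-bit quantization sketch: store each cache vector as 4-bit integers plus one shared scale, instead of 16-bit floats. This is a textbook illustration of the idea, not TurboQuant's actual algorithm:

```python
def quantize4(xs):
    """Map floats to signed 4-bit integers in [-7, 7] with a shared scale."""
    scale = max(abs(x) for x in xs) / 7 or 1.0  # avoid zero scale
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [v * scale for v in q]

vec = [0.12, -0.98, 0.45, 0.03]       # toy cache values
q, s = quantize4(vec)
approx = dequantize(q, s)
# 16-bit floats -> 4-bit ints: roughly 4x smaller, with a bounded
# rounding error of at most half the scale per value.
```

The hard part, which the paper addresses, is doing this aggressively without the rounding error degrading the model's answers.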
The algorithm also delivered attention computation up to eight times faster on H100 GPUs, the most advanced hardware available today. That figure applies to the attention component specifically, not to the entire inference pipeline, but it is still a significant operational difference.
Why it matters beyond the technical

If the cache occupies six times less memory, the same hardware can serve six times more users, hold conversations six times longer, or run bigger models on devices with fewer resources. All three options are real, with different trade-offs depending on the case.
Google did not publish official code. Even so, within days of the paper's announcement, independent developers replicated the results from scratch. One tested the system on a consumer GPU and got bit-for-bit identical responses to the uncompressed model. That doesn't happen often. It means the paper holds up.
There is a silent race to lower the cost of operating AI. Not to build it; to use it every day. This race gets no headlines or keynote applause, but it will determine which companies can scale their models, and which will discover that the limit is not what they know how to build, but how much it costs to keep running.
The smartest AI in the world is useless if you can’t afford it.
