
Google solved the most difficult problem in AI and almost no one noticed it


Google Research's new TurboQuant algorithm compresses the artificial intelligence cache up to six times without loss of quality REUTERS/Arnd Wiegmann/File

Most people assume that the costly part of artificial intelligence is training it. Months of computing, thousands of GPUs, hundreds of millions of dollars. That's true. But there is another cost, quieter and more everyday, that few see: keeping it running.

Every time you interact with ChatGPT, Gemini, or Claude, the model needs to remember everything said so far in the conversation. That working memory is called the KV cache, and it grows with each message. In long conversations or with long documents, it becomes huge. Running a large model for 512 simultaneous users can consume up to 512 gigabytes of memory for the cache alone, almost four times what the model itself needs.

That translates into hardware, electricity, and a hard limit on how long a conversation can last before the system runs out of memory or becomes prohibitively expensive.
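The 512-gigabyte figure can be sanity-checked with back-of-envelope arithmetic: the cache stores one key and one value vector per layer, per attention head, per token. A minimal sketch, where the model dimensions (layer count, KV head count, head dimension, fp16 storage) are illustrative assumptions, not the configuration of any specific model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Estimate KV cache size: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical large model: 64 layers, 8 grouped-query KV heads,
# head dimension 128, fp16 storage (2 bytes per element).
per_user = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128,
                          seq_len=4096, batch=1)
print(f"per user at 4K tokens: {per_user / 2**30:.2f} GiB")
print(f"512 users: {512 * per_user / 2**30:.0f} GiB")
```

With these assumed dimensions, each user at a 4,096-token context costs about one gibibyte, so 512 concurrent users land in the hundreds of gigabytes, which is the scale the article describes.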

What Google just changed

Temporary memory storage, known as the KV cache, can consume up to four times more memory than the AI model itself in prolonged conversations (Illustrative Image Infobae)

On March 24, Google Research published TurboQuant: an algorithm that compresses that cache up to six times without losing quality. The result was presented at ICLR 2026, the largest machine learning conference of the year.

What is notable is not just the level of compression. It works without retraining the model, without calibrating it, without task-specific data. It is applied directly on top of what already exists. And on standard benchmarks (text comprehension, code generation, summarization), the compressed model obtained results identical to the original model's.

The researchers use the term "absolute quality neutrality". Not approximate. Identical.
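The article does not describe TurboQuant's actual method, so the following is not it; as a generic illustration of what cache quantization means, here is a minimal round-to-nearest int4 sketch in NumPy, with one scale per key/value vector (all names and parameters are hypothetical):

```python
import numpy as np

def quantize_kv(x, bits=4):
    """Symmetric quantization: each vector along the last axis gets its
    own scale, and values are rounded to small signed integers."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard all-zero vectors
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((2, 8, 64)).astype(np.float32)  # toy cache slab
q, scale = quantize_kv(kv, bits=4)
recon = dequantize_kv(q, scale)
# fp16 (16 bits) down to 4 bits is a 4x reduction before scale overhead.
print("max abs reconstruction error:", float(np.abs(recon - kv).max()))
```

The trade-off this sketch makes visible is why "quality neutrality" is a strong claim: naive round-to-nearest always introduces some reconstruction error, and keeping benchmark results identical despite that is the hard part.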

The algorithm also showed attention computation up to eight times faster on H100 GPUs, the most advanced hardware widely available today. That figure applies to the attention component specifically, not to end-to-end inference, but it is still a significant operational difference.

Why it matters beyond the technical

Google's algorithm speeds up attention computation on H100 GPUs by up to eight times, making better use of the most advanced hardware (Illustrative Image Infobae)

If the cache occupies six times less memory, the same hardware can serve six times more users, hold conversations six times longer, or run bigger models on devices with fewer resources. All three options are real, with different trade-offs depending on the use case.

Google did not publish official code. Even so, within days of the paper's announcement, independent developers replicated the results from scratch. One tested the system on a consumer GPU and got responses bit-for-bit identical to the uncompressed model's. That does not happen often. It means the paper delivers what it claims.

There is a silent race to lower the cost of operating AI. Not of building it. Of using it every day. This race gets no front pages or keynotes with applause, but it will determine which companies can scale their models and which will discover that the limit is not what they know how to build, but how much it costs to keep doing it.

The smartest AI in the world is useless if you can’t afford it.




