
ZDNET's key takeaways
- Google's TurboQuant can dramatically reduce AI memory usage.
- TurboQuant is a response to the spiraling cost of AI.
- A positive outcome is making AI more accessible by lowering inference costs.
With the cost of artificial intelligence skyrocketing thanks to soaring prices for computer components such as memory, Google last week responded with a proposed technical innovation called TurboQuant.
TurboQuant, which Google researchers discussed in a blog post, is another DeepSeek AI moment: a serious attempt to cut the cost of AI. By shrinking models' memory usage, it could make them markedly more efficient, with lasting benefit.
Also: What is DeepSeek AI? Is it safe? Here's everything you need to know
Even so, just as DeepSeek did not stop massive investment in AI chips, observers say TurboQuant will likely lead to continued growth in AI investment. It's the Jevons paradox: Make something more efficient, and it ends up increasing overall usage of that resource.
However, TurboQuant is an approach that may help run AI locally by slimming the hardware demands of a large language model.
More memory, more money
The big cost factor for AI at the moment — and probably for the foreseeable future — is the ever-greater use of memory and storage technologies. AI is data-hungry, introducing a reliance on memory and storage unprecedented in the history of computing.
TurboQuant, first described by Google researchers in a paper a year ago, employs “quantization” to reduce the number of bits and bytes required to represent the data.
Also: Why you'll pay more for AI in 2026, and 3 money-saving tips to try
Quantization is a form of data compression that uses fewer bits to represent the same value. In the case of TurboQuant, the focus is on what's called the “key-value cache,” or, for shorthand, “KV cache,” one of the biggest memory hogs of AI.
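The basic idea can be sketched in a few lines of Python. This toy example uses simple int8 scalar quantization with a single scale factor; it is an illustration of quantization in general, not TurboQuant's actual scheme:

```python
import numpy as np

# Toy quantization: store float32 values (4 bytes each) as 8-bit
# integers (1 byte each) plus one shared scale factor.
values = np.array([0.12, -0.53, 0.98, -0.07], dtype=np.float32)

scale = np.abs(values).max() / 127.0                  # map range onto int8
quantized = np.round(values / scale).astype(np.int8)  # 1 byte per value
restored = quantized.astype(np.float32) * scale       # approximate originals

print(values.nbytes, quantized.nbytes)        # storage before vs. after
print(np.abs(values - restored).max())        # rounding error, at most scale/2
```

The compressed array takes a quarter of the space, at the cost of a rounding error bounded by half the scale factor.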
When you type into a chatbot such as Google's Gemini, the AI has to compare what you've typed against stored numeric representations that serve as a kind of database.
What you type is converted into a query, which is matched against data held in memory, called a key, by computing a numeric similarity score. The matching key is then used to retrieve from memory exactly which words should be returned to you as the AI's response, known as the value.
Normally, every time you type, the AI model must calculate a new key and value, which can slow the whole operation. To speed things up, the machine retains a key-value cache in memory to store recently used keys and values.
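The mechanism can be sketched in a few lines: each new token's key and value are computed once and appended to the cache, so earlier tokens never need reprocessing. The toy dimensions and random matrices below stand in for a real model's learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (toy size; real models use hundreds)

# Stand-ins for the model's learned query/key/value projections.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # the KV cache grows with every token

def step(token_embedding):
    """Process one new token, reusing all cached keys and values."""
    q = token_embedding @ Wq
    k_cache.append(token_embedding @ Wk)   # compute this token's K and V once...
    v_cache.append(token_embedding @ Wv)   # ...and keep them in memory
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)            # similarity of query to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over all cached keys
    return weights @ V                     # weighted mix of cached values

for _ in range(5):                         # after 5 tokens, cache holds 5 K,V pairs
    out = step(rng.standard_normal(d))
```

Note that the cache grows by one key and one value per token, which is exactly the scaling problem described next.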
The cache then becomes its own problem: The more you work with a model, the more memory the key-value cache takes up. “This scaling is a significant bottleneck in terms of memory usage and computational speed, especially for long context models,” according to Google lead author Amir Zandieh and colleagues.
Also: AI isn't getting smarter, it's getting more power hungry – and expensive
Making things worse, AI models are increasingly being built with larger context windows, the span of text a model can consider at once, which means more keys and values to store. A larger window gives the model more to draw on, potentially improving accuracy. Gemini 3, the current version, made a big leap in context window to one million tokens. Prior state-of-the-art models such as OpenAI's GPT-4 had a context window of just 32,768 tokens. A larger context window also increases the amount of memory a key-value cache consumes.
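A back-of-the-envelope calculation shows why that matters. Assuming an illustrative model shape (32 layers, 32 attention heads of dimension 128, values stored in 16-bit floats; these numbers are hypothetical, not the actual configuration of Gemini 3 or GPT-4), the KV cache grows linearly with context length:

```python
# Rough KV-cache size: two tensors (keys and values) per layer,
# each seq_len x n_heads x head_dim numbers at 2 bytes apiece.
# Layer/head counts here are illustrative, not any real model's.
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per=2):
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per

print(kv_cache_bytes(32_768) / 2**30)     # GiB at a 32K-token context
print(kv_cache_bytes(1_000_000) / 2**30)  # GiB at a 1M-token context
```

Under these assumptions, a 32K context already consumes 16 GiB of cache, and a million-token context balloons to hundreds of gigabytes, far beyond a single accelerator's memory.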
Speeding up quantization for real-time
The solution to that expanding KV cache is to quantize the keys and the values so the whole thing takes up less space. Zandieh and team claim in their blog post that the data compression is “massive” with TurboQuant. “Reducing the KV cache size without compromising accuracy is essential,” they write.
Quantization has been used by Google and others for years to slim down neural networks. What's novel about TurboQuant is that it's meant to quantize in real time. Previous compression approaches reduced the size of a neural network at compile time, before it is run in production.
Also: Nvidia wants to own your AI data center from end to end
That's not good enough, observed Zandieh. The KV cache is a living digest of what's learned at “inference time,” when people are typing to an AI bot, and the keys and values are changing. So, quantization has to happen fast enough and accurately enough to keep the cache small while also staying up to date. The “turbo” in TurboQuant implies this is a lot faster than traditional compile-time quantization.
Two-stage approach
TurboQuant has two stages. First, the queries and keys are compressed. This can be done geometrically because queries and keys are vectors of data that can be depicted on an X-Y graph as arrows from the origin, which can be rotated on that graph. The researchers call the rotation scheme "PolarQuant." By randomly trying different rotations with PolarQuant and then recovering the original vector, they find a smaller number of bits that still preserves accuracy.
As they put it, “PolarQuant acts as a high-efficiency compression bridge, converting Cartesian inputs into a compact Polar ‘shorthand' for storage and processing.”
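A simplified sketch of that Cartesian-to-polar idea (the real PolarQuant differs in its details): split a vector into 2-D pairs, then store each pair as a magnitude plus a coarsely quantized angle instead of two full-precision coordinates:

```python
import numpy as np

def to_polar_shorthand(vec, angle_bits=4):
    """Toy polar encoding: magnitude kept, angle squeezed into a few bits."""
    pairs = vec.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)              # magnitude of each 2-D pair
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])   # angle in [-pi, pi]
    levels = 2 ** angle_bits
    code = np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)).astype(np.uint8)
    return r, code, levels

def from_polar_shorthand(r, code, levels):
    """Decode back to Cartesian coordinates from the compact shorthand."""
    theta = code / (levels - 1) * 2 * np.pi - np.pi
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1).reshape(-1)

rng = np.random.default_rng(1)
v = rng.standard_normal(8)
r, code, levels = to_polar_shorthand(v)
approx = from_polar_shorthand(r, code, levels)
```

In this toy version the magnitudes survive exactly and only the angles are rounded, which keeps the reconstruction error small and bounded.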
The compressed vectors still produce errors when the comparison is performed between the query and the key, a calculation known as the "inner product" of two vectors. To fix that, they use a second method, QJL, introduced by Zandieh and colleagues in 2024. That approach keeps one of the two vectors in its original, full-precision state, so that multiplying a compressed (quantized) vector with an uncompressed one introduces less error into the result.
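The benefit of keeping one side at full precision can be demonstrated with a toy asymmetric inner product. This illustrates the principle only; it is not the actual QJL algorithm:

```python
import numpy as np

def quantize(x, bits=3):
    """Round each element to one of a handful of levels (toy scheme)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(2)
errs_asym, errs_sym = [], []
for _ in range(200):
    q = rng.standard_normal(64)   # query
    k = rng.standard_normal(64)   # key
    exact = q @ k
    errs_asym.append(abs(exact - q @ quantize(k)))          # only key compressed
    errs_sym.append(abs(exact - quantize(q) @ quantize(k))) # both compressed

print(np.mean(errs_asym), np.mean(errs_sym))  # asymmetric error is smaller
```

Averaged over many random vectors, quantizing only one side of the product consistently yields a smaller error than quantizing both, which is the intuition behind the asymmetric design.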
They tested TurboQuant by applying it to Meta's open-source Llama 3.1-8B AI model, and found that "TurboQuant achieves perfect downstream results across all benchmarks while reducing the key value memory size by a factor of at least 6x" — a six-fold reduction in the amount of KV cache needed.
The approach also differs from other methods for compressing the KV cache, such as the approach taken last year by DeepSeek, which constrained key and value searches to speed up inference.
Also: DeepSeek claims its new AI model can cut the cost of predictions by 75% – here's how
In another test, using Google's Gemma open-source model and models from French AI startup Mistral, “TurboQuant proved it can quantize the key-value cache to just 3 bits without requiring training or fine-tuning and causing any compromise in model accuracy,” they wrote, “all while achieving a faster runtime than the original LLMs (Gemma and Mistral).”
“It is exceptionally efficient to implement and incurs negligible runtime overhead,” they observed.
Will AI be any cheaper?
Zandieh and team expect TurboQuant to have a significant impact on the production use of AI inference. “As AI becomes more integrated into all products, from LLMs to semantic search, this work in fundamental vector quantization will be more critical than ever,” they wrote.
Also: Want to try OpenClaw? NanoClaw is a simpler, potentially safer AI agent
But will it really reduce the cost of AI? Yes and no.
In an age of agentic AI, meaning programs such as OpenClaw that operate autonomously, there are a lot of parts to AI besides just the KV cache. Other uses of memory, such as retrieving and storing database records, will ultimately affect an agent's efficiency over the long term.
Observers of the AI chip world argued last week that just as DeepSeek AI's efficiency gains didn't slow AI investment last year, neither will TurboQuant's.
Vivek Arya, a Merrill Lynch analyst who follows AI chips, wrote to his clients who were worried about DRAM maker Micron Technology that TurboQuant will simply make more efficient use of AI. The “6x improvement in memory efficiency [will] likely [lead] to 6x increase in accuracy (model size) and/or context length (KV cache allocation), rather than 6x decrease in memory,” wrote Arya.
Also: AI agents of chaos? New research shows how bots talking to bots can go sideways fast
What TurboQuant can do, though, is make some individual instances of AI more economical, especially for local deployment.
For example, a swelling KV cache and longer context windows may prove less of a burden when running some AI models on limited hardware budgets. That will be a relief for users of OpenClaw who want their MacBook Neo or Mac mini to serve as a budget local AI server.