Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
Cactus-compute released Needle, a 26-million parameter model that distills Google Gemini's tool-calling capabilities into a compact, open-source alternative, potentially reshaping the AI stack by enabling on-device tool calling.
The 26 Million Parameter Needle That Could Reshape the AI Stack
On the surface, the announcement landing on Hacker News this morning looks like just another open-source model release—a small one at that. A team calling themselves cactus-compute has released "Needle," a 26-million parameter model that claims to have distilled Google Gemini's tool-calling capabilities into something that fits in your pocket [1]. But in the context of everything else happening in the AI ecosystem this week—Google's Android 17 rollout, the expansion of Gemini Intelligence into every corner of the mobile experience, and the escalating arms race around on-device AI—this tiny model represents something far more consequential than its size suggests.
The timing is almost too perfect. Just 24 hours before Needle hit GitHub, Google announced that Gemini Intelligence would be baked into Android 17, bringing "the very best of Gemini to our most advanced Android devices" [3]. The Verge reported that Gemini would now appear in Chrome on Android, in autofill suggestions, and "all up in your apps" [3]. Wired confirmed users would soon be able to "generate your own widgets or ask Gemini to finish a booking in Chrome" [2]. And TechCrunch noted that Google was adding Gemini-powered dictation to Gboard, a move that could spell trouble for dedicated dictation startups [4].
This is the context in which Needle arrives—not as a competitor to Gemini, but as something potentially more disruptive: a distillation of its most valuable capability into a form factor that changes the economics of deployment.
The Architecture Behind the Model
Let's examine the technical specifics, because the details matter. Needle is a 26-million parameter model specifically focused on tool calling—the ability for an AI system to invoke external functions, APIs, and services programmatically [1]. This is not a general-purpose chatbot. It's not trying to be the next GPT-4 or Claude. It's a surgical extraction of one specific capability from Google's Gemini, compressed into a model that is orders of magnitude smaller than its source.
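To make "tool calling" concrete, here is a minimal sketch of the pattern: the model's only job is to map a natural-language request onto a structured function invocation, which application code then executes. The tool name, schema, and dispatch logic below are invented for illustration; they are not Needle's actual API.

```python
import json

# A tool schema in the style popularized by Gemini/OpenAI function calling.
# All names here are hypothetical.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call: dict) -> str:
    """Route a structured tool call to the matching local function."""
    if tool_call["name"] == "get_weather":
        return f"Weather for {tool_call['arguments']['city']}: sunny"
    raise ValueError(f"Unknown tool: {tool_call['name']}")

# What a tool-calling model is expected to emit for
# "What's the weather in Lagos?" -- structured output, not prose.
model_output = '{"name": "get_weather", "arguments": {"city": "Lagos"}}'
print(dispatch(json.loads(model_output)))
```

The model never runs the function itself; it only produces the structured call. That separation is what makes the capability a candidate for extraction into a small, specialized model.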
The significance of 26 million parameters cannot be overstated. For context, Daily Neural Digest tracks 515 AI models, and the vast majority of capable models today operate in the billions of parameters. A 26M model is roughly 0.026% the size of a 100-billion parameter model. This is the difference between requiring a datacenter GPU cluster and running comfortably on a smartphone or single-board computer. The sources do not specify the exact distillation technique used, but the implication is clear: the team at cactus-compute has found a way to isolate and replicate the tool-calling capability of Gemini without replicating Gemini's entire reasoning architecture.
This represents a fundamentally different approach to model development. Most of the industry has been chasing scale—bigger models, more parameters, more training data. Needle embodies the opposite philosophy: extreme specialization and extreme compression. The question is whether tool calling, which inherently requires understanding complex instructions and generating structured outputs, can survive such aggressive distillation. The team claims it can, and the open-source community will now be the judge.
The Gemini Ecosystem Context
To understand why Needle matters, you must understand what Google is doing with Gemini Intelligence right now. The announcements from Google's pre-I/O Android showcase paint a picture of a company embedding its AI into the operating system itself. The Verge's reporting emphasizes that Gemini is "all about controlling your phone" [3]. This isn't just about answering questions—it's about taking actions. Finishing a booking in Chrome. Generating widgets. Controlling apps. These are all tool-calling scenarios.
Google's strategy is clear: Gemini Intelligence is becoming the orchestration layer for the Android experience. Wired's coverage confirms that users will be able to "ask Gemini to finish a booking in Chrome on Android" [2]. This requires the AI to understand the user's intent, navigate the Chrome interface, interact with a booking system, and complete a transaction. That's a complex tool-calling pipeline, and it's exactly the kind of capability Needle claims to have distilled.
But here's where the strategic tension emerges. Google is building Gemini Intelligence as a cloud-dependent service. The TechCrunch piece notes that Gemini-powered dictation in Gboard will "initially launch with Samsung Galaxy and Google Pixel phones" [4], suggesting a controlled, premium rollout. The Verge confirms that Gemini Intelligence "brings the very best of Gemini to our most advanced Android devices" [3]. The implication is that not all devices will get the full experience, and that the most capable tool-calling features may require a persistent internet connection and access to Google's servers.
Needle, by contrast, is a 26M parameter model that could theoretically run entirely on-device, with no cloud dependency, no API costs, and no privacy concerns about sending user data to Google's servers. This follows the classic open-source disruption pattern: take a proprietary capability, distill it into a portable form factor, and make it freely available.
The Developer Friction and Deployment Reality
For developers building AI-powered applications, the current state of tool calling is a mess. Every major model provider—OpenAI, Anthropic, Google, Meta—has its own tool-calling format, its own API conventions, and its own pricing structure. If you want to build an application that can book a flight, update a calendar, send an email, and query a database, you're currently locked into whichever model provider you choose. The switching costs are enormous.
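The fragmentation is easiest to see in code. Each provider wraps the same logical tool call in a different envelope, so multi-provider applications end up maintaining adapters like the sketch below. The payload shapes are simplified approximations of the real APIs, not exact schemas.

```python
import json

def normalize(provider: str, payload: dict) -> dict:
    """Reduce provider-specific tool-call payloads to one internal shape."""
    if provider == "openai":
        # OpenAI-style: arguments arrive as a JSON-encoded string.
        return {"name": payload["function"]["name"],
                "args": json.loads(payload["function"]["arguments"])}
    if provider == "gemini":
        # Gemini-style: a functionCall object carrying an args dict.
        return {"name": payload["functionCall"]["name"],
                "args": payload["functionCall"]["args"]}
    if provider == "anthropic":
        # Anthropic-style: a tool_use block with an input dict.
        return {"name": payload["name"], "args": payload["input"]}
    raise ValueError(f"Unknown provider: {provider}")

print(normalize("gemini",
                {"functionCall": {"name": "send_email",
                                  "args": {"to": "a@b.c"}}}))
```

Every branch in that function is switching cost made visible; a single portable tool-calling model would let the internal shape become the only one that matters.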
Needle changes this calculus. By distilling Gemini's tool-calling capability into a standalone 26M model, the cactus-compute team has created something that could serve as a universal tool-calling engine. Developers could potentially use Needle as a lightweight front-end that handles tool invocation, while routing more complex reasoning tasks to larger models. This modular approach to AI architecture has been discussed for years but rarely implemented in practice.
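The routing idea can be sketched in a few lines. Nothing below is Needle's real interface; `small_model` and `large_model` are stubs standing in for an on-device tool-call model and a cloud reasoning model, under the assumption that the small model can signal when it is out of its depth.

```python
from typing import Optional

def small_model(prompt: str) -> Optional[dict]:
    """Stand-in for a tiny local model: emits a tool call, or None if unsure."""
    if "calendar" in prompt:
        return {"name": "create_event", "args": {"title": prompt}}
    return None  # defer to the big model

def large_model(prompt: str) -> str:
    """Stand-in for a cloud model handling open-ended reasoning."""
    return f"[cloud answer for: {prompt}]"

def route(prompt: str):
    call = small_model(prompt)  # near-zero marginal cost, runs on-device
    return call if call is not None else large_model(prompt)

print(route("add dentist to calendar"))    # handled locally as a tool call
print(route("explain quantum tunneling"))  # escalated to the cloud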
The sources do not specify Needle's inference speed, memory footprint, or hardware requirements. These details are not yet public. But the model size alone tells us something: 26 million parameters can fit in roughly 100 megabytes of memory at 32-bit precision, and far less with quantization. This is small enough to run on a smartphone, a Raspberry Pi, or even an embedded device. The implications for edge computing, IoT, and privacy-sensitive applications are substantial.
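The 100-megabyte figure is simple arithmetic, and it is worth seeing how quickly quantization shrinks it further. This counts weights only, ignoring activations and runtime overhead.

```python
# Back-of-envelope weight memory for a 26M-parameter model
# at common precisions (weights only).
PARAMS = 26_000_000

for name, bytes_per_param in [("fp32", 4), ("fp16", 2),
                              ("int8", 1), ("int4", 0.5)]:
    mb = PARAMS * bytes_per_param / 1e6
    print(f"{name}: {mb:.0f} MB")
# fp32 works out to ~104 MB, matching the rough 100 MB figure;
# 4-bit quantization drops it to ~13 MB.
```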
However, a critical question remains unanswered: how faithful is the distillation? Tool calling requires precise output formatting. A model that hallucinates function names, misorders parameters, or generates invalid JSON is worse than useless—it's dangerous. The sources do not provide benchmark results comparing Needle's tool-calling accuracy to Gemini's. Without this data, the model remains an intriguing proof of concept rather than a production-ready tool.
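This is why any production deployment of a distilled tool-caller would sit behind a validation layer that rejects malformed output before it reaches real systems. The guard below is a generic sketch with an invented tool registry, not anything from the Needle repository.

```python
import json

# Hypothetical registry of tools the application actually exposes.
KNOWN_TOOLS = {
    "book_flight": {"required": ["origin", "destination", "date"]},
}

def validate_tool_call(raw: str) -> dict:
    """Reject invalid JSON, hallucinated tool names, and missing parameters."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"invalid JSON: {e}") from e
    spec = KNOWN_TOOLS.get(call.get("name"))
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    missing = [p for p in spec["required"]
               if p not in call.get("arguments", {})]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    return call

# A well-formed call passes; a hallucinated tool name would raise.
validate_tool_call('{"name": "book_flight", "arguments": '
                   '{"origin": "LOS", "destination": "NBO", '
                   '"date": "2026-06-01"}}')
```

A guard like this turns a hallucinated function name from a silent failure into a recoverable error, but it cannot fix a model that misorders semantically valid parameters, which is why accuracy benchmarks matter.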
Winners, Losers, and the Shifting Economics of AI
The immediate winners in this scenario are developers and startups building AI-powered applications. If Needle delivers on its promise, it dramatically lowers the barrier to entry for tool-calling functionality. Instead of paying per-token API fees to Google or OpenAI for every tool invocation, developers could run Needle locally for effectively zero marginal cost. This is particularly relevant for applications that require high-frequency tool calls—think automated trading bots, monitoring systems, or personal assistants that interact with dozens of services throughout the day.
The losers are more nuanced. Google itself may not face direct harm from Needle—the company's strategy focuses on ecosystem lock-in, not per-token revenue. But the existence of a high-quality open-source tool-calling model undermines Google's narrative that Gemini Intelligence requires their cloud infrastructure. The Verge's coverage emphasizes that Gemini is being positioned as the central intelligence layer for Android [3]. If developers can replicate a key subset of that capability with a 26M open-source model, Google's value proposition becomes weaker.
The dictation startups that TechCrunch identified as potential casualties of Google's Gboard integration [4] might find an unexpected lifeline in Needle. If they can run tool-calling locally without paying Google's API fees, they could potentially compete on features rather than infrastructure costs. But this assumes Needle is reliable enough for production use—an assumption that remains unverified.
The broader industry trend here is the commoditization of AI capabilities. We've seen this pattern before with image recognition, natural language processing, and speech-to-text. Each capability starts as a proprietary, cloud-dependent service, then gets distilled into smaller, open-source models, and eventually becomes a standard library function that runs on any device. Tool calling appears to be following the same trajectory, and Needle may be the inflection point.
What the Mainstream Coverage Is Missing
The Wired, The Verge, and TechCrunch articles from yesterday all focus on Google's announcements—the features, the partnerships, the competitive dynamics with Apple and Samsung. None of them mention Needle, which is understandable given that it launched a day later. But the juxtaposition reveals something important that the mainstream coverage is missing: the most disruptive force in AI may not be the next frontier model from Google or OpenAI, but the distillation of existing capabilities into deployable form factors.
The Verge's piece frames Gemini Intelligence as a breakthrough because it can "control your phone" [3]. Wired emphasizes the convenience of having Gemini "finish a booking in Chrome" [2]. These are real capabilities that matter for end users. But from a developer perspective, the interesting question is not whether Google can build these features—it's whether anyone else can build them without Google's infrastructure. Needle suggests that the answer may be yes.
There is also a privacy angle that the mainstream coverage underplays. Google's Gemini Intelligence, as described in these articles, requires sending user data to Google's servers for processing. Even with on-device components, the most complex tool-calling operations—like completing a booking—likely require cloud inference. Needle, if it works as claimed, could enable the same capabilities entirely on-device, with no data leaving the user's phone. In an era of increasing regulatory scrutiny around data privacy, this is a significant advantage that the mainstream coverage has not addressed.
The sources also do not address the training data or licensing implications. Needle is described as a distillation of Gemini's tool-calling capability [1], but the sources do not specify whether this distillation used Google's APIs, whether it required access to Gemini's weights, or whether it was achieved through black-box distillation techniques. The legal and ethical questions around model distillation remain unresolved, and Needle may force a conversation that the industry has been avoiding.
The Strategic Bet
The cactus-compute team is making a specific bet: that tool calling is a separable capability that can be extracted, compressed, and deployed independently of the reasoning engine that originally generated it. This is not obvious. Tool calling requires understanding natural language, mapping it to structured function signatures, handling edge cases, and generating valid outputs. These are reasoning tasks, and it's not clear that they can be cleanly separated from general intelligence.
But if the bet pays off, the implications extend far beyond Needle itself. Every capability that can be distilled becomes a potential open-source building block. Code generation. Data analysis. Image understanding. Each of these could be extracted from larger models and packaged into lightweight, specialized models that run anywhere. The monolithic AI model becomes a collection of specialized modules, each optimized for a specific task, each small enough to deploy on edge devices.
This is the direction the industry has been moving, but Needle represents a significant acceleration. The sources do not specify whether the team plans to release the distillation methodology, the training code, or the evaluation benchmarks. These details will determine whether Needle is a one-off experiment or the beginning of a new paradigm.
For now, what we have is a 26-million parameter model that claims to do something that currently requires a multi-billion parameter cloud service. If it works, it changes the economics of AI deployment. If it doesn't, it's still a fascinating data point in the ongoing compression of intelligence. Either way, the conversation about what belongs in the cloud and what belongs on the device just got a lot more interesting.
The needle has been threaded. Now we wait to see what it sews.
References
[1] cactus-compute — Needle (GitHub repository) — https://github.com/cactus-compute/needle
[2] Wired — The Top New Features in Google’s Android 17—and Gemini Intelligence—Coming This Summer — https://www.wired.com/story/android-17-gemini-top-new-features/
[3] The Verge — Gemini’s latest updates are all about controlling your phone — https://www.theverge.com/tech/928724/gemini-intelligence-android-io-autofill
[4] TechCrunch — Google adds Gemini-powered dictation to Gboard, which could be bad news for dictation startups — https://techcrunch.com/2026/05/12/google-adds-gemini-powered-dictation-to-gboard-which-could-be-bad-news-for-dictation-startups/