Needle: A 26M Parameter Model That Runs on Your Phone and Calls Tools Faster Than GPT-4

Karify98 & Amy 🌸

Why Should You Care About a 26M Parameter Model?

In today's AI landscape, everyone is racing for more parameters: GPT-4 is widely reported to have hundreds of billions or more, and Gemini Ultra is rumored to be larger still (neither figure has been officially disclosed). But today on Hacker News, a project is getting massive attention by going in the completely opposite direction.

Needle, a 26 million parameter model (26M, not 26B), was just open-sourced by Cactus Compute. This model does exactly one thing: tool calling (function calling). And it does it remarkably fast: a reported 6000 tokens/sec for prefill and 1200 tokens/sec for decode on mobile devices.

For comparison, GPT-4 running on expensive server GPUs decodes at roughly 30-50 tok/s. By those figures, Needle's decode rate on a phone is about 24-40x higher, and its prefill rate is more than 100x that number.
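
To make those rates concrete, here is a rough back-of-the-envelope sketch. It assumes the prompt is processed at the prefill rate and the output is generated at the decode rate, and it ignores model loading and tokenization. The request size (300 prompt tokens, 30 output tokens) and the 40 tok/s server figure are illustrative assumptions, not measurements.

```python
def estimated_latency_ms(prompt_tokens, output_tokens,
                         prefill_tps=6000.0, decode_tps=1200.0):
    """Rough time for one request: prompt at the prefill rate, output at the decode rate."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return (prefill_s + decode_s) * 1000.0

# Hypothetical request: 300 tokens of tool definitions plus query, 30 tokens of JSON out.
needle_ms = estimated_latency_ms(300, 30)

# Decode time alone for the same 30 output tokens at an assumed 40 tok/s.
server_decode_ms = estimated_latency_ms(0, 30, decode_tps=40.0)

print(f"on-device estimate: {needle_ms:.0f} ms")
print(f"server decode only: {server_decode_ms:.0f} ms (before any network time)")
```

Under these assumptions the whole on-device call fits well inside the time a server would spend on decoding alone.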

What Is Tool Calling and Why Does It Matter?

Every time you use an AI agent (Claude Code calling a terminal, Cursor running tests, or Siri setting an alarm), there's a tool calling step: the model identifies which function the user wants to call, extracts parameters, and generates JSON output.

Example:

User: "What's the weather in San Francisco?"
→ Model calls: {"name": "get_weather", "arguments": {"location": "San Francisco"}}

This is a mandatory step in every agentic workflow. Without tool calling, AI is just a chatbot that can talk; it can't interact with the real world.
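
Once the model has emitted that JSON, the host application still has to act on it. The sketch below shows one minimal way to do so: parse the call and look the name up in a registry of ordinary functions. The `TOOLS` registry, the `dispatch` helper, and the stand-in `get_weather` body are hypothetical names for illustration, not part of Needle.

```python
import json

def get_weather(location: str) -> str:
    # Stand-in implementation; a real app would query a weather service here.
    return f"(weather lookup for {location})"

# Registry mapping tool names to the Python callables that implement them.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str):
    """Parse one tool call emitted by the model and run the matching function."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]        # raises KeyError if the tool name is unknown
    return func(**call["arguments"])  # JSON arguments become keyword arguments

model_output = '{"name": "get_weather", "arguments": {"location": "San Francisco"}}'
print(dispatch(model_output))
```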

The Key Insight: Tool Calling Doesn't Need FFN

This is the biggest insight from the Cactus Compute team, and the reason this post was written.

In a standard transformer, roughly 2/3 of parameters live in the FFN (Feed-Forward Network). The FFN's role is largely to "memorize" knowledge; it acts as the model's memory. But tool calling doesn't need that memory. It needs:

  1. Align: match the query to the right tool name
  2. Extract: pull parameter values from the input
  3. Assemble: build the JSON output

All three steps are retrieval-and-assembly: take information from the input and piece it together. This is exactly what cross-attention excels at, without needing an FFN.
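
To see why attention suits retrieval, here is a toy single-head scaled dot-product cross-attention in plain Python. It is a generic textbook illustration, not Needle's actual layer, and the keys, values, and query are made-up numbers. The query lines up with one key, so the output is almost exactly that position's value: the "look it up in the input" behavior described above.

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax over input positions
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return output, weights

# Toy encoder side: three input positions, each with a key and a value.
keys = [[8.0, 0.0], [0.0, 8.0], [-8.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.0, 8.0]  # decoder state that matches the key at position 1

output, weights = cross_attention(query, keys, values)
print("most attended position:", weights.index(max(weights)))
```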

The Cactus team designed an architecture called Simple Attention Networks: the entire model is just attention and gating, with no MLPs at all. A 12-layer bidirectional encoder processes the tool definitions, and an 8-layer causal decoder generates the output. The model uses dimension 512, 8 heads, and a vocabulary size of 8192.
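
The "roughly 2/3" figure quoted earlier can be checked with simple arithmetic for a textbook transformer layer. The sketch assumes four d x d attention projections and an FFN with the common 4x expansion, and it ignores biases and norms. It uses Needle's dimension of 512 only as an example input; it is not a count of Needle's own parameters, since the gating layers are not described here.

```python
def layer_param_counts(d_model: int, ffn_mult: int = 4):
    """Weight counts for one textbook transformer layer (biases and norms ignored)."""
    attention = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 2 * d_model * (ffn_mult * d_model)   # up-projection and down-projection
    return attention, ffn

attn, ffn = layer_param_counts(512)
share = ffn / (attn + ffn)
print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {share:.3f}")
```

Removing the FFN from such a layer therefore removes about two thirds of its weights, which is the saving the architecture is built around.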

Result: a model with only 26M parameters that beats much larger models on single-shot function calling:

  • FunctionGemma-270M (270M parameters)
  • Qwen-0.6B (600M parameters)
  • Granite-350M (350M parameters)
  • LFM2.5-350M (350M parameters)

Of course, larger models still excel at complex conversation and reasoning. But for single-shot tool calling, arguably the most common use case in production, Needle is good enough.

How Needle Was Trained

The training process is remarkably simple:

Step 1: Pretrain. 200B tokens on 16 TPU v6e, taking 27 hours. The model learns language and basic structure.

Step 2: Post-train. 2B tokens of synthesized function-calling data, taking only 45 minutes. The model learns how to call tools.

The notable part: the tool-calling data was produced by knowledge distillation. The team used Gemini 3.1 Flash Lite as a "teacher" to generate training data, then trained the small model on it. Total training cost is very low compared to large LLMs.
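
The training figures above can be turned into a few derived numbers. This is plain arithmetic on the quoted values. It assumes the 27 hours is wall-clock time on all 16 chips at once; that reading is an assumption, not something the source states.

```python
PARAMS = 26e6              # model size
PRETRAIN_TOKENS = 200e9    # Step 1
PRETRAIN_HOURS = 27
CHIPS = 16                 # TPU v6e chips (assumed to run for the full 27 hours)
POSTTRAIN_TOKENS = 2e9     # Step 2
POSTTRAIN_MINUTES = 45

chip_hours = CHIPS * PRETRAIN_HOURS
pretrain_tps = PRETRAIN_TOKENS / (PRETRAIN_HOURS * 3600)
posttrain_tps = POSTTRAIN_TOKENS / (POSTTRAIN_MINUTES * 60)
tokens_per_param = PRETRAIN_TOKENS / PARAMS

print(f"pretraining budget: {chip_hours} chip-hours")
print(f"pretraining throughput: {pretrain_tps:,.0f} tokens/sec across the cluster")
print(f"post-training throughput: {posttrain_tps:,.0f} tokens/sec")
print(f"pretraining tokens per parameter: {tokens_per_param:,.0f}")
```

The tokens-per-parameter ratio is the striking part: the model sees thousands of training tokens for every weight it has.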

The dataset covers 15 tool categories: timers, messaging, navigation, smart home, and more. These are real-world use cases for mobile devices.

Why Edge AI Tool Calling Is a Major Trend

1. Latency: The Deciding Factor for UX

When you use Siri or Google Assistant, 500ms-2s of latency is already frustrating. With a model running locally on your phone, the network round trip disappears and only the model's own inference time remains. No server queue, no rate limits.

2. Privacy: Data Never Leaves the Device

Every time you say "call my wife" or "schedule an appointment," that data gets sent to the cloud. With on-device models, everything stays on your phone. This is a huge selling point for enterprise and healthcare.

3. Offline: AI Without Internet

On a plane, in remote areas, or simply when WiFi drops, the on-device model still works. Local tool calling can handle alarms, calculators, unit conversions, and file management without any network connection.

4. Cost: No API Fees

Every GPT-4 API call costs money. With a local model, you can call it as many times as you want. This is especially important for developers building apps with millions of users.

Practical Applications for Developers

Phone Voice Assistant

Instead of sending audio to the cloud → transcribe → process → respond (3-4 round trips), you can now run the entire pipeline locally. Needle handles tool calling, speech-to-text models run on the device as well, and results arrive in near real time.

IoT and Wearables

Smartwatches, earbuds, AR glasses: many now have chips powerful enough to run a 26M model. Imagine: "Hey, set a 3 PM appointment" → the model calls the calendar API right on your watch, no phone needed.

App Development

Developers can integrate Needle into mobile apps instead of calling third-party APIs. Lower latency, lower cost, better privacy. With an MIT license and weights on Hugging Face, you can fine-tune for specific use cases.

from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
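
Because the result is just text, an app would typically check it before acting on it. The sketch below is one hedged way to do that with the standard library. It reuses the tool-definition and output formats shown in the example above, but `validate_tool_calls` is a hypothetical helper, not part of the needle package. It only checks that tool names and argument names are declared; the schema shown does not say which arguments are required, so it does not check for missing ones.

```python
import json

def validate_tool_calls(raw_output, tools_json):
    """Return the list of parsed calls if every call matches a declared tool, else None."""
    try:
        parsed = json.loads(raw_output)
        tools = json.loads(tools_json)
    except json.JSONDecodeError:
        return None
    # The example output is a list of calls; also accept a single object.
    calls = parsed if isinstance(parsed, list) else [parsed]
    if not calls:
        return None
    # Map each declared tool name to the set of parameter names it accepts.
    declared = {tool["name"]: set(tool["parameters"]) for tool in tools}
    for call in calls:
        if not isinstance(call, dict):
            return None
        name, args = call.get("name"), call.get("arguments")
        if not isinstance(name, str) or name not in declared:
            return None
        if not isinstance(args, dict) or not set(args) <= declared[name]:
            return None
    return calls

TOOL_DEFS = '[{"name":"get_weather","parameters":{"location":"string"}}]'
good = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
bad = '[{"name":"get_wether","arguments":{"location":"San Francisco"}}]'  # misspelled tool

print(validate_tool_calls(good, TOOL_DEFS) is not None)  # True
print(validate_tool_calls(bad, TOOL_DEFS) is not None)   # False
```

When validation fails, an app could ask the user to rephrase or fall back to a larger model.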

Fine-Tuning for Your Own Use Case

Another strong point: you can fine-tune Needle on your personal Mac/PC with just a few commands:

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

The playground opens a web UI at http://127.0.0.1:7860 where you can define tools, generate data, train, and evaluate, all in one interface. No GPU cluster needed, no cloud required.

Caveats: Things to Keep in Mind

This post isn't meant to be a PR piece for Cactus Compute. A few honest points:

First, Needle only excels at single-shot function calling. If you need multi-turn conversation, complex reasoning, or robust handling of edge cases, larger models still outperform it.

Second, "smaller = better" isn't always true. Small models can be "finicky": sensitive to input format and prone to failure on edge cases that larger models handle easily.

Third, 6000 tok/s is impressive, but it was measured on specific hardware, and real-world phone benchmarks are unclear. Production performance may differ.

Fourth, the "no FFN" architecture is still experimental. The detailed paper hasn't been published yet, so there's no peer review.
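
Given the third caveat, the safest number is one you measure on your own target device. Below is a minimal timing harness, offered as a sketch. `measure_tokens_per_second` and `fake_generate` are hypothetical names, and the fake generator only stands in for a real model call that returns a sequence of tokens. The printed rate says nothing about Needle itself.

```python
import time

def measure_tokens_per_second(generate_fn, prompt, runs=5):
    """Call generate_fn(prompt) several times and return the best observed tokens/sec."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(tokens) / elapsed)
    return best

def fake_generate(prompt):
    # Placeholder for a real model call; sleeps briefly and returns a list of "tokens".
    time.sleep(0.01)
    return prompt.split()

rate = measure_tokens_per_second(fake_generate, "set a timer for ten minutes")
print(f"{rate:.0f} tokens/sec (placeholder generator, not Needle)")
```

Taking the best of several runs reduces noise from warm-up and background activity. Reporting the median as well gives a more conservative picture.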

Conclusion: Small Model, Big Implications

Needle isn't a model that will replace GPT-4 or Claude. But it represents an important trend: AI doesn't have to be large to be useful.

When tool calling, the core component of every AI agent, can run in a 26M parameter model on the device itself, the door for edge AI agents is wider than ever. Phones, watches, glasses, earbuds: all can become AI agents.

The question for developers: are you ready to build apps with local AI instead of depending on cloud APIs?

