Reflection - Assignment #4 GitHub Repository

Running Large Language Models on iPhones

How Apple's open-source MLX framework lets you run AI models directly on your phone, with no cloud and no internet.

How AI Actually Works Right Now

I think most people do not really consider what happens when they type a question into ChatGPT or Gemini. Your words do not get processed on your phone or laptop. They get sent across the internet to a server, a powerful computer inside a warehouse-sized data center, where the AI model breaks your text into tokens, small chunks of words it can process. The word "understanding" becomes "understand" and "ing." A short question might be 10 to 20 tokens, and every one of them gets transmitted to the server, processed, and sent back. That is your data leaving your device every time you send a message, whether you realize it or not.
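The splitting step can be illustrated in a few lines of Python. Real tokenizers (such as BPE, used by most LLMs) learn their vocabulary from data; the toy version below only peels off a few hand-picked English suffixes, but it shows the basic idea of breaking words into reusable subword chunks:

```python
# Toy illustration of subword tokenization. Real tokenizers (e.g. BPE)
# learn their vocabulary from data; this hand-written version only
# splits off a few common suffixes to mimic the idea.
COMMON_SUFFIXES = ("ing", "ed", "ly", "tion")

def toy_tokenize(text: str) -> list[str]:
    """Split text into words, then peel a known suffix off long words."""
    tokens = []
    for word in text.lower().split():
        for suffix in COMMON_SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 3:
                tokens.extend([word[: -len(suffix)], suffix])
                break
        else:
            tokens.append(word)
    return tokens

print(toy_tokenize("understanding"))  # -> ['understand', 'ing']
```

A real vocabulary has tens of thousands of learned chunks rather than four hard-coded suffixes, but every token it produces makes the same round trip to the server in the cloud setup.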

What this means is that you need an internet connection, someone else's hardware is reading everything you type, and the company running those servers can store, analyze, or sell that data. There is, however, a fundamentally different approach gaining traction: on-device AI. Instead of relying on a remote server, you run the model directly on the phone or laptop in front of you. Nothing leaves the device. Apple has been quietly building the tools for this through an open-source framework called MLX, and in my opinion it is further along than most people realize.

[Figure: Cloud vs. on-device AI — with the cloud, every word you type travels through servers you don't control; on-device, everything stays local.]

What Is MLX?

MLX is a machine learning framework Apple open-sourced under the MIT license, designed specifically for Apple Silicon. The reason Apple Silicon matters comes down to RAM, which is the short-term workspace your device uses to hold whatever it is actively working with. I find the easiest way to think about it is like a desk: the bigger the desk, the more you can spread out at once. An AI model needs to fit entirely in RAM while it runs, which makes RAM size the single most important constraint for on-device AI.

On most computers, the CPU and GPU have separate memory pools, and data gets copied back and forth between them inefficiently. Apple's chips use unified memory, where both share one pool, so nothing needs to be copied. MLX was built to take advantage of this directly. There is also a growing MLX Community on Hugging Face with thousands of models already converted and ready to use.

[Figure: Traditional separate CPU and GPU memory with slow copying vs. Apple Silicon's shared unified memory, where no copying is needed.]

How It Works in Practice

This is not theoretical. In January 2025, developer Christopher Charles (@cristofrcharles) demonstrated a language model running on an iPhone using Apple's mlx-swift-examples app, and Apple engineer Awni Hannun published a technical write-up explaining how it works. The process is fairly straightforward: you download an open-source model like Microsoft's Phi-4 to the phone's storage once, and from there the app loads the model weights into unified memory and generates responses one token at a time using just the phone's chip. No server is involved at any point, and it works in airplane mode.
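The token-at-a-time loop can be sketched in plain Python. Everything named here is a stand-in, not MLX's actual API: `next_token` fakes the model's forward pass with a canned reply. But the control flow — one model call per token, each output fed back into the context — is the real shape of autoregressive generation:

```python
# Conceptual sketch of autoregressive generation: the model is called
# once per token, each call conditioned on everything generated so far.
def next_token(context: list[str]) -> str:
    """Toy stand-in for a real forward pass: returns a canned reply."""
    reply = ["on", "-", "device", "inference", "works", "offline", "<eos>"]
    idx = len(context) - 1
    return reply[idx] if idx < len(reply) else "<eos>"

def generate(prompt: list[str], max_tokens: int = 16) -> list[str]:
    context = list(prompt)
    output = []
    for _ in range(max_tokens):
        token = next_token(context)  # one forward pass per token
        if token == "<eos>":         # the model signals it is finished
            break
        output.append(token)
        context.append(token)        # feed the new token back in
    return output

print(generate(["hello"]))
```

On a phone, each `next_token` call means reading all of the model's weights out of unified memory once, which is why memory bandwidth, not raw compute, usually sets the tokens-per-second rate.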

Making Models Fit

A model like Phi-4 has billions of parameters stored as 16-bit numbers, which adds up to far more memory than any phone has. The workaround is quantization, which compresses those weights down to 4-bit integers. I think the simplest comparison is compressing a high-resolution photo into a JPEG: the file shrinks dramatically and still looks fine, even though some detail is lost. Quantization cuts memory usage by about 75%, so what would normally need 8 GB fits into around 2 GB.
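The arithmetic is simple enough to write down, and so is a minimal version of the quantization step itself. The sketch below is a bare-bones symmetric quantizer for one group of weights; production schemes (like MLX's grouped quantization) add per-group scales and offsets, but the shrink-and-round idea is the same:

```python
# Back-of-the-envelope memory math: bits per weight drives model size.
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(model_memory_gb(4, 16))  # 16-bit weights: ~8 GB
print(model_memory_gb(4, 4))   # 4-bit weights: ~2 GB, a 75% reduction

# A minimal symmetric 4-bit quantizer for one group of weights.
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 7  # signed 4-bit range: -8..7
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

q, s = quantize_4bit([0.12, -0.51, 0.33, 0.70])
print(dequantize(q, s))  # close to the originals, minus rounding loss
```

The dequantized values come back slightly off from the originals — that rounding loss is the "lost JPEG detail" of the analogy, and it is the price of fitting an 8 GB model into 2 GB.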

Models are now being designed with this constraint in mind from the start. Google's Gemma 4 (April 2026) includes variants built specifically for edge devices, with the smallest being around 2 billion parameters, small enough to run on a Raspberry Pi, a barebones credit-card-sized computer whose 8 GB configuration costs well under $100. If a hobbyist board with 8 GB of RAM can handle a real language model, an iPhone with the same RAM and a far more powerful chip manages it easily. These models are released with openly available weights, and when companies of Google's caliber are investing this heavily in making models run on hardware this small, I believe it says a great deal about the direction things are heading.

Why Bother?

The benefits are fairly self-evident: privacy, speed, offline access, and zero cost since both the models and framework are open source. But there is also an environmental angle that I think deserves attention. Cloud AI runs on data centers that consume enormous amounts of electricity and water. A single ChatGPT query uses roughly ten times the energy of a Google search, and the cooling systems for those GPU clusters drain millions of gallons of water per year. When you run a model on your own device, none of that infrastructure is involved.

Privacy

Nothing leaves the phone. No logs, no third-party access to your prompts.

Speed

No network latency. Responses generate locally without waiting on servers.

Offline

Works in airplane mode, on the subway, anywhere without a connection.

Cost

No API fees or subscriptions. The models and framework are free.

Environment

No data centers burning electricity and water. Your chip draws watts, not megawatts.

What Can Run on Your iPhone?

Not every iPhone can handle this equally well. RAM is the constraint, and different iPhones have different amounts.

| Device            | Chip        | RAM  | Max Model (4-bit) | Experience    |
|-------------------|-------------|------|-------------------|---------------|
| iPhone 16 Pro     | A18 Pro     | 8 GB | ~7B parameters    | Excellent     |
| iPhone 15 Pro     | A17 Pro     | 8 GB | ~7B parameters    | Excellent     |
| iPhone 15         | A16         | 6 GB | ~3B parameters    | Moderate      |
| iPhone 14         | A15         | 6 GB | ~3B (tight)       | Limited       |
| iPhone 13 & older | A15 / older | 4 GB | Not practical     | Not supported |
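The table follows from a rough rule of thumb, which I want to flag as my own assumption rather than an official Apple formula: 4-bit weights cost about 0.5 bytes per parameter, and iOS plus the context cache claim a few GB of RAM before the model even loads. A reserve of roughly 4.5 GB reproduces the table's numbers:

```python
# Rough rule of thumb behind the device table (an assumption, not an
# official formula): 4-bit weights take ~0.5 bytes per parameter, and
# the OS plus the context cache reserve a few GB up front.
def max_params_billions(ram_gb: float, reserved_gb: float = 4.5) -> float:
    """Largest 4-bit model, in billions of parameters, that fits in RAM."""
    usable_bytes = max(ram_gb - reserved_gb, 0) * 1e9
    return usable_bytes / 0.5 / 1e9  # 0.5 bytes per 4-bit weight

for ram in (8, 6, 4):
    print(f"{ram} GB RAM -> ~{max_params_billions(ram):.0f}B parameters")
```

With these assumed numbers, 8 GB phones land at ~7B parameters, 6 GB phones at ~3B, and 4 GB phones at zero — which matches why the iPhone 13 and older are listed as not practical.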

The Cost of Running AI on Your Phone

The benefits are genuine, but running a model locally has trade-offs I think are worth being honest about. A Greenspector study found that local inference consumes roughly 29 times more energy per query than sending it to ChatGPT. Phones running models like Llama 3.2 drained completely in under two hours of continuous use. A phone that normally lasts all day could die before lunch.

There is also the issue of thermal throttling. Phones have no fans, just a thin metal chassis dissipating heat passively. A 2026 study on mobile LLM inference showed the iPhone 16 Pro drops from 40 tokens per second to 22 within just two inference rounds, a 44% reduction, spending about 65% of sustained use in a throttled state. On top of that, even the newest iPhones max out at 8 GB of RAM, which limits you to 3 to 7 billion parameter models after quantization. Something on the scale of GPT-4 is nowhere close to feasible on a phone.

What I Learned Trying This Myself

I spent two weeks building a fully local AI setup on my MacBook Pro with an M2 Pro chip and 16 GB of RAM. I ran Qwen 2.5, a 7-billion-parameter model, through MLX, wired it into an agent framework with tool calling, and tried to use it as a replacement for ChatGPT. It worked end-to-end, and it was not good enough. Basic conversation and factual recall were fine, but anything requiring real reasoning fell apart. The model fabricated plausible-looking URLs instead of admitting it needed a different tool. Given conflicting information from three sources, it listed all three equally instead of evaluating which was correct. These are not problems one can fix with better prompts. They are capability limits baked into the model's size, and for complex tasks the gap between a 7B model and something like Claude or GPT-4 is not 10x, it is closer to 100x.

The RAM ceiling was lower than I expected, too. Once you account for macOS consuming 5 to 6 GB, model weights at 4 to 5 GB, and the context cache at 2 GB or more, having a browser open at the same time was enough to make the system page to disk and stall responses entirely.
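That budget is worth writing out as plain arithmetic. The line items below are the estimates from my own setup (midpoints of the ranges I observed, not measured values):

```python
# The RAM budget from my 16 GB MacBook Pro setup, as simple arithmetic.
# Line items are my own estimates, not measured values.
TOTAL_RAM_GB = 16.0
budget = {
    "macOS + background apps": 5.5,   # midpoint of the 5-6 GB estimate
    "4-bit model weights (7B)": 4.5,  # midpoint of the 4-5 GB estimate
    "context (KV) cache": 2.0,        # grows with conversation length
}
free_gb = TOTAL_RAM_GB - sum(budget.values())
print(f"Left over for everything else: {free_gb:.1f} GB")
```

Four gigabytes sounds like headroom until you remember that a modern browser alone can claim more than that, at which point the system starts swapping and token generation stalls.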

What actually ended up being useful was far simpler than what I built. Apps like Osaurus on Mac and Locally AI on iPhone let you download MLX-optimized models like Gemma 4, Qwen, and Llama and start chatting immediately. No framework, no configuration. I run Osaurus with Gemma 4 on my Mac daily now and it handles everyday questions well.

The honest conclusion is that local AI excels at short, private, offline interactions and is not convincing when it tries to be more. The right approach is hybrid: local models for quick questions, cloud models for anything demanding real reasoning. Apple featured MLX at WWDC 2025, models keep getting more efficient, and what needed 16 GB two years ago runs in 4 now. The gap is narrowing, but I believe being straightforward about where things stand today matters more than overpromising where they might go.

And there is one more thing about the hybrid model that circles back to the environmental question: every time you use a local model for a simple question instead of sending it to a cloud service, that is one less query hitting a data center, one less request requiring rows of GPUs cooled by thousands of gallons of water. You do not have to feel guilty about asking a simple question when the infrastructure behind it is just a chip in your pocket drawing a few watts, not a warehouse burning through electricity and freshwater to give you the same answer.

Sources & References

  1. Hannun, Awni. "Running LLMs on iPhone with MLX." GitHub Gist, Jan. 2025. gist.github.com/awni/fe4f96c21ead68e60191190cbc1c129b
  2. Charles, Christopher (@cristofrcharles). Original walkthrough post. X (Twitter), Jan. 2025. x.com/cristofrcharles/status/1882359388444438573
  3. Apple. "MLX Swift Examples." GitHub repository. github.com/ml-explore/mlx-swift-examples
  4. Apple. "MLX — Machine Learning Framework." Open Source at Apple. opensource.apple.com/projects/mlx
  5. MLX Community. Pre-converted models for Apple MLX. Hugging Face. huggingface.co/mlx-community
  6. Google DeepMind. "Gemma 4." Apr. 2026. deepmind.google/models/gemma/gemma-4
  7. Locally AI. "Run AI models locally on your iPhone, iPad, and Mac." App Store. apps.apple.com/us/app/locally-ai-local-ai-chat
  8. Osaurus. "Own your AI — Local-first AI runtime for Apple Silicon." osaurus.ai
  9. Greenspector. "What is the environmental impact of local AI on our smartphones?" 2025. greenspector.com/en/artificial-intelligence-smartphone-autonomy
  10. "LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load." arXiv, 2026. arxiv.org/html/2603.23640v1
  11. MIT News. "Explained: Generative AI's environmental impact." Jan. 2025. news.mit.edu/2025/explained-generative-ai-environmental-impact-0117