The Big Picture
Every time you ask an AI something, your words travel across the internet to a warehouse full of GPUs, get processed, and come back. That round trip is how nearly all AI works today. It needs a connection, it costs money, and your data passes through servers you don't control. You're trusting a company with every prompt you send.
There's a different approach gaining traction: on-device AI, which is exactly what it sounds like. Instead of relying on a remote server, you run the model on the phone or laptop sitting in front of you. Apple has been quietly building the tools to make this work through an open-source framework called MLX, and it's further along than most people realize.
Cloud vs. on-device AI
What Is MLX?
MLX is a machine learning framework Apple built and released as open source under the MIT license. It's designed around Apple Silicon, the chips in every modern iPhone, iPad, and Mac.
The reason Apple Silicon matters here comes down to how memory works. Most computers keep the CPU and GPU in separate memory pools. When an AI model needs the GPU to process something, data gets copied back and forth between those pools, which is slow and wasteful. Apple's chips use what's called unified memory, where the CPU and GPU share one pool. Nothing needs to be copied. MLX is built to take advantage of that, so running AI workloads on Apple hardware avoids a bottleneck that slows things down on other platforms.
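To make the copy cost concrete, here's a back-of-envelope sketch. The 32 GB/s figure is an assumption (roughly PCIe 4.0 x16 bandwidth on a discrete-GPU system), and the 2 GB model size matches the quantized models discussed later:

```python
# Back-of-envelope: time spent copying model weights from CPU memory to
# GPU memory on a discrete-GPU system, versus zero copies when both
# processors share one unified pool.

PCIE_BANDWIDTH_GBPS = 32   # assumed host-to-GPU transfer rate, GB/s
MODEL_SIZE_GB = 2          # a 4-bit quantized model, as discussed below

# Discrete memory pools: weights cross the bus at least once per load.
copy_time_s = MODEL_SIZE_GB / PCIE_BANDWIDTH_GBPS

# Unified memory: CPU and GPU address the same pool, so no transfer.
unified_time_s = 0.0

print(f"copy over PCIe: {copy_time_s * 1000:.1f} ms")   # 62.5 ms
print(f"unified memory: {unified_time_s * 1000:.1f} ms")
```

The one-time load is only part of the story; on discrete systems intermediate data can also bounce between pools during inference, which unified memory avoids entirely.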
Traditional separate memory vs. Apple's unified memory
The framework supports Python, Swift, C++, and C. Swift is the important one for phones since it's what iOS apps are built with, so developers can write native iPhone apps that run ML models through MLX directly. It uses Metal, Apple's GPU programming interface, for the actual hardware acceleration. There's also a growing MLX Community on Hugging Face with thousands of models already converted and ready to use, which lowers the barrier to entry considerably.
Popping the Hood Open
People are already running LLMs on iPhones. In January 2025, a developer named Christopher Charles (@cristofrcharles) posted a demo of it working using Apple's mlx-swift-examples app. Awni Hannun, an engineer on the MLX team at Apple, followed up with a more detailed write-up explaining the technical side.
The way it works is fairly straightforward. You download an open-source model like Microsoft's Phi-4 from Hugging Face to the phone's storage. That's a one-time download. From there, the app loads the model's weights into the device's unified memory, takes whatever you type, breaks it into tokens the model can process, and generates a response one token at a time using just the phone's chip. No server involved. It works in airplane mode.
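The loop described above can be sketched in a few lines. This is a conceptual stand-in, not the MLX API: the tokenizer and model here are toys, but the shape of the pipeline (tokenize, predict one token per forward pass, detokenize) is the same one a real on-device LLM runs:

```python
# Conceptual sketch of the on-device generation loop: tokenize the
# prompt, generate one token at a time, detokenize the result.
# The "model" is a dummy stand-in, not a real network.

def tokenize(text: str) -> list[int]:
    """Toy tokenizer: one token id per character."""
    return [ord(c) for c in text]

def detokenize(tokens: list[int]) -> str:
    return "".join(chr(t) for t in tokens)

def dummy_model(context: list[int]) -> int:
    """Stand-in for a forward pass: just echoes the last token.
    A real model scores every token in its vocabulary and samples one."""
    return context[-1]

def generate(prompt: str, max_tokens: int = 4) -> str:
    context = tokenize(prompt)        # prompt -> token ids
    output: list[int] = []
    for _ in range(max_tokens):       # one forward pass per new token
        next_token = dummy_model(context + output)
        output.append(next_token)
    return detokenize(output)

print(generate("hi"))  # "iiii" -- the dummy model repeats itself
```

The key property is in the loop: each new token requires a full pass over the model's weights, which is why weight size and memory bandwidth dominate on-device performance.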
Making Models Fit
A model like Phi-4 has billions of parameters stored as 16-bit floating point numbers, which adds up to far more memory than any phone has. The workaround is quantization, which compresses those weights down to 4-bit integers. That cuts memory usage by about 75% without much impact on the quality of responses. A 4-billion-parameter model that would normally require 8 GB fits into around 2 GB, which is how you get a real language model running on a device with limited RAM.
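The arithmetic behind those numbers is simple enough to write down. This sketch counts only weight storage and ignores activations and the KV cache, which add real overhead on top:

```python
# Quantization arithmetic: memory needed to hold a model's weights at
# 16-bit floats versus 4-bit integers (weights only; activations and
# the KV cache need additional memory on top of this).

def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """GB of storage for n_params weights at the given precision."""
    return n_params * bits_per_weight / 8 / 1e9

params = 4e9  # a 4-billion-parameter model

fp16_gb = weight_memory_gb(params, 16)   # 8.0 GB
int4_gb = weight_memory_gb(params, 4)    # 2.0 GB
savings = 1 - int4_gb / fp16_gb          # 0.75 -> the "about 75%" cut

print(f"fp16: {fp16_gb:.1f} GB, int4: {int4_gb:.1f} GB, saved {savings:.0%}")
```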
Models are also being designed with this constraint in mind from the start. Google's Gemma 3n, released in 2025, includes E2B and E4B variants built specifically for edge devices like phones. The E2B model has around 2 billion effective parameters and is small enough to run on a Raspberry Pi, let alone an iPhone. These aren't stripped-down afterthoughts either. Google designed them with per-layer embeddings to maximize quality at small sizes, and they're released under Google's open Gemma license so anyone can use them. The fact that companies like Google are investing in making their best models run on phones says a lot about where this is heading.
Why?
The practical benefits are pretty obvious once you think about them. Your data never leaves the device, so there are no server logs and no third party storing your conversations. Responses come back faster because there's no network round trip. It works without internet, whether you're on a plane or somewhere with bad service. And it's free, no subscription, no API costs, since both the models and the framework are open source.
Privacy
Nothing leaves the phone. No logs, no third-party access to your prompts.
Speed
No network latency. Responses generate locally without waiting on servers.
Offline
Works in airplane mode, on the subway, anywhere without a connection.
Cost
No API fees or subscriptions. The models and framework are free.
The larger implication is about access. Using AI right now usually requires either a paid subscription or technical know-how to work with APIs and cloud infrastructure. On-device AI removes that barrier entirely. If you have a recent iPhone, you already have the hardware.
What Can Run on Your iPhone?
Not every iPhone can handle this equally. RAM is the main constraint since bigger models need more of it, and different iPhones have different amounts.
| Device | Chip | RAM | Max Model (4-bit) | Experience |
|---|---|---|---|---|
| iPhone 16 Pro | A18 Pro | 8 GB | ~7B parameters | Excellent |
| iPhone 15 Pro | A17 Pro | 8 GB | ~7B parameters | Excellent |
| iPhone 15 | A16 | 6 GB | ~3B parameters | Moderate |
| iPhone 14 | A15 | 6 GB | ~3B (tight) | Limited |
| iPhone 13 & older | A15 / older | 4 GB | Not practical | Not Supported |
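The table's "Max Model (4-bit)" column follows from a rule of thumb. The 3 GB reserved for iOS and other apps is an assumption for illustration; real headroom varies by device and OS version, and the KV cache and activations eat into it further, which is why practical limits land below this estimate:

```python
# Rough rule of thumb behind the "Max Model (4-bit)" column. The OS
# overhead figure is an assumption; KV cache and activations reduce
# the practical limit further (e.g. ~7B rather than 10B on 8 GB).

OS_OVERHEAD_GB = 3.0     # assumed RAM kept back for the system and apps
GB_PER_BILLION = 0.5     # 4-bit quantization = half a byte per parameter

def max_params_billions(ram_gb: float) -> float:
    """Largest 4-bit model (billions of params) whose weights fit."""
    usable_gb = max(ram_gb - OS_OVERHEAD_GB, 0.0)
    return usable_gb / GB_PER_BILLION

print(max_params_billions(8))  # 10.0 -> ~7B once inference overhead counts
print(max_params_billions(4))  # 2.0  -> why older iPhones aren't practical
```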
The Cost of Running AI on Your Phone
The privacy and offline benefits are real, but running a language model locally comes with trade-offs that are worth being honest about. This isn't free in the way it might seem at first glance.
Battery drain is the most immediate one. A study by Greenspector found that running a local model consumed roughly 29 times more energy than sending the same query to a cloud-based model like ChatGPT. In their tests, phones running local models like Llama 3.2 and Gemma 2 drained completely in under two hours of continuous use. That's not a typo. A phone that lasts you a full day under normal use might die before lunch if you're running an LLM on it steadily.
Then there's thermal throttling. Phones aren't designed to sustain heavy compute loads the way a laptop or desktop is. They're passively cooled, meaning there's no fan, just a thin metal chassis trying to dissipate heat. Research from a 2026 study on mobile LLM inference showed that the iPhone 16 Pro starts at around 40 tokens per second but drops to about 22 tokens per second within just two inference rounds, a 44% reduction. The phone spends about 65% of sustained use in a thermally throttled state. In plain terms, the longer you use it, the slower it gets, because the phone is actively protecting itself from overheating.
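What those throughput numbers mean for a user is easy to work out. Using the figures cited above, here's the wall-clock time for a typical response before and after throttling kicks in (the 500-token response length is an illustrative assumption):

```python
# What thermal throttling costs in practice: time to generate a
# 500-token response at the initial rate versus the sustained rate.
# The response length is an illustrative assumption.

INITIAL_TPS = 40      # tokens/second reported for a cool iPhone 16 Pro
THROTTLED_TPS = 22    # tokens/second once thermal limits kick in
RESPONSE_TOKENS = 500

cool_time_s = RESPONSE_TOKENS / INITIAL_TPS     # 12.5 s
hot_time_s = RESPONSE_TOKENS / THROTTLED_TPS    # ~22.7 s

print(f"cool: {cool_time_s:.1f} s, throttled: {hot_time_s:.1f} s")
```

Nearly doubling the wait for the same answer is the kind of degradation users notice immediately, which is why sustained-load cooling matters for this use case.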
There's also the question of long-term hardware wear. Batteries have a finite number of charge cycles, typically between 500 and 1,000 before they degrade noticeably. Running intensive ML workloads drains the battery faster, which means you're cycling through those charges more quickly and shortening the overall lifespan of the battery. The GPU itself isn't immune either. Sustained high-temperature operation accelerates wear on the logic board and other components. This is the kind of damage that builds up gradually and isn't obvious until the phone starts behaving differently a year or two down the line.
Memory is another hard ceiling. Even the newest iPhones top out at 8 GB of RAM, and the operating system and other apps need some of that. In practice, you're limited to models in the 3 to 7 billion parameter range after quantization. Something like GPT-4, which is orders of magnitude larger, is nowhere close to running on a phone.
Where Things Are Heading
On the accessibility side, things have improved faster than expected. You no longer need Xcode or a developer account to try this. Apps like Locally AI are on the App Store and let you download and run models like Gemma, Llama, and Qwen directly on your iPhone with no setup at all. You pick a model, it downloads, and you start chatting. No account, no configuration. For most people this is the simplest way to experience on-device AI today.
The hardware constraints are real but they're not static. Apple featured MLX at WWDC 2025, signaling this is a priority for them. Models keep getting more efficient, quantization techniques keep improving, and what needed 16 GB of memory two years ago runs in 4 GB now. Apple's upcoming iPhone 17 Pro is expected to include vapor chamber cooling specifically to handle sustained AI workloads without throttling as aggressively. The gap between what a phone can do and what a data center can do is narrowing with every generation.
Your phone isn't replacing cloud AI tomorrow. But for short, private, offline interactions on hardware you already own, it works right now. The limitations are real, and anyone using this should understand what it costs their device. But the trajectory is clear, and it's moving fast.