8/11/2025

Why Your Ollama is Running Painfully Slow on a Mac (And How to Fix It)

So, you've jumped into the world of local LLMs with Ollama on your Mac. It’s pretty amazing to have that kind of power running right on your own machine, right? But then it happens. You ask it a question & you wait… and wait… and your fan starts screaming, your mouse gets choppy, & the whole experience feels less like the future & more like a dial-up modem.
If you’re pulling your hair out wondering why Ollama is so slow, you're not alone. Turns out, there are a few common culprits, especially on Macs. The good news is, most of them are fixable. Let's get into it.

The Elephant in the Room: Your Mac’s Hardware

Honestly, the biggest reason for slow Ollama performance often comes down to your Mac's specs. These models are HUGE & they need a lot of resources.

RAM is King

Here's the thing: language models need to be loaded into your computer's memory to run. If you have an M1 or M2 Mac with 8GB of RAM, you're going to have a tough time with the bigger, more capable models.
Think about it. The operating system needs a chunk of that RAM, any other apps you have open (like your web browser) need some, & then Ollama comes in & asks for a massive slice of the pie. A model like Llama 3’s 8-billion parameter version can take up over 5.5GB of memory all by itself. On an 8GB machine, that leaves almost nothing for anything else.
When your Mac runs out of physical RAM, it starts using your super-fast SSD as "swap" memory. This is way better than the old spinning hard drives, but it's still MUCH slower than actual RAM. The result? Your whole computer grinds to a halt while it frantically shuffles data back & forth.
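Before changing anything, it's worth confirming that memory really is the bottleneck. Here's a quick sketch of the built-in tools you can run from Terminal (the exact output format varies a bit between macOS & Ollama versions):

  # See which models Ollama has loaded right now & how much memory each one is using
  ollama ps

  # Check overall memory pressure & how hard the system is working to free memory
  memory_pressure

  # See how much swap macOS is currently using
  sysctl vm.swapusage

If swap usage is climbing into the gigabytes every time you run a prompt, the model simply doesn't fit alongside everything else.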
The Fix:
  • Use Smaller Models: This is the easiest fix. Instead of reaching for the 7B or 8B models, try out some of the smaller ones. qwen2:0.5b or tinyllama are great options to see if your machine can handle the workflow (example commands just after this list). You'll be surprised at how capable some of these smaller models are for many tasks.
  • Try Quantized Models: Look for models with "q" in their names, like llama3:8b-instruct-q2_K. These are versions of the models that have been compressed. They're not quite as "smart" as the full versions, but they use significantly less RAM & can be a great compromise.
  • Close Everything Else: Before you fire up Ollama, close every other application you don't absolutely need. Every browser tab is eating up precious memory.
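Here's roughly what that looks like in practice, using the models mentioned above (swap in whatever tags suit your machine; the prompt is just a placeholder):

  # Pull & run a tiny model that fits comfortably alongside everything else in 8GB
  ollama pull qwen2:0.5b
  ollama run qwen2:0.5b "Summarize the plot of Moby Dick in two sentences."

  # Or try a heavily quantized build of a bigger model as a middle ground
  ollama pull llama3:8b-instruct-q2_K
  ollama run llama3:8b-instruct-q2_K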

The Sneaky Culprit: macOS Memory Management

Okay, so let's say you have a beast of a Mac with 32GB or even 64GB of RAM. You should be flying, right? But you're still seeing frustrating delays, especially with the really big 70B models. This one is a bit more of a deep cut.
It turns out, macOS is VERY aggressive about how it manages memory, especially VRAM (on Apple Silicon, that's really just the slice of unified memory wired up for the GPU). Even if you tell Ollama to keep a model loaded in memory forever (OLLAMA_KEEP_ALIVE=-1), macOS might decide it knows better. It can silently offload the model from that fast, GPU-wired memory back to your regular system RAM.
When you go to ask your next question, there’s a delay of a few seconds while the system shuffles that massive model back into VRAM. It's not a huge delay, but it's enough to make the experience feel laggy & unresponsive. You might notice your memory usage spike, then quickly drop after you get a response – that's macOS "helpfully" clearing out the VRAM.
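For reference, here's how you'd set OLLAMA_KEEP_ALIVE=-1 in the first place. This is a sketch covering the two common setups; which line applies depends on whether you use the Ollama menu-bar app or start the server yourself:

  # If you use the Ollama macOS app: set the variable for GUI apps, then quit & reopen Ollama
  launchctl setenv OLLAMA_KEEP_ALIVE -1

  # If you start the server manually from a terminal instead:
  OLLAMA_KEEP_ALIVE=-1 ollama serve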
The Fix (This is a game-changer):
You can actually tell macOS to dedicate more of your unified memory to the GPU. This makes it less likely to offload your Ollama model.
You'll need to open up the Terminal app & run a command. Be careful here, as you're changing a system setting.
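As a sketch of what that command looks like: on recent versions of macOS (Sonoma & later), the setting people generally reach for is the iogpu.wired_limit_mb sysctl, which caps how much unified memory the GPU is allowed to wire. The 24576 below is just an illustrative value for a 32GB machine; leave a generous chunk for macOS itself, & keep in mind the setting resets when you reboot.

  # Let the GPU wire up to 24GB of unified memory (example value for a 32GB Mac)
  sudo sysctl iogpu.wired_limit_mb=24576

  # Check what the limit is currently set to
  sysctl iogpu.wired_limit_mb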
