8/10/2025

So You Want to Use a Local LLM with Claude Code? Here's the Real Deal.

Hey everyone, hope you're doing well. There's been a TON of buzz lately about the new open-source models, especially the absolute powerhouse that is Z.ai's GLM 4.5. It's topping leaderboards, showing incredible coding & reasoning skills, & generally getting people in the local LLM community REALLY excited.
And naturally, when a hot new model drops, the first question is: "How do I hook this up to the best tools?"
A question I've seen pop up is, "How do I use GLM 4.5 with Claude Code?" It makes sense. You see benchmarks where GLM 4.5 is tested with something called the "Claude Code agent framework," & you think, "Great, I'll download that framework & plug in my local model!"
Here's the thing, and I'm just going to be upfront about it: you can't.
But don't click away! The reason why you can't is important, & the solution is actually way more exciting. It involves building your own, private, super-powered coding assistant that's arguably even better because it's all yours.
Let's dig in.

First, What Exactly is "Claude Code"?

So, this is the source of the confusion. "Claude Code" isn't a universal tool or an open-source framework you can just download. It's one of two things, both proprietary to Anthropic (the creators of the Claude models):
  1. A Specific Product: Anthropic has a service called Claude Code. You can use it on their platform for advanced coding tasks, like automating security reviews. It's a finished product, like ChatGPT or Claude itself.
  2. An Internal Agent Framework: When Anthropic benchmarks their own models (like Claude Opus 4), they use a sophisticated internal system to get the best possible performance. They describe this as using Claude Code as the agent framework. Think of it as their secret sauce: a highly optimized, internal-only agent that knows how to perfectly prompt & interact with their models to solve complex coding problems.
So, when you see a benchmark saying a model was tested "with the Claude Code framework," it doesn't mean the framework is a separate, usable thing. It means the model was put through its paces by Anthropic's own best-in-class, private testing harness.
The key takeaway is you can't pour your own engine (GLM 4.5) into their car (Claude Code). But what you can do is build your own car. And honestly, it's a blast.

Meet Your New Engine: GLM 4.5

Before we get into the "how-to," let's just appreciate what a beast GLM 4.5 is. Released in the summer of 2025 by Z.ai (formerly Zhipu AI), this model family turned a lot of heads.
Here's the quick rundown:
  • It's a Mixture-of-Experts (MoE) model: This is a fancy way of saying it's incredibly efficient. The full GLM 4.5 model has 355 billion parameters, but it only activates about 32 billion of them (around 9%) for any given token it generates. This gives you the power of a massive model with speeds closer to a much smaller one.
  • It Has a "Thinking" Mode: GLM 4.5 has a unique dual-mode system. For simple, quick questions, it uses a "non-thinking" mode for instant answers. But for complex tasks like coding or tough logic problems, it switches into a "thinking" mode, allowing it to reason more deeply before responding. This is a big deal for getting high-quality, well-structured code.
  • It's an Agentic Powerhouse: The model was specifically designed for "agentic" tasks. This means it's great at using tools, calling functions, & executing multi-step plans. Its tool-calling success rate is over 90%, beating out many top proprietary models. This is PERFECT for a coding assistant that needs to do more than just write code; it needs to understand a project's structure, run commands, & debug. (For a concrete picture of what a tool-calling request looks like, see the sketch right after this list.)
  • It's Open & Available: The model weights are available on Hugging Face, & it has day-one support from key open-source inference engines. This is why we can have this conversation at all!
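To make that agentic point concrete, here's a rough sketch of a tool-calling request against an OpenAI-compatible endpoint, which is exactly what the local vLLM server we set up below exposes. Everything here is illustrative: the glm-4.5 model name, the localhost:8000 port, & the run_shell_command tool are placeholders I've made up, not anything shipped with GLM 4.5. When the server's tool parsing is enabled, the model can answer with a structured tool call instead of plain text:
    # Hypothetical request to a locally hosted GLM 4.5 behind an OpenAI-compatible API
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "glm-4.5",
        "messages": [
          {"role": "user", "content": "List the Python files in the src directory"}
        ],
        "tools": [{
          "type": "function",
          "function": {
            "name": "run_shell_command",
            "description": "Run a shell command in the project directory and return its output",
            "parameters": {
              "type": "object",
              "properties": {"command": {"type": "string"}},
              "required": ["command"]
            }
          }
        }]
      }'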
Okay, so we have our engine. Now let's build the rest of the car.

Part 1: Running GLM 4.5 Locally (Your Personal AI Server)

The first step is to get the model running on your own hardware so you can send it requests. This is like setting up your own private API endpoint that's not rate-limited or controlled by anyone else. You have two main paths here depending on your hardware.

Option A: The Power User Setup with vLLM (For Serious Hardware)

If you have a rig with some serious GPU VRAM (we're talking one or more high-end NVIDIA cards like a 3090, 4090, or professional-grade A100s/H100s), the best way to run GLM 4.5 is with vLLM.
vLLM is a super-fast inference engine that's become the standard for high-throughput serving. But for GLM 4.5, there's a small catch: you need to build vLLM from source to get the best performance & full tool-calling support.
A Reddit user in the r/LocalLLaMA community shared a fantastic guide, & here's the gist of it:
  1. Clone the vLLM Repo:
    git clone https://github.com/vllm-project/vllm.git
  2. Install from Source:
    cd vllm
    & then follow their instructions to build it. It usually involves a couple of pip install commands (for example, pip install -e . run from inside the repo). This ensures you have the very latest updates that support GLM 4.5's unique architecture.
  3. Create a Chat Template: GLM 4.5's "thinking" mode can sometimes be a bit chatty for direct API calls. The community created a custom Jinja chat template to disable this for more predictable, tool-like behavior. You save this as a file (e.g., glm-4.5-nothink.jinja).
  4. Launch the Server: You then run a command that looks something like this:
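    (A quick sketch, not gospel: the Hugging Face model ID, GPU count, & template filename below are placeholders, so swap in whatever matches your hardware & the GLM 4.5 variant you downloaded.)
    # Serve GLM 4.5 behind an OpenAI-compatible API on port 8000
    vllm serve zai-org/GLM-4.5-Air \
      --tensor-parallel-size 4 \
      --chat-template ./glm-4.5-nothink.jinja \
      --served-model-name glm-4.5 \
      --port 8000
    Once it's running, curl http://localhost:8000/v1/models should list your model, & http://localhost:8000/v1 becomes the private endpoint you point your coding tools at. If you also want vLLM to parse tool calls automatically, add --enable-auto-tool-choice plus a --tool-call-parser flag; check the vLLM docs for the parser name that matches GLM 4.5.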
