Can You REALLY Pool VRAM from NVIDIA and Intel GPUs for a Local LLM? The Surprising Answer.
Zack Saadioui
8/12/2025
So, you’re trying to run a beefy Large Language Model (LLM) on your local machine. You’ve got a pretty decent NVIDIA card, but that new 70-billion parameter model is just laughing at your 12GB of VRAM. Then you look over at your Intel CPU & see that it has integrated graphics, or maybe you even have a dedicated Intel Arc card sitting in a spare slot. A brilliant, desperate idea pops into your head: "Can I just... smush them together? Can I pool the VRAM from my NVIDIA & Intel GPUs to create one giant memory space?"
It's the dream, right? Combining the 12GB from your RTX 3080 with the 16GB from an Intel A770 to get a whopping 28GB of VRAM. You'd be able to run some seriously powerful models.
Well, I've got good news & I've got some "let's be realistic" news.
The short answer is no, you can't pool VRAM in the way you're thinking. There’s no magic button to make Windows or Linux see your NVIDIA & Intel GPUs as one single, unified graphics card with combined memory. They have fundamentally different architectures, different drivers (this is a BIG one), & different ways of talking to the system.
But here’s the thing, & this is where it gets exciting: you can still USE both GPUs to run a single, large LLM. It's not called "pooling," but rather "distributed inference" or "model sharding." & honestly, it's pretty cool that it works at all.
So, let's dive into how this actually works, what the catches are, & why you might want to give it a shot anyway.
The Real Solution: Distributed Inference, Not VRAM Pooling
Forget the idea of a single VRAM pool. Instead, think of it like a team of specialists. You have a model that's too big for any single GPU to handle. So, you split the model's layers—its "brain"—into chunks. You tell your powerful NVIDIA card, "Hey, you handle these first 40 layers," & then you tell your Intel card, "You take the next 40 layers."
When you send a prompt to the LLM, the data gets processed by the first GPU, which then passes its results over to the second GPU to continue the work. They're not sharing memory, but they are sharing the workload. This process of splitting the model across different hardware is what we call distributed inference.
The key is getting them to talk to each other. Because they speak different languages (NVIDIA's CUDA vs. Intel's oneAPI/SYCL, Vulkan, or OpenCL), you need a universal translator, or at least a very clever manager.
The "How-To": Making NVIDIA & Intel Work Together with
1
llama.cpp
The most promising & flexible way to pull this off right now is by using a tool that many of you in the local LLM scene are probably familiar with:
1
llama.cpp
. It's a powerhouse for running LLMs efficiently, & it has some clever tricks up its sleeve.
The secret sauce here is llama.cpp's RPC (Remote Procedure Call) backend. This sounds complicated, but here's the gist of it: you can run different parts of llama.cpp as separate server instances, & have them communicate over your local network (even within the same machine).
Here’s a conceptual walkthrough of how you’d set this up. It's a bit technical, but totally doable.
Step 1: The Two Builds
The first thing you need to do is create two separate builds of llama.cpp. One build will be specifically for your NVIDIA GPU, & the other for your Intel GPU.
NVIDIA Build: You'll compile llama.cpp with the CUDA backend enabled. This is the standard way to get the best performance out of NVIDIA cards. During the cmake process, you'll specifically enable CUDA support.
Intel Build: For your Intel Arc card or integrated graphics, you'll compile a separate version of llama.cpp with the Vulkan backend enabled. Vulkan is a graphics & compute API that is cross-platform & supported by Intel, AMD, & even NVIDIA. This build will be your "Intel specialist."
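For reference, here's a rough sketch of what those two builds might look like on Linux. The cmake flag names have changed across llama.cpp versions (older releases used LLAMA_CUBLAS for CUDA, for example), so treat these as a starting point & check the project's current build docs:

# NVIDIA build: CUDA backend, plus the RPC backend we'll need in Step 2
cmake -B build-cuda -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build-cuda --config Release

# Intel build: Vulkan backend, plus the same RPC backend
cmake -B build-vulkan -DGGML_VULKAN=ON -DGGML_RPC=ON
cmake --build build-vulkan --config Release

Note that both builds enable GGML_RPC, because that's what produces the rpc-server binary each side needs in the next step.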
Step 2: Start the RPC Servers
Now, you'll launch two rpc-server instances from your compiled llama.cpp folders. Each server will be tasked with controlling one of your GPUs.
You'll start the rpc-server from your NVIDIA build & tell it to use the CUDA backend. You'll assign it a specific port number, let's say 50051.
Then, you'll start another rpc-server from your Intel/Vulkan build. This one will use the Vulkan backend to control the Intel GPU. You'll assign it a different port, like 50052.
At this point, you have two separate processes running, each one ready to accept work for its designated GPU.
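As a sketch, those two launches might look something like this (the -p flag sets the listening port, & the exact binary paths depend on where your builds landed). On a machine with more than one NVIDIA GPU you may also need something like CUDA_VISIBLE_DEVICES to steer the CUDA server at the right card:

# Terminal 1: RPC server backed by the CUDA build (your NVIDIA card)
./build-cuda/bin/rpc-server -p 50051

# Terminal 2: RPC server backed by the Vulkan build (your Intel card)
./build-vulkan/bin/rpc-server -p 50052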
Step 3: Run the Main Client
This is where it all comes together. You'll launch the main llama.cpp command-line interface (or server) but with a special argument. Instead of telling it to use a local GPU, you'll point it to the RPC servers you just started.
You'd use a command that includes something like --rpc 127.0.0.1:50051,127.0.0.1:50052 (the servers are passed as a single comma-separated list).
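Put together, the full client invocation might look roughly like the sketch below. llama-cli is the current name of the main CLI binary; the model filename & prompt are just placeholders:

# Launch the client from either build; -ngl 99 offloads as many layers as possible,
# & --rpc tells llama.cpp to treat the two servers as extra GPU devices
./build-cuda/bin/llama-cli \
  -m ./models/llama-3-70b-q4_k_m.gguf \
  -ngl 99 \
  --rpc 127.0.0.1:50051,127.0.0.1:50052 \
  -p "Explain distributed inference in one short paragraph."

If you'd rather nudge how much of the model lands on each card than let it decide automatically, llama.cpp also has a --tensor-split option, though how it interacts with RPC endpoints can vary between versions.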
When llama.cpp loads the model, it will see the available VRAM on BOTH RPC endpoints. It will then automatically split the model's layers across them. For example, it might put roughly 11GB of the model onto your 12GB NVIDIA card via the first RPC server, & another 15GB onto your 16GB Intel card via the second one.
And just like that, you're running a massive model that neither card could handle on its own. The main llama.cpp client manages the whole process, sending the right data to the right GPU at the right time. The performance overhead is surprisingly small for many use cases, especially considering you're now able to run models you simply couldn't before.
What About Other Tools like Ollama?
Ollama is another fantastic tool for running local LLMs, known for its simplicity. It does support multi-GPU setups. However, its support for heterogeneous (mixed-vendor) setups is a bit more of a mixed bag.
There are user reports & GitHub issues showing that Ollama can sometimes get confused in a mixed-GPU environment. For instance, it might only see the GPU with the smaller amount of VRAM, or default to the integrated GPU instead of the more powerful discrete one. While you can sometimes force it to use a specific GPU using environment variables like CUDA_VISIBLE_DEVICES, getting it to intelligently split a model across an NVIDIA & Intel card isn't as straightforward or well-documented as the llama.cpp RPC method.
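For what it's worth, pinning Ollama to a particular NVIDIA card usually comes down to setting that variable before starting the server, something like the snippet below (the 0 is the device index reported by nvidia-smi). Steering the Intel side is far less consistent & depends on how your Ollama build was compiled:

# Expose only the first NVIDIA GPU to Ollama's CUDA runner
CUDA_VISIBLE_DEVICES=0 ollama serve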
It's an area of active development, so this could improve. But for now, if you're serious about a mixed-vendor setup, llama.cpp seems to be the most reliable & flexible route.
Why True VRAM Pooling Is Just a Dream (For Now)
So why can't we just have that single, glorious pool of VRAM? The technical hurdles are pretty significant.
Driver Conflicts & Architecture: This is the biggest wall. NVIDIA's entire AI ecosystem is built around CUDA, a proprietary software platform. Intel has its own set of drivers & is pushing open standards like Vulkan & oneAPI. These driver stacks are complex pieces of software that manage everything from memory allocation to task scheduling. Trying to make them coexist & share resources at a low level on the same system can lead to instability, crashes, or one simply not working. They just weren't designed to play nice in the same sandbox.
Memory Management is Different: How an NVIDIA GPU manages its VRAM is different from how an Intel GPU does. They have different internal structures & access patterns. Trying to create a single, coherent memory space that both can access without massive performance penalties would require a level of cooperation & standardization that simply doesn't exist between competing hardware vendors.
The Physical Connection Bottleneck: Even if you could solve the software issues, you have a physics problem. GPUs on a motherboard communicate with each other over the PCI Express (PCIe) bus. While modern PCIe generations are fast (a PCIe 4.0 x16 slot tops out around 32 GB/s), that's still an order of magnitude slower than a GPU reading its own onboard VRAM (roughly 760 GB/s on an RTX 3080, for example). High-end NVIDIA data center GPUs use a special high-speed interconnect called NVLink to share data between cards at incredible speeds. Today's consumer cards don't have this, so any data passed between your NVIDIA & Intel card has to take the relatively slow PCIe highway. This introduces latency, which brings us to our next point.
Let's Talk Performance: The Reality of a Mixed-GPU Setup
Okay, so you've got your llama.cpp RPC setup running. What should you expect?
First, the amazing part: you can load a model that you couldn't before. That's a HUGE win.
However, it won't be as fast as running a model on a single GPU that has enough VRAM to hold the entire thing. Every time the model needs to transition from a layer on the NVIDIA card to a layer on the Intel card, that data has to be passed over the PCIe bus. This adds a small amount of latency to each token that's generated.
The performance will also likely be bottlenecked by your slowest GPU. If your NVIDIA card can process its layers in 5 milliseconds but your Intel card takes 15 milliseconds, the overall speed will be closer to the 15ms mark. The whole chain is only as fast as its slowest link.
That said, for many people, the trade-off is absolutely worth it. A slightly slower response time is infinitely better than no response at all because the model won't even load. It turns two "okay" GPUs into one "pretty darn capable" LLM-running machine.
The Bigger Picture: Heterogeneous Computing is the Future
This whole endeavor, while a bit of a hack for consumers, is part of a much larger trend in computing. Data centers have been using heterogeneous architectures for years, mixing different types of processors (CPUs, GPUs, TPUs) to handle massive AI workloads efficiently. Companies are realizing that a one-size-fits-all approach to hardware isn't the most efficient or cost-effective.
What you're doing with your mixed-GPU setup is a grassroots version of what the big cloud providers are doing at scale. As AI models continue to grow, we're going to see more software & hardware solutions designed specifically for these kinds of mixed environments.
For many businesses, however, the complexity of managing a local, multi-GPU setup for something like a customer service chatbot can be a major distraction from their core operations. While it's a fun & powerful tool for developers & enthusiasts, deploying a production-ready AI requires reliability, scalability, & ease of use. This is where managed solutions come in. For instance, Arsturn helps businesses bypass this hardware headache entirely. It allows you to build a no-code AI chatbot that's trained on your own company data. You get the power of a custom AI that can provide instant customer support, answer questions, & engage with website visitors 24/7, without ever having to worry about VRAM, drivers, or RPC servers. It's a practical business solution for a business problem.
The Final Word
So, can you pool VRAM from mixed NVIDIA & Intel GPUs? Nope.
But can you use them in tandem to run a single, massive LLM? ABSOLUTELY.
Thanks to the cleverness of projects like llama.cpp, you can use a distributed inference approach to split the workload between your cards. It requires a bit of technical setup, compiling separate builds & running RPC servers, but it's a powerful way to breathe new life into your existing hardware. You're essentially building your own mini heterogeneous computing cluster.
You'll take a slight performance hit compared to a single, giant GPU, but you'll unlock the ability to run models that were previously out of reach. For the AI enthusiast on a budget, that's a trade worth making any day of the week.
Hope this was helpful! Let me know if you've tried a setup like this or have any other questions. It's a fascinating area & we're all learning as we go.