What VPS Specs Do You Need to Run a 7B, 13B, or 70B Model?

June 1, 2026
10 min read

How much RAM, VRAM, and disk you need to self-host a 7B, 13B, or 70B open-weight model, with cited memory figures and the speed you can expect.

What VPS Specs Do You Need to Run an LLM? | Carpathian Pubs

The VPS specs to run an LLM come down to one calculation: parameter count times bytes per parameter (set by the quantization level) equals the memory the weights need. As a starting point, a 7B model fits comfortably in about 8 GB of RAM or VRAM at 4-bit, a 13B needs roughly 16 GB, and a 70B needs about 48 GB of VRAM or 64 GB of system RAM. The 8, 16, and 64 GB RAM figures are the minimums Ollama publishes for 7B, 13B, and 70B models (Ollama). You can run any of them on CPU and system RAM alone, but a GPU is what makes them fast. Everything else (context length, disk, and the runtime you choose) adjusts those figures up or down from there.

If you are still deciding whether to self-host at all, weigh it against an API first in our self-hosted LLM vs API cost guide. For picking the underlying machine, see how to choose a VPS, and for a broader hosting primer, which hosting type is best for small websites. This guide is the sizing piece: what RAM, VRAM, disk, and speed each model size demands, with the math so you can adjust for your own setup.

What determines how much memory an LLM needs?

Two things set the memory bill: the number of parameters and how many bits each parameter is stored in. A model in full 16-bit precision needs about 2 bytes per parameter, so a 7B model takes roughly 14 GB just to hold the weights (Hugging Face forums). Compress, or quantize, those weights to 4 bits and the same model drops to around 4 GB. That is the whole game.

Quantization is the lever that makes self-hosting practical. It stores each weight in fewer bits, trading a small amount of accuracy for a large cut in memory. The common levels you will see, from heaviest to lightest:

  • FP16 (16-bit): about 2 bytes per parameter. Highest fidelity, highest memory.
  • Q8 (8-bit): about 1 byte per parameter. Roughly half the memory of FP16, with accuracy loss small enough that most people cannot tell.
  • Q4 (4-bit, usually Q4_K_M): about 0.5 bytes per parameter. Roughly a quarter of FP16, and the default most people run because the quality hit is modest while the savings are large.

The rule of thumb worth memorizing: FP16 needs about 2 GB of memory per billion parameters, Q8 needs about 1 GB, and Q4 needs about 0.5 GB (LocalLLM.in). Multiply by your model size, add some overhead for the runtime and the context window, and you have your number.
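
As a rough sketch of that rule of thumb, the arithmetic can be written out in Python. The function name, the lookup table, and the flat 2 GB overhead default are illustrative assumptions, not figures from any runtime; real overhead depends on context length and the software you use.

```python
# Approximate bytes per parameter for each precision level (rule of thumb above).
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def estimate_memory_gb(params_billion: float, quant: str, overhead_gb: float = 2.0) -> float:
    """Weights plus a flat allowance for the runtime and context window.

    overhead_gb is an assumed placeholder, not a measured value.
    """
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + overhead_gb


for size in (7, 13, 70):
    for quant in ("q4", "q8", "fp16"):
        print(f"{size}B {quant}: ~{estimate_memory_gb(size, quant):.1f} GB")
```

For a 7B at Q4 this gives 3.5 GB of weights before overhead, a little under the "around 4 GB" quoted earlier, so treat the rule as a floor rather than an exact file size.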

How much RAM or VRAM does a 7B model need?

A 7B model is the easy case. At 4-bit quantization the weights are about 4 GB, and Ollama lists 8 GB as the working minimum, which leaves room for the operating system, the runtime, and a normal context window (Ollama). It is the size that runs on a modest laptop, a small GPU, or a CPU-only VPS without drama.

Concrete figures for a 7B model:

  • Q4 (4-bit): roughly 4 to 6 GB of memory in practice. An 8 GB machine handles it, though 8 GB leaves almost no headroom and can start swapping under load, so 12 to 16 GB is more comfortable (Local AI Master).
  • Q8 (8-bit): about 7 to 8 GB for the weights, so plan on 12 GB or more total (LocalLLM.in).
  • FP16: about 14 GB for the weights alone, so a 16 GB GPU or 24 GB of system RAM is the sensible floor (Hugging Face forums).

For most self-hosting, a 7B at Q4 on a VPS with 16 GB of RAM and a small GPU is a sane default. If you have a GPU with 8 GB or more of VRAM, the whole model fits on the card and runs fast.

How much RAM or VRAM does a 13B model need?

A 13B model roughly doubles the 7B numbers. Ollama lists 16 GB as the minimum, which holds a 4-bit 13B comfortably (Ollama). The model is meaningfully more capable than a 7B at reasoning and instruction-following, and the memory cost is still well within reach of a single mid-range machine.

Concrete figures for a 13B model:

  • Q4 (4-bit): about 8 to 10 GB for the weights, so 16 GB of memory is the practical target (Local AI Master).
  • Q8 (8-bit): about 13 to 14 GB, which fits on a 16 GB GPU or comfortably in 24 GB of system RAM.
  • FP16: about 26 to 28 GB for the weights, which pushes you to a 32 GB-plus GPU or 48 GB of system RAM.

The sweet spot for a 13B is a 16 GB GPU running Q4, or a CPU-only VPS with 24 to 32 GB of RAM if you can tolerate lower speed. If you want the full model on a single consumer card without offloading, Q4 is what gets you there.

How much RAM or VRAM does a 70B model need?

A 70B model is where hardware stops being casual. Ollama lists 64 GB of RAM as the minimum, and at 4-bit the weights alone are around 40 GB before you add context (Ollama). You can run it CPU-only on a VPS with enough RAM, but doing it fast means either a large GPU or splitting the model across multiple cards.

Concrete figures for a 70B model:

  • Q4 (4-bit): roughly 38 to 48 GB depending on context length, which is why 64 GB of RAM is the floor and a single 48 GB GPU (or two 24 GB cards) is the GPU path (Local AI Master).
  • Q8 (8-bit): about 70 GB for the weights, so you are into 80 GB-plus of memory or multi-GPU territory.
  • FP16: roughly 140 GB, which is firmly multi-GPU server hardware and rarely worth it over Q8 for self-hosting.

Many people run 70B at Q4 by offloading some layers to the GPU and keeping the rest in system RAM. On an RTX 4090 with 40 of 80 layers offloaded, a 70B at Q4 has been measured at about 18 tokens per second using 23 GB of VRAM, with the rest served from RAM (LocalLLM.in). It works, but it is slower and fussier than a 7B or 13B that fits entirely on one card.
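
To see roughly where a split like that comes from, here is a minimal sketch that estimates how many layers fit on a card. It assumes every layer is the same size (only approximately true) and holds back an assumed 4 GB of VRAM for the KV cache and runtime buffers; the function name and the reserve value are hypothetical, chosen for illustration.

```python
import math


def layers_that_fit(vram_gb: float, weights_gb: float, n_layers: int, reserve_gb: float) -> int:
    """Estimate how many layers fit on the GPU.

    Assumes uniform layer size; reserve_gb is an assumed allowance
    for the KV cache and runtime buffers.
    """
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# 70B at Q4: about 40 GB of weights over 80 layers, on a 24 GB card.
print(layers_that_fit(vram_gb=24, weights_gb=40, n_layers=80, reserve_gb=4))  # prints 40
```

In llama.cpp, a count like this is what you would pass as the number of GPU layers (`--n-gpu-layers`). Start from the estimate and adjust down if the runtime reports running out of VRAM.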

CPU-only or GPU: which do you need?

You can run any of these models on CPU and system RAM with no GPU at all. The catch is speed. A GPU is not required for the model to load; it is what makes generation fast enough to feel interactive. A 7B on a CPU might produce a dozen tokens a second, while the same model on a GPU produces anywhere from about 40 to more than a hundred, depending on the card.

On CPU alone, expect modest speeds. A Llama 3.1 8B at Q4 on a desktop CPU with twelve threads has been measured at about 14 tokens per second, and a 70B on a 64 GB machine drops to under 1 token per second, which is too slow for interactive use (llama.cpp discussion). CPU-only is fine for batch jobs, overnight processing, or light personal use where you do not mind waiting.

On a GPU, the numbers jump. A 7B at Q4 runs at roughly 40 tokens per second on an 8 GB card and well over 100 on a high-end card (LocalLLM.in). Apple Silicon sits in between thanks to its unified memory: a 7B at Q4_0 has been benchmarked from about 14 tokens per second on an M1 up to 94 on an M2 Ultra (llama.cpp discussion).

The practical guidance: use CPU-only for a small model you run occasionally or for background work where latency does not matter. Use a GPU when you need responses to feel instant, when you are serving more than one user, or when the model is large enough that CPU speed becomes unworkable.

How does context length change the math?

Context length adds memory on top of the weights, through something called the KV cache: the model's short-term memory of the conversation so far, which grows with every token in the context window. Longer context means a bigger cache, and on large models that can add gigabytes. It is the spec people most often forget when they size a machine.

For smaller models the addition is modest. On models in the 9B to 27B range, going from 8K to 32K of context adds roughly 1 to 2 GB for the cache, and pushing to 64K adds 2 to 4 GB (LocalLLM.in). For a 70B the cache is the dominant variable at long context: a 70B with grouped-query attention needs roughly 10 GB of cache at 32K and around 40 GB at 128K, which can rival the weights themselves (DigitalApplied).
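
The 70B figures can be reproduced with a short calculation. The sketch below assumes a Llama-style 70B shape (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and a 16-bit cache; those architecture values are assumptions for illustration, so check your model's config before relying on them.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB: one key and one value vector per layer per token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens / 1024**3


# Assumed Llama-style 70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gib(80, 8, 128, ctx):.1f} GiB")
```

With these assumptions the cache comes to 10 GiB at 32K tokens and 40 GiB at 128K, in line with the figures above. Runtimes that quantize the cache to 8 bits would halve those numbers (`bytes_per_value=1`).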

The takeaway: size for the context you will use, not the maximum the model supports. If you only ever feed it short prompts, you can run a tighter machine. If you plan to push long documents through a 70B, budget extra memory for the cache and do not assume the weight figures are the whole story.

How much disk space do the model weights need?

Disk is the simplest spec to plan, because the model file on disk is roughly the same size as the weights in memory. A 4-bit 7B file is about 4 GB, a 4-bit 13B is about 8 GB, and a 4-bit 70B is around 40 GB. Higher precision means bigger files, in the same proportions as memory.

Plan disk like this:

  • 7B: about 4 GB at Q4, 8 GB at Q8, 14 GB at FP16.
  • 13B: about 8 GB at Q4, 14 GB at Q8, 26 GB at FP16.
  • 70B: about 40 GB at Q4, 70 GB at Q8, 140 GB at FP16.

Add headroom beyond the single file you intend to serve. Most people download several quantizations to compare, keep a base model alongside fine-tunes, and want room for the runtime and logs. A VPS with 50 to 100 GB of disk covers 7B and 13B work comfortably, and a 70B at Q4 wants closer to 100 GB once you account for more than one file. SSD over spinning disk matters mainly for load time, since the model is read into memory once at startup.
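
A small sketch of that planning step: add up the files you expect to keep, add headroom, and compare the total against the free space on the volume. The 20 GB headroom default and both function names are assumptions for illustration.

```python
import shutil


def disk_needed_gb(file_sizes_gb: list[float], headroom_gb: float = 20.0) -> float:
    """Sum of the model files to keep, plus an assumed allowance for runtime and logs."""
    return sum(file_sizes_gb) + headroom_gb


def has_room(path: str, needed_gb: float) -> bool:
    """True if the filesystem holding `path` has at least needed_gb free."""
    return shutil.disk_usage(path).free >= needed_gb * 1e9


# A 70B at Q4 (~40 GB) kept next to a 13B at Q4 (~8 GB) and a 7B at Q8 (~8 GB).
plan = disk_needed_gb([40, 8, 8])
print(f"Plan for about {plan:.0f} GB; enough free here: {has_room('.', plan)}")
```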

What tokens per second should you expect?

Tokens per second (tps) is how fast the model writes its answer, and it is the number that decides whether self-hosting feels usable. Roughly, a comfortable reading pace is around 7 to 10 tps, so anything above that feels responsive. The figure depends heavily on hardware, model size, and quantization, so treat these as ballpark.

Reasonable expectations:

  • 7B at Q4 on a GPU: about 40 tps on an entry 8 GB card, 100-plus on a high-end card (LocalLLM.in).
  • 7B at Q4 on CPU: roughly 12 to 15 tps with a modern multi-core CPU (llama.cpp discussion).
  • 70B at Q4 with GPU offload: around 18 tps on a single high-end consumer card with partial offload (LocalLLM.in).
  • 70B at Q4 on CPU only: under 1 tps, slow enough to be a batch tool, not a chat one (llama.cpp discussion).
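
To turn those rates into waiting time, divide the length of the answer by the speed. The sketch below uses a 300-token answer as an assumed example length and 0.8 tps as a stand-in for "under 1"; the other rates are the ballpark figures from the list above.

```python
def seconds_to_answer(answer_tokens: int, tokens_per_second: float) -> float:
    """Time to generate an answer, ignoring prompt processing and model load."""
    return answer_tokens / tokens_per_second


# Ballpark rates from the list above; 0.8 is an assumed value for "under 1 tps".
scenarios = {
    "7B Q4, entry GPU": 40,
    "7B Q4, CPU": 14,
    "70B Q4, partial offload": 18,
    "70B Q4, CPU only": 0.8,
}

for label, tps in scenarios.items():
    print(f"{label}: about {seconds_to_answer(300, tps):.1f} s for a 300-token answer")
```

The same answer that takes seconds on a GPU takes several minutes on a CPU-only 70B, which is the practical meaning of "a batch tool, not a chat one."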

If interactive use is the goal, a 7B or 13B on a GPU is the reliable path. A 70B is for when output quality justifies the cost and you can give it the hardware to keep pace.

The honest limitation

Every number here is a starting point, not a guarantee. Memory use and speed vary by runtime (llama.cpp, Ollama, vLLM, and others all manage memory differently), by the exact quantization variant (Q4_K_M is not identical to Q4_0), by context length, by batch size, and by how the model was built. The figures above come from vendor docs and reputable benchmarks, but the only way to know your number is to run your model, at your quantization, with your context, on your hardware, and measure it.
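
One way to measure your own number, sketched here for Ollama: its generate endpoint returns `eval_count` (tokens generated) and `eval_duration` (in nanoseconds) in a non-streaming response, and the ratio is your tokens per second. The server address below is Ollama's usual local default, and the model name and prompt in the usage comment are placeholders; other runtimes report timing differently.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response; eval_duration is in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model: str, prompt: str,
            url: str = "http://localhost:11434/api/generate") -> float:
    """Send one non-streaming request to a local Ollama server and return its tps."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))


# Usage (needs a running Ollama server with the model already pulled):
# print(f"{measure('your-model-name', 'Explain quantization in two sentences.'):.1f} tokens/s")
```

Run it a few times with the prompt and context length you actually expect to use, since a single short prompt will flatter the result.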

So size with margin. Pick the model that fits your task, choose Q4 unless you have a clear reason to go heavier, give yourself headroom above the published minimum for the operating system and context, and confirm with a test before you commit to a plan. The best self-hosting setup is the one you stop thinking about: right-sized for the model you run, fast enough for the way you use it, and not paying for memory you never touch.

About the Author

Samuel Malkasian | Founder

Samuel Malkasian is the founder and lead cloud architect at Carpathian, where he designed the platform's core architecture along with a range of client enterprise systems and open-source tools for AI workflows and integration. He serves as a Cyber Warfare Officer in the U.S. Army and has a background in machine learning and data science. He is currently focused on building AI infrastructure that is secure, efficient, and low-power by design.

Related Topics

llm hosting, self-hosted ai, vps, quantization, vram, ollama, llama.cpp