MS Mehmet Sarı Solution architecture notes

From 8GB to 70B: A Real Hardware Guide for Local LLMs

A guide delving into the hardware requirements for running local Large Language Models (LLMs), exploring VRAM, quantization, and performance.


Running large language models (LLMs) locally has recently attracted growing interest from both individual developers and organizations. For anyone who wants to run LLMs on their own servers or computers, hardware requirements are critical. The amount of VRAM (video RAM, the memory on the graphics card) is one of the most significant factors, because it determines how large a model you can load and how fast it runs. In this guide, we will look at the hardware specifications, quantization techniques, and performance expectations for running local LLMs, covering a spectrum from 8GB of VRAM up to models with 70 billion parameters.

Running models trained or fine-tuned with our own data locally, instead of on remote servers, offers significant advantages in terms of privacy, cost, and control. However, this requires making the right hardware investment. LLMs require substantial memory and processing power in proportion to their parameter counts. Memory for the weights grows roughly linearly with model size: doubling the parameter count at the same precision roughly doubles the memory needed. In this article, I will explain step by step what you need to pay attention to in order to understand these requirements and choose the hardware that best suits your budget.

Starting with 8GB VRAM: Small Models and Limitations

8GB of VRAM can be considered a small amount by today’s standards. However, this does not mean it’s entirely insufficient to step into the world of local LLMs. With this level of VRAM, you can generally run models with fewer parameters or larger models that have undergone quantization. For instance, some quantized versions of 7 billion parameter (7B) models or 3 billion parameter (3B) models can be run on this hardware.

Your biggest limitation on this hardware is the size of the model itself: large models simply will not fit. The quantization level of the model therefore becomes very important. Quantization is the process of reducing memory usage by converting the model’s weights to lower-precision data types (e.g., INT8 or INT4 instead of FP16). When working with 8GB of VRAM, models quantized to 4 bits (Q4) will typically be your best option. This allows you to run larger models on low VRAM, albeit with a slight loss in model accuracy.

As an example, you can run the 8 billion parameter version of Llama 3, quantized to 4-bit, locally using the command ollama run llama3:8b-instruct-q4_K_M. Models of this type can be sufficient for tasks like simple chatbots, text summarization, or code completion. However, they will fall short on more complex analyses or long-form generation. There is also little VRAM left over for long contexts, and if part of the model spills into system RAM, inference slows down and you wait noticeably longer for each response.
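Once a model is running under Ollama, you can also call it from code instead of the terminal. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default local address and uses its /api/generate endpoint in non-streaming mode; the helper names build_generate_request and generate are my own, not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; change it if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running and the model already pulled, calling generate("llama3:8b-instruct-q4_K_M", "Summarize what VRAM is in one sentence.") should return the model’s answer as a string. The generous timeout is deliberate, since low-VRAM systems can take a while to respond.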

12GB - 16GB VRAM: The Ideal Space for Mid-Range LLMs

VRAM amounts between 12GB and 16GB can be considered the “sweet spot” for those looking to run local LLMs. This range allows you to run a wider variety of models. Specifically, quantized versions of 7B and 13B parameter models can be run smoothly. Even some aggressively quantized versions of 30B models can be tested on this hardware.

Within this VRAM range, it becomes possible to run models with 8-bit quantization (Q8) or more advanced 4-bit quantization formats (e.g., q4_K_M or q4_K_S). This allows you to optimize memory usage while maintaining a relatively high model accuracy. Inference speeds also increase noticeably compared to 8GB VRAM, offering a more interactive user experience.

For instance, running a 13 billion parameter model with a quantization level like q5_K_M can provide quite reasonable performance on a system with 16GB of VRAM. This can be sufficient for more complex question-answering tasks, creative writing, or generating longer code snippets. With this level of hardware, it also becomes possible to try multiple models simultaneously or experiment with faster iterations.

It’s important that the rest of your system is also adequate at this hardware level. A capable CPU, a fast SSD, and sufficient RAM (typically 32GB is recommended) will positively impact the overall performance of the LLM. Especially when parts of the model that don’t fit into VRAM are loaded into RAM or during data processing, the CPU’s role increases.

24GB - 32GB VRAM: High Performance and Large Models

24GB and 32GB of VRAM offer a serious performance level for running local LLMs. With this hardware, you can comfortably run quantized versions of 30B models and, with aggressive quantization or partial offloading to system RAM, even 70B parameter models. This means you can use models on your own computer that previously required high-end servers.

30B models generally run smoothly on this VRAM with 4-bit or 5-bit quantization. For 70B models, 4-bit weights alone take roughly 35-40GB, so fitting the model entirely in VRAM requires lower levels (Q3, Q2). For example, running a model like llama3:70b-instruct-q4_K_M on a system with 32GB of VRAM is only possible if some layers are offloaded to system RAM, at a cost in speed. Even so, this puts very capable models within reach.

With this level of VRAM, inference speeds are quite high for models that fit entirely on the GPU. This means you can get fast responses even in complex tasks, develop live applications, or analyze large datasets. With small and mid-sized models, you can also experiment with less aggressive quantization. The 70B case shows where the limit lies: running a 70B model in FP16 (16-bit floating-point) format requires approximately 140GB of VRAM, and 4-bit quantization reduces this need to about 35-40GB. 32GB of VRAM is therefore still just short of what a 4-bit 70B model needs.
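The 140GB and 35GB figures come from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. Here is a small sketch of that calculation. The function name weight_memory_gb is my own, and the result covers the weights only, not the context cache or runtime buffers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 10**9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"70B at {label}: ~{weight_memory_gb(70, bits):.0f} GB")
```

This prints roughly 140 GB, 70 GB, and 35 GB. Real 4-bit files come out somewhat larger than the nominal figure because they also store scaling factors next to the weights, which is why the practical range is closer to 35-40GB.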

At this hardware level, it’s important that the rest of your system supports this performance. A powerful multi-core CPU, fast NVMe SSDs, and ample system RAM (64GB or more is recommended) are essential. Especially when working with large models, model weights need to be transferred quickly from RAM to the GPU. Additionally, this level of hardware can also be used to combine multiple GPUs to run larger models or achieve even higher performance.

48GB VRAM and Beyond: Professional Use and Giant Models

48GB of VRAM and more is generally reserved for professional uses, research labs, and advanced users who want to run the largest LLMs. At 48GB, 70B models run with moderate quantization; with multiple cards adding up to well beyond that, it becomes possible to run them with minimal or no quantization.

Systems using multiple cards like NVIDIA RTX 4090 (24GB) or RTX 3090 (24GB), or professional cards like NVIDIA RTX A6000 (48GB), fall into this category. A single 48GB card, or two 24GB cards, can hold a 4-bit 70B model entirely in VRAM with room left for context. Running the same model in FP16, at its full accuracy, needs about 140GB, which means spreading it across several such cards.

At this level of hardware, you can not only perform inference but also fine-tune models. Fine-tuning is the process of further training an existing LLM on a specific task or dataset to improve its performance there. It requires significantly more VRAM and processing power than inference, because gradients and optimizer state must be held in memory alongside the weights. For example, fine-tuning a 70B model might require 80GB or more of VRAM, and full fine-tuning of all weights needs far more than that.
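To get a feel for why fine-tuning is so much heavier than inference, here is a rough sketch. It assumes full fine-tuning in mixed precision with the Adam optimizer, using the commonly cited accounting of 16 bytes per parameter, and it ignores activations entirely. The byte counts and function names are assumptions for illustration; parameter-efficient methods such as LoRA train only a small set of extra weights and land far below these numbers.

```python
# Assumed per-parameter byte counts for mixed-precision Adam training (activations excluded).
BYTES_FP16_WEIGHTS = 2
BYTES_FP16_GRADIENTS = 2
BYTES_FP32_MASTER_AND_ADAM = 12  # FP32 master copy plus two FP32 Adam moments, 4 bytes each


def full_finetune_memory_gb(params_billion: float) -> float:
    """Rough memory for weights, gradients, and optimizer state, in GB."""
    bytes_per_param = BYTES_FP16_WEIGHTS + BYTES_FP16_GRADIENTS + BYTES_FP32_MASTER_AND_ADAM
    return params_billion * bytes_per_param  # billions of params x bytes/param = GB


def fp16_inference_memory_gb(params_billion: float) -> float:
    """Memory for FP16 weights alone, in GB."""
    return params_billion * BYTES_FP16_WEIGHTS


for size in (7, 70):
    print(f"{size}B: inference ~{fp16_inference_memory_gb(size):.0f} GB, "
          f"full fine-tune ~{full_finetune_memory_gb(size):.0f} GB")
```

Under these assumptions the training state is eight times the size of the FP16 weights, which is why full fine-tuning of large models is done on multi-GPU servers rather than a single card.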

With this level of hardware, it’s possible to push the boundaries of LLMs. Multi-GPU servers that link data-center cards with technologies like NVLink reach hundreds of gigabytes of combined VRAM, and clusters of such servers reach terabytes, enough to run the largest models. Such systems are typically used for AI research, large-scale natural language processing projects, and complex analytical tasks. System RAM of 128GB or more is recommended, and high-speed PCIe 4.0 NVMe SSDs are strongly recommended so that large model files load quickly.

The Role of Quantization: VRAM Savings and Performance Balance

As I mentioned earlier, quantization plays a key role in running local LLMs. The primary goal of quantization is to reduce the model’s memory requirements. It does this by reducing the precision of the data types used to represent the model’s weights. For example, when you store a model’s weights as 4-bit integers (INT4) instead of 16-bit floating-point (FP16), you reduce memory usage by roughly 4 times.
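To make the idea concrete, here is a deliberately simplified sketch of quantization: each weight is mapped to a small signed integer using one shared scale, then mapped back. This is an illustration under my own assumptions, not the scheme Ollama or any specific file format uses; real formats work on small blocks of weights with per-block scales. The function names are hypothetical.

```python
def quantize_absmax(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88]
for bits in (8, 4):
    ints, scale = quantize_absmax(weights, bits)
    restored = dequantize(ints, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={ints}, worst round-trip error={worst:.4f}")
```

The 4-bit version stores each weight in a quarter of the space FP16 needs, and the round-trip error it prints is the accuracy you trade for that saving. The 8-bit error is much smaller, which matches the intuition that more bits stay closer to the original model.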

However, it would be misleading to think that quantization only provides memory savings. Generating each token requires reading the model’s weights from memory, so smaller weights mean less data to move, and memory bandwidth is often the bottleneck during inference. This can lead to a significant increase in inference speed on both the CPU and especially the GPU. Thus, quantization both frees up VRAM and can boost performance.

It’s important to strike a balance between quantization levels. Lower bit widths (e.g., Q2 or Q3) provide greater memory savings and often higher speeds, but can lead to more noticeable losses in model accuracy. Higher bit widths (e.g., Q6 or Q8) offer accuracy closer to the original model but use more VRAM and can be slower.

As a general rule, if you want to run a 70B model on your local machine, you will need 4-bit quantization (Q4) or something more aggressive. For 30B models, Q5 or Q6 can offer a good balance. 13B and 7B models can be run with Q8 or even FP16 (if sufficient VRAM is available). It is recommended to experiment with different models and levels to determine which quantization level is most suitable for you.
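The rule of thumb above can be turned into a small helper that picks the highest-precision level whose weights fit in a given amount of VRAM. This is only a sketch: it uses nominal bit widths, and the headroom_gb allowance for context and runtime buffers is a hypothetical value of mine. Since real quantized files are somewhat larger than the nominal size, treat the answer as optimistic.

```python
from typing import Optional

# Nominal bits per weight, ordered from highest to lowest precision.
QUANT_BITS = {"FP16": 16, "Q8": 8, "Q6": 6, "Q5": 5, "Q4": 4, "Q3": 3, "Q2": 2}


def best_quant_for(params_billion: float, vram_gb: float,
                   headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-precision level whose weights fit in VRAM, or None.

    headroom_gb is an assumed allowance for context cache and runtime buffers.
    """
    budget_gb = vram_gb - headroom_gb
    for name, bits in QUANT_BITS.items():  # dicts keep insertion order
        if params_billion * bits / 8 <= budget_gb:
            return name
    return None


for params, vram in [(7, 24), (13, 16), (70, 32), (70, 48)]:
    print(f"{params}B on {vram}GB VRAM -> {best_quant_for(params, vram)}")
```

A result of None means even 2-bit weights do not fit, so the model would need to be partly offloaded to system RAM or split across several GPUs.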

Real-World Hardware Options and Pricing

When choosing hardware for running local LLMs, you need to consider your budget and performance goals. Here are some popular options available on the market and their estimated price ranges:

  • Entry Level (8GB - 12GB VRAM):

    • GPUs: NVIDIA GeForce RTX 3060 (12GB), RTX 4060 Ti (8GB/16GB).
    • Estimated Cost: ~$300-500 for GPU. Total system ~$700-1000.
    • Use Case: Small LLMs (7B Q4/Q5), simple tasks.
  • Mid-Range (16GB - 24GB VRAM):

    • GPUs: NVIDIA GeForce RTX 3090 (24GB), RTX 4070 Ti SUPER (16GB), RTX 4080 SUPER (16GB), RTX 4090 (24GB).
    • Estimated Cost: ~$700-1600 for GPU. Total system ~$1500-2500.
    • Use Case: Mid-sized LLMs (13B Q4/Q5, 30B Q4), faster inference.
  • High-End (32GB - 48GB VRAM):

    • GPUs: NVIDIA RTX 4090 (24GB) (multiple), NVIDIA RTX A5000 (24GB), NVIDIA RTX A6000 (48GB).
    • Estimated Cost: ~$1500-4500 per card. Total system ~$3000-6000+.
    • Use Case: Large LLMs (70B Q4), fine-tuning, professional use.

These prices may vary based on market conditions and supply. The rest of the system matters as well, not just the GPU. Sufficient CPU power (e.g., Intel Core i7/i9 or AMD Ryzen 7/9 series), fast NVMe SSD storage (at least 1TB recommended), and ample system RAM (32GB minimum, 64GB+ ideal) will greatly affect the overall experience.

When choosing hardware, it’s also beneficial to check which GPUs and technologies are supported by the LLM frameworks you plan to use (e.g., Ollama, LM Studio, KoboldAI, Text Generation WebUI).

Speed, VRAM, and Quantization: The Anatomy of Performance

The primary factors affecting performance in local LLMs are VRAM amount, quantization level, and the hardware’s processing power. These three are tightly interconnected, and an improvement in one can affect the others.

VRAM Amount: This is the most decisive factor. If the entire model fits into VRAM, every layer runs on the GPU and inference is at its fastest. If the model doesn’t fit, the remainder has to live in system RAM or even on disk, which dramatically reduces performance. Having enough VRAM also lets you use less aggressive quantization, which preserves model accuracy.
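When a model does not fit, runtimes typically keep as many layers as possible on the GPU and run the rest on the CPU. Here is a rough sketch of that split, assuming all layers are the same size and reserving some VRAM for context and buffers. The function name, the reserve_gb default, and the 80-layer figure in the example are illustrative assumptions.

```python
def gpu_cpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu), assuming equally sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, int(usable_gb // per_layer_gb))
    return on_gpu, n_layers - on_gpu


# A 40GB quantized model with 80 layers on a 32GB card versus a 48GB card.
print(gpu_cpu_layer_split(40, 80, 32))  # (60, 20)
print(gpu_cpu_layer_split(40, 80, 48))  # (80, 0)
```

In the 32GB case a quarter of the layers end up on the CPU, and those layers set the pace for every token, which is the performance drop described above.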

Quantization Level: Reduces VRAM requirements and generally increases inference speed. However, very low bit widths can lead to accuracy loss. Choosing the right quantization level is about balancing VRAM and accuracy. For example, if you want to run a 70B model with 32GB of VRAM, even 4-bit weights (about 35-40GB) do not fit entirely, so you need 3-bit quantization or partial offloading to system RAM. If you have 48GB of VRAM, 4-bit fits with room to spare for context, while Q5 is borderline and Q6 does not fit.

Processing Power (GPU/CPU): Once the model is loaded into memory, the actual computation is performed by the GPU. The GPU’s core count, clock speed, and memory bandwidth directly affect inference speed. The CPU is important for loading the model into VRAM, data preprocessing, and background tasks. A high-performance GPU, combined with sufficient VRAM and correct quantization, allows you to achieve the fastest inference times.
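Memory bandwidth gives a quick way to sanity-check speed expectations. As a rough upper bound, assume each generated token requires reading all of the weights once, and ignore compute time, context cache reads, and batching. The function name and the bandwidth figures below are illustrative assumptions; check your own card’s spec sheet for the real number.

```python
def max_tokens_per_second(memory_bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream generation speed when weight reads dominate."""
    return memory_bandwidth_gb_s / model_gb


# Illustrative numbers: a fast GPU (~1000 GB/s) versus dual-channel system RAM (~80 GB/s).
for name, bandwidth in [("GPU VRAM", 1000.0), ("system RAM", 80.0)]:
    for model_gb in (5.0, 40.0):
        rate = max_tokens_per_second(bandwidth, model_gb)
        print(f"{name}, {model_gb:.0f}GB model: at most ~{rate:.0f} tokens/sec")
```

Real throughput lands below this bound, but the ratio is informative: a 40GB model is limited to about 25 tokens/sec even on a 1000 GB/s card, and any layers served from system RAM are an order of magnitude slower.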

For example, let’s run the same 70B model on two different systems:

  1. System A: 36GB total VRAM, RTX 4090 (24GB) + RTX 3060 (12GB), with the model’s layers split across the two GPUs, plus a powerful CPU. Model: llama3:70b-instruct-q4_K_M.
  2. System B: 48GB VRAM, a single NVIDIA RTX A6000, powerful CPU. Model: llama3:70b-instruct-q4_K_M.

System B holds the whole model on one card with room left for context, so it will perform inference faster than System A. On System A, the model has to be split across two cards with very different memory bandwidth, and because 4-bit 70B weights take roughly 35-40GB, some layers may still spill into system RAM unless more aggressive quantization is used. The token generation speed (tokens/sec) will show significant differences between these two systems.
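For a mixed pair like the 24GB and 12GB cards in System A, something has to decide how many layers go on each GPU. Here is a minimal sketch of a proportional split by VRAM size. It is a simplification of my own: it ignores the memory each card needs for context and whether the whole model fits at all, and the function name is hypothetical.

```python
def proportional_layer_split(n_layers: int, vram_gb_per_gpu: list[float]) -> list[int]:
    """Assign layers to GPUs in proportion to each card's VRAM."""
    total_gb = sum(vram_gb_per_gpu)
    counts = [int(n_layers * vram / total_gb) for vram in vram_gb_per_gpu]
    # Hand out the layers lost to rounding down, largest card first.
    leftover = n_layers - sum(counts)
    largest_first = sorted(range(len(counts)),
                           key=lambda i: vram_gb_per_gpu[i], reverse=True)
    for i in largest_first[:leftover]:
        counts[i] += 1
    return counts


print(proportional_layer_split(80, [24, 12]))  # [54, 26]
```

Every token still has to pass through both cards in sequence, so the slower card holds back the faster one. That is one reason a single large card tends to beat a mismatched pair with similar total VRAM.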

In conclusion, your local LLM experience is largely limited by your hardware. By carefully evaluating your budget and goals, you need to choose the hardware combination that is most suitable for you. Remember that this field is rapidly evolving, and it’s likely we will see more efficient models and more affordable hardware in the future.


Mehmet Sarı

Solution Architect & IT Infrastructure Specialist (MSP)

I work across solution architecture, networking, server infrastructure, backup, storage, security, and MSP operations. On this blog, I share technical experience that holds up in the field.
