cd ../resources
$

Running a Local LLM on an RTX 3060

9 min read··LLMSelf-hostedSecurity

Local inference is useful anywhere shell commands may include sensitive infrastructure details. Sending every command, path, hostname, or log fragment to an external AI service may be unacceptable for a customer support session or production incident. Running the model locally gives teams a practical middle ground: AI assistance without handing terminal context to a third-party API.

For Warden, that pattern is especially relevant. Warden is built around secure terminal support: the host starts a local session, the guest joins in a browser, sensitive output can be masked, and risky commands can require approval. A local LLM can become another input to that approval workflow by classifying command intent before execution.

##Why run a local LLM?

  • Lower latency for interactive command review.
  • No per-token cost for frequent classification tasks.
  • Private operation for commands, paths, hostnames, and logs.
  • Better control over model version, prompts, and retention policy.
  • Offline or self-hosted operation for restricted environments.

The goal is not to replace deterministic policy. Regex rules, allowlists, and explicit approvals still matter. The useful role for an LLM is semantic review: recognizing that a command looks destructive, credential-seeking, or production-impacting even when it does not match an exact rule.

##Where Warden fits

Remote user enters a command, Warden routes it through local policy checks and optional LLM review, the host sees the risk signal, and the workflow allows, warns, blocks, or requests approval.

Commands such as rm -rf /, chmod -R 777 /, curl unknown-site | bash, or find / -name "*.pem" are not just strings. They carry operational intent. A local model can help explain that intent in plain language before the host approves the action.

##Hardware used

  • GPU: NVIDIA GeForce RTX 3060
  • VRAM: 12GB
  • CUDA compute capability: 8.6

The RTX 3060 is a useful entry point because 12GB of VRAM is enough for many quantized 7B to 9B models, CUDA acceleration works well with llama.cpp, and the hardware is widely available on the used market.

##Choose and download a model

The example setup used a Qwen 9B model from Hugging Face because Qwen models are strong at code understanding, shell command analysis, and infrastructure reasoning. The same general workflow applies to other compatible local models.

mkdir -p ~/local-llm
cd ~/local-llm
mkdir -p models

huggingface-cli download Qwen/Qwen3.5-9B \
  --local-dir models/Qwen3.5-9B

If the Hugging Face CLI is not installed, install or update it first:

pip install -U huggingface_hub

The downloaded directory in this setup included an FP16 GGUF file, a Q4_K_M quantized GGUF file, and a multimodal projector file. If your model only ships in Hugging Face format, convert it to GGUF before running it with llama.cpp.

##Build llama.cpp with CUDA

Build llama.cpp locally with CUDA enabled. This example assumes a llama.cpp checkout at ~/local-llm/llama.cpp. The original validation used llama.cpp version b9025, commit eff06702b.

cd ~/local-llm/llama.cpp

cmake -B build-gcc10 \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_C_COMPILER=gcc-10 \
  -DCMAKE_CXX_COMPILER=g++-10

cmake --build build-gcc10 \
  --target llama-cli llama-quantize llama-bench -j

For an RTX 3060, you can also build specifically for compute capability 8.6:

cmake -B build-gpu-sm86 \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=86 \
  -DCMAKE_C_COMPILER=gcc-10 \
  -DCMAKE_CXX_COMPILER=g++-10

cmake --build build-gpu-sm86 \
  --target llama-server -j

##Convert and quantize for 12GB VRAM

llama.cpp runs efficiently with GGUF files. If you downloaded a model in Hugging Face format, convert the base model first:

python convert_hf_to_gguf.py ../models/Qwen3.5-9B \
  --outfile ../models/Qwen3.5-9B/ggml-model-f16.gguf \
  --outtype f16

If your model includes a multimodal projector, export that too:

python convert_hf_to_gguf.py ../models/Qwen3.5-9B \
  --mmproj \
  --outfile ../models/Qwen3.5-9B/mmproj-model-f16.gguf \
  --outtype f16

The FP16 model in this setup was about 17GB, which is too large for comfortable RTX 3060 inference. Quantizing to Q4_K_M reduced it to about 5.3GB while keeping enough quality for command review and classification tasks.

./build-gcc10/bin/llama-quantize \
  ../models/Qwen3.5-9B/ggml-model-f16.gguf \
  ../models/Qwen3.5-9B/ggml-model-Q4_K_M.gguf \
  Q4_K_M

##Start the server

Once the model is converted and quantized, start a local llama.cpp server bound to localhost:

./build-gpu-sm86/bin/llama-server \
  -m ../models/Qwen3.5-9B/ggml-model-Q4_K_M.gguf \
  --mmproj ../models/Qwen3.5-9B/mmproj-model-f16.gguf \
  --n-gpu-layers 999 \
  --host 127.0.0.1 \
  --port 8080

At that point you have a private inference endpoint suitable for shell command analysis, incident triage helpers, runbook automation, or Warden policy experiments.

##Use it for command risk review

A local model can classify commands into categories such as safe, suspicious, destructive, credential-related, network exfiltration, privilege escalation, or production-impacting. The prompt should be short, repeatable, and biased toward concrete operational impact.

Analyze this shell command for operational or security risk:

sudo rm -rf /var/lib/docker

Explain:
- What it does
- Potential impact
- Risk level
- Whether admin approval is recommended

In a Warden-style workflow, the result should not be treated as magic authority. It is a risk signal that can be combined with explicit policy, session context, host approval, and audit logs.

##Troubleshooting CUDA mismatch

One validation issue from this setup was a CUDA runtime mismatch:

ggml_cuda_init: failed to initialize CUDA:
CUDA driver version is insufficient for CUDA runtime version

This usually means the CUDA runtime used to build llama.cpp is newer than the installed NVIDIA driver supports. The practical fixes are to update the NVIDIA driver, use a CUDA toolkit that matches the installed driver, or rebuild llama.cpp against compatible runtime versions.

##What this enables

A single RTX 3060 desktop is enough to make local AI useful for infrastructure tooling. Combined with Warden, it creates a practical architecture for AI-assisted remote shell access, command auditing, privileged session monitoring, and self-hosted approval workflows.

The important design choice is keeping the LLM close to the terminal boundary. Sensitive commands stay local, the host remains in control, and the model helps explain risk before the session crosses into dangerous territory.

connected
v0.4.0-beta