NVIDIA GPU Passthrough to Debian VM on Proxmox
TL;DR:
This guide documents the complete process of setting up NVIDIA GPU passthrough to a Debian VM on Proxmox, including detailed troubleshooting steps and real error resolution.
Overview
The steps below are based on actual implementation experience with an RTX 3050 GPU. They cover host verification, VM configuration, driver installation inside the VM, Secure Boot key enrollment, and the errors encountered along the way.
Initial Setup Verification (Proxmox)
1. Host System Requirements Check
First, verify your IOMMU setup:
dmesg | grep -i iommu
Expected output should show IOMMU enabled (the DMAR lines in this example are what Intel VT-d hosts log; AMD hosts log AMD-Vi lines instead):
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt
[ 0.537426] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt
[ 1.445510] DMAR-IR: IOAPIC id 8 under DRHD base 0xfbffc000 IOMMU 1
[ 1.445512] DMAR-IR: IOAPIC id 9 under DRHD base 0xfbffc000 IOMMU 1
[ 4.824723] iommu: Default domain type: Passthrough (set via kernel command line)
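If you don't see this, add the matching line to /etc/default/grub, then run update-grub and reboot the host (hosts that boot through systemd-boot use /etc/kernel/cmdline and proxmox-boot-tool refresh instead):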
# For AMD:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
# For Intel:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
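Once IOMMU is enabled, it helps to see which devices share a group with the GPU, since passthrough works per group. The following is a minimal sketch, assuming the standard sysfs layout under /sys/kernel/iommu_groups; the function name and its optional root argument are illustrative and not part of the original setup.

```shell
# List every PCI device together with its IOMMU group number.
# The optional first argument (a sysfs root) exists only to make the
# function easy to try against a test directory.
list_iommu_groups() (
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # nothing matched: IOMMU is probably off
        group=${dev%/devices/*}        # drop the trailing /devices/<address>
        group=${group##*/}             # keep only the group number
        echo "group $group: ${dev##*/}"
    done
)
```

Running list_iommu_groups on the host and looking for the 04:00.0 and 04:00.1 addresses shows whether the GPU and its audio function sit in a group of their own.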
2. GPU Identification and VFIO Setup
Check your GPU details:
lspci -nnk | grep -A 3 -i nvidia
Real output example:
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107 [GeForce RTX 3050 8GB] [10de:2582] (rev a1)
Subsystem: ASUSTeK Computer Inc. GA107 [GeForce RTX 3050 8GB] [1043:8890]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:2291] (rev a1)
Subsystem: ASUSTeK Computer Inc. Device [1043:8890]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
Critical info to note:
- GPU ID: [10de:2582]
- Audio ID: [10de:2291]
- Current driver: should be vfio-pci
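If the current driver is not yet vfio-pci, the IDs noted above are what the vfio-pci module needs. The sketch below extracts them from lspci -nn output; the helper name vfio_ids and the file name /etc/modprobe.d/vfio.conf are assumptions (the latter is a common convention, not something this setup prescribes).

```shell
# Read `lspci -nn` lines on stdin and print the [vendor:device] IDs
# as one comma-separated list, e.g. for the vfio-pci "ids=" option.
vfio_ids() {
    grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]' | paste -sd, -
}

# Possible use on the Proxmox host (left commented out because it changes
# system state; 10de is the NVIDIA vendor ID, so every NVIDIA device matches):
#   echo "options vfio-pci ids=$(lspci -nn -d 10de: | vfio_ids)" > /etc/modprobe.d/vfio.conf
#   update-initramfs -u -k all
```

On a host with more than one NVIDIA card, the list should be trimmed by hand to the devices that are actually passed through.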
VM Configuration
1. Base VM Setup
Essential VM configuration (/etc/pve/qemu-server/<VM Number>.conf):
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 36
cpu: host
efidisk0: secondary-repo:116/vm-116-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
machine: q35
memory: 32768
scsihw: virtio-scsi-single
2. GPU Passthrough Configuration
Add these lines:
hostpci0: 04:00.0,pcie=1,x-vga=1
hostpci1: 04:00.1,pcie=1
3. VNC Console Setup (Critical for UEFI access)
vga: qxl
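args: -vnc 0.0.0.0:0
Note that this opens an unauthenticated VNC server on port 5900 of every host interface, so restrict access with a firewall or bind it to 127.0.0.1 and tunnel over SSH.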
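Before booting the VM, a quick check that the config file contains the passthrough essentials can save a reboot cycle. This is a rough sketch: the function name and the three patterns it checks are illustrative choices drawn from the settings listed above, not an exhaustive validation.

```shell
# Report which passthrough-related settings are missing from a VM config.
# Exits non-zero if any pattern is absent.
check_passthrough_conf() (
    conf=$1
    status=0
    for pattern in '^bios: ovmf$' '^machine: .*q35' '^hostpci[0-9]+: .*pcie=1'; do
        if ! grep -Eq "$pattern" "$conf"; then
            echo "missing: $pattern"
            status=1
        fi
    done
    exit "$status"
)
```

It would be called with the path to the VM's file under /etc/pve/qemu-server/.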
Detailed Driver Installation Process
1. Repository Setup
Inside the Debian VM, edit /etc/apt/sources.list so that it includes the contrib, non-free, and non-free-firmware components:
deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
2. Initial Driver Installation
apt update
apt install linux-headers-$(uname -r)
apt install nvidia-driver firmware-misc-nonfree
3. Version Alignment (Critical)
Check available versions:
apt-cache policy firmware-nvidia-gsp
Example output:
firmware-nvidia-gsp:
Installed: 535.183.06-1~bpo12+1
Candidate: 535.183.06-1~bpo12+1
Version table:
*** 535.183.06-1~bpo12+1 100
100 http://deb.debian.org/debian bookworm-backports/non-free-firmware amd64 Packages
535.183.01-1~deb12u1 500
500 http://deb.debian.org/debian bookworm/non-free-firmware amd64 Packages
4. Version-Specific Installation
apt install nvidia-driver=535.183.01-1~deb12u1 firmware-nvidia-gsp=535.183.01-1~deb12u1
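Picking the version by eye from the policy output is error-prone when backports are enabled. The sketch below pulls the version offered by the plain release repository out of apt-cache policy output; the function name stable_version and its optional suite argument are hypothetical helpers, and the parsing assumes the usual version-table layout shown above.

```shell
# Read `apt-cache policy <package>` output on stdin and print the version
# that comes from the given suite (default: bookworm), ignoring
# suites such as bookworm-backports.
stable_version() {
    awk -v suite="${1:-bookworm}" '
        # A version line: optional "***", the version, then a priority.
        /^ *(\*\*\* )?[0-9][^ ]* [0-9]+$/ { ver = ($1 == "***") ? $2 : $1; next }
        # A source line for that version that names exactly "<suite>/".
        ver != "" && index($0, " " suite "/") { print ver; exit }
    '
}

# Possible use (commented out because it installs packages):
#   ver=$(apt-cache policy firmware-nvidia-gsp | stable_version)
#   apt install nvidia-driver="$ver" firmware-nvidia-gsp="$ver"
```

This assumes the driver and firmware packages carry the same version string, as they do in the example above.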
Secure Boot Configuration
1. Key Generation
openssl req -new -x509 -newkey rsa:2048 -keyout /root/MOK.priv -outform DER -out /root/MOK.der -nodes -days 36500 -subj "/CN=NVIDIA_KEY/"
chmod 600 /root/MOK.priv /root/MOK.der
2. Key Enrollment
mokutil --import /root/MOK.der
3. UEFI Key Management
Step-by-step enrollment process:
1. Reboot system
2. At MOK management screen, select "Enroll key from disk"
3. Navigate through filesystem:
- Select first PciRoot option
- Select "debian"
- Navigate to EFI directory
- Select the MOK key file
4. Verify key details match:
[Serial Number]
35:1F:EC:11:12:28:CF:7B:2E:94:B4:C4:25:D< REDACTED >
[Issuer]
CN=NVIDIA_KEY
[Subject]
CN=NVIDIA_KEY
5. Select "Continue" and "Yes" to enroll
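Enrolling the key only tells the firmware to trust it; the NVIDIA modules still have to carry a signature made with that key. If DKMS on your system already signs modules with its own key, this step may be unnecessary. Otherwise, a minimal sketch using the kernel's sign-file tool follows. The function name is hypothetical, and the default sign-file path is an assumption about where Debian's linux-kbuild package installs it; override it through SIGN_FILE if it lives elsewhere.

```shell
# Sign every nvidia*.ko under a module directory with the given key pair.
# Prints the path of each module that was signed successfully.
sign_nvidia_modules() (
    moddir=$1
    key=$2
    cert=$3
    # Assumed Debian location of sign-file; set SIGN_FILE to override.
    sign_file=${SIGN_FILE:-/usr/lib/linux-kbuild-$(uname -r | cut -d. -f1-2)/scripts/sign-file}
    find "$moddir" -name 'nvidia*.ko' \
        -exec "$sign_file" sha256 "$key" "$cert" {} \; -print
)

# Possible use inside the VM (commented out because it modifies modules):
#   sign_nvidia_modules "/lib/modules/$(uname -r)/updates/dkms" /root/MOK.priv /root/MOK.der
#   modinfo -F signer nvidia-current
```

After signing, modinfo -F signer should report the CN of the certificate used.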
Real-World Troubleshooting
Error 1: Key Rejection
modprobe: ERROR: could not insert 'nvidia_current': Key was rejected by service
Solution steps:
1. Verify key enrollment:
mokutil --list-enrolled
2. Check that the key is properly enrolled (the output should show two keys):
[key 1]
SHA1 Fingerprint: 53:61:0c:f8:1f:bd:7e:0c:eb:67:91:3c:9e:f3:e7:94:a9:63:3e:cb
[key 2]
SHA1 Fingerprint: 9b:89:1a:2e:13:e6:4c:69:3c:39:98:42:28:cb:b8:91:af:40:a2:50
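Comparing fingerprints by eye is tedious, and openssl and mokutil format them differently (upper versus lower case, different prefixes). The sketch below normalizes both and checks whether a certificate's SHA1 fingerprint appears among the enrolled keys. Both function names are hypothetical; the openssl and mokutil invocations are standard ones.

```shell
# Extract SHA1 fingerprints (20 colon-separated hex bytes) from stdin,
# lower-cased, one per line.
normalize_fp() {
    tr 'A-F' 'a-f' | grep -oE '([0-9a-f]{2}:){19}[0-9a-f]{2}'
}

# Succeed if the DER certificate given as $1 is among the enrolled MOKs.
mok_is_enrolled() {
    cert_fp=$(openssl x509 -inform DER -in "$1" -noout -fingerprint -sha1 | normalize_fp)
    [ -n "$cert_fp" ] && mokutil --list-enrolled | normalize_fp | grep -qx "$cert_fp"
}
```

For example, mok_is_enrolled /root/MOK.der would return success only when the key generated earlier is actually enrolled.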
Error 2: Version Mismatch
nvidia-kernel-dkms : Depends: firmware-nvidia-gsp (= 535.183.01) or
firmware-nvidia-gsp-535.183.01
Solution:
1. Remove all NVIDIA packages (quote the pattern so the shell does not expand it against files in the current directory):
apt remove --purge '*nvidia*'
apt autoremove
2. Install specific versions:
apt install nvidia-driver=535.183.01-1~deb12u1 firmware-nvidia-gsp=535.183.01-1~deb12u1
Error 3: Module Loading Failure
If modules fail to load after all steps, verify module presence:
find /lib/modules/$(uname -r) -name "nvidia*.ko"
Expected output:
/lib/modules/6.1.0-28-amd64/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-modeset.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-peermem.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-uvm.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-drm.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current.ko
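To turn the listing above into a quick pass/fail check, the following sketch reports which of the core DKMS-built modules are absent. The function name and the choice of four modules are illustrative (nvidia-current-peermem is left out because not every setup needs it).

```shell
# Print the name of each expected NVIDIA module that is missing from
# <module-dir>/updates/dkms. Prints nothing when all are present.
missing_nvidia_modules() (
    dir=$1
    for m in nvidia-current nvidia-current-modeset nvidia-current-drm nvidia-current-uvm; do
        [ -e "$dir/updates/dkms/$m.ko" ] || echo "$m"
    done
)

# Possible use inside the VM:
#   missing_nvidia_modules "/lib/modules/$(uname -r)"
```

If anything is reported, dkms status shows whether the build failed for the running kernel, and dkms autoinstall retries it.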
Verification
1. Check Driver Loading
dmesg | tail
Success indicators:
nvidia-nvlink: Nvlink Core is being initialized, major device number 243
nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.183.01
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
[drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
2. Final Verification
nvidia-smi
Successful output:
Thu Nov 28 17:01:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050        On  | 00000000:01:00.0 Off |                  N/A |
| 53%   49C    P8              N/A / 115W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
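For scripted checks, nvidia-smi can print single fields instead of the table. The sketch below verifies that every GPU reports the expected driver version; the function name driver_matches is hypothetical, while the --query-gpu and --format options are standard nvidia-smi flags.

```shell
# Read one driver version per line on stdin and succeed only if at least
# one line was seen and every line equals the expected version.
driver_matches() (
    expected=$1
    seen=0
    while IFS= read -r version; do
        seen=1
        if [ "$version" != "$expected" ]; then
            echo "mismatch: driver $version, expected $expected"
            exit 1
        fi
    done
    [ "$seen" -eq 1 ]
)

# Possible use inside the VM:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_matches 535.183.01
```

An empty input counts as a failure, which covers the case where nvidia-smi itself could not talk to the driver.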
Common Issues and Solutions
- No keyboard in UEFI: Enable VNC console in VM config
- Missing MOK screen: Use mokutil --reset to force the MOK management screen
- Wrong driver version: Always match driver and firmware versions exactly
- Module signing fails: Regenerate keys and verify proper enrollment
- GPU not detected: Check IOMMU groups and VFIO binding
Performance Optimization
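- CPU flags (add to VM config; these enable KVM paravirtualization features rather than pinning vCPUs to host cores, and only one args entry takes effect, so merge this with any existing args line):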
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,+kvm_asyncpf,+kvm_steal_time,+kvm_pv_tlb_flush'
- Memory settings:
memory: 32768
balloon: 0
- Hugepages (if needed; hugepages is a separate option that takes the page size in MiB: 2, 1024, or any):
memory: 32768
hugepages: 1024
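Hugepages have to be reserved on the Proxmox host before a VM can use them, and the count depends on the page size. The following small sketch does the arithmetic; the function name is hypothetical, and the commented commands use the standard vm.nr_hugepages sysctl and hugepage kernel parameters.

```shell
# Number of hugepages needed to back a VM's memory.
# $1 = VM memory in MiB, $2 = hugepage size in MiB (2 or 1024). Rounds up.
hugepages_needed() {
    mem_mib=$1
    page_mib=$2
    echo $(( (mem_mib + page_mib - 1) / page_mib ))
}

# Possible use on the host for 2 MiB pages (commented out; changes host state):
#   sysctl -w vm.nr_hugepages="$(hugepages_needed 32768 2)"
# For 1 GiB pages, the reservation usually goes on the kernel command line:
#   default_hugepagesz=1G hugepagesz=1G hugepages=<result of hugepages_needed 32768 1024>
```

For the 32768 MiB VM in this guide, that works out to 16384 pages of 2 MiB or 32 pages of 1 GiB.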
Monitoring and Maintenance
- Monitor GPU status:
watch -n 1 nvidia-smi
- Check driver logs:
journalctl -fu nvidia-persistenced
- Monitor VFIO events:
journalctl -k | grep -i vfio
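For unattended monitoring, the CSV query mode of nvidia-smi is easier to process than the table. Here is a rough sketch that flags GPUs above a temperature limit; the function name and the 80 degree threshold in the usage line are illustrative assumptions, not values from this setup.

```shell
# Read "name, temperature" CSV lines on stdin and print a warning for
# every GPU whose temperature exceeds the limit given as $1.
gpu_temp_alert() {
    awk -F', ' -v limit="$1" \
        '($2 + 0) > (limit + 0) { print "WARN: " $1 " at " $2 "C" }'
}

# Possible use inside the VM, polling every 5 seconds:
#   nvidia-smi --query-gpu=name,temperature.gpu --format=csv,noheader,nounits -l 5 | gpu_temp_alert 80
```

The same pattern works for other fields such as utilization.gpu or memory.used by changing the query list.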
Additional Resources
- Log files to check for issues:
/var/log/syslog
/var/log/dmesg
/var/log/Xorg.0.log
- Important commands for debugging:
lspci -vvv
dmesg | grep -i nvidia
ls -l /dev/nvidia*
This guide represents real-world implementation experience and common issues encountered during setup. Each section has been tested and verified with actual hardware.