by danduran on Development, AI Technology. 5 min read

NVIDIA GPU Passthrough to Debian VM on Proxmox

TL;DR:

This guide documents the complete process of setting up NVIDIA GPU passthrough to a Debian VM on Proxmox, including detailed troubleshooting steps and real error resolution.

Overview

This walkthrough is based on actual implementation experience with an RTX 3050 GPU, so the outputs and errors shown below come from a real setup.

Initial Setup Verification (Proxmox Host)

1. Host System Requirements Check

First, verify your IOMMU setup:

dmesg | grep -i iommu

The output should show IOMMU enabled (DMAR lines appear on Intel hosts, AMD-Vi lines on AMD hosts):

[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt
[    0.537426] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.12-2-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on iommu=pt
[    1.445510] DMAR-IR: IOAPIC id 8 under DRHD base  0xfbffc000 IOMMU 1
[    1.445512] DMAR-IR: IOAPIC id 9 under DRHD base  0xfbffc000 IOMMU 1
[    4.824723] iommu: Default domain type: Passthrough (set via kernel command line)

If you don't see this, add the line matching your CPU to /etc/default/grub, then run update-grub and reboot the host:

# For AMD:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
# For Intel:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
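
After the reboot, a quick way to confirm the flags reached the running kernel is to look at the kernel command line directly. The following is a minimal sketch; check_iommu_cmdline is a hypothetical helper name, and it only looks for the flags this guide sets, so the dmesg output above remains the authoritative check.

```shell
#!/bin/sh
# Sketch: report whether a kernel command line carries the IOMMU flags
# used in this guide. check_iommu_cmdline is a hypothetical helper name.
# Pass a command line as $1, or omit it to read /proc/cmdline.
check_iommu_cmdline() {
    cmdline=${1:-$(cat /proc/cmdline)}
    # Pad with spaces so each flag is matched as a whole word.
    case " $cmdline " in
        *" intel_iommu=on "*|*" amd_iommu=on "*) ;;
        *) echo "missing intel_iommu=on / amd_iommu=on"; return 1 ;;
    esac
    case " $cmdline " in
        *" iommu=pt "*) echo "IOMMU flags present (passthrough mode)" ;;
        *) echo "IOMMU enabled, but iommu=pt not set" ;;
    esac
}
```

Calling check_iommu_cmdline with no argument on the Proxmox host checks the live system; passing a string lets you test a planned GRUB line before rebooting.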

2. GPU Identification and VFIO Setup

Check your GPU details:

lspci -nnk | grep -A 3 -i nvidia

Real output example:

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA107 [GeForce RTX 3050 8GB] [10de:2582] (rev a1)
        Subsystem: ASUSTeK Computer Inc. GA107 [GeForce RTX 3050 8GB] [1043:8890]
        Kernel driver in use: vfio-pci
        Kernel modules: nvidiafb, nouveau
04:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:2291] (rev a1)
        Subsystem: ASUSTeK Computer Inc. Device [1043:8890]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

Critical info to note:
- GPU ID: [10de:2582]
- Audio ID: [10de:2291]
- Current driver: should be vfio-pci
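
If the driver in use is not yet vfio-pci, the two IDs noted above are what the vfio-pci module needs. The sketch below extracts them from lspci output; vfio_ids is a hypothetical helper name, and the file name /etc/modprobe.d/vfio.conf in the commented usage is a common convention rather than something this setup confirms.

```shell
#!/bin/sh
# Sketch: build the comma-separated vendor:device list for vfio-pci
# from `lspci -nn` (or `lspci -nnk`) output supplied on stdin.
vfio_ids() {
    # Keep only device lines (they start with a bus address) that mention NVIDIA,
    # then pull out the [vendor:device] pair and join the pairs with commas.
    grep -iE '^[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f] .*nvidia' |
        grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' |
        tr -d '[]' |
        paste -sd, -
}

# Intended use on the Proxmox host as root (left commented out on purpose):
#   echo "options vfio-pci ids=$(lspci -nn | vfio_ids)" > /etc/modprobe.d/vfio.conf
#   update-initramfs -u -k all    # then reboot and re-run lspci -nnk
```

For the card in this guide the helper prints 10de:2582,10de:2291, covering both the GPU and its audio function.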

VM Configuration

1. Base VM Setup

Essential VM configuration (/etc/pve/qemu-server/<VM Number>.conf):

bios: ovmf
boot: order=scsi0;ide2;net0
cores: 36
cpu: host
efidisk0: secondary-repo:116/vm-116-disk-0.qcow2,efitype=4m,pre-enrolled-keys=1,size=528K
machine: q35
memory: 32768
scsihw: virtio-scsi-single

2. GPU Passthrough Configuration

Add these lines:

hostpci0: 04:00.0,pcie=1,x-vga=1
hostpci1: 04:00.1,pcie=1

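3. VNC Console Setup (Critical for UEFI access)

The args line below opens VNC display :0 (TCP port 5900) on all host interfaces with no password. On an untrusted network, bind it to 127.0.0.1:0 and reach it through an SSH tunnel instead.

vga: qxl
args: -vnc 0.0.0.0:0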

Detailed Driver Installation Process (Debian VM)

1. Repository Setup

Edit /etc/apt/sources.list:

deb http://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware

2. Initial Driver Installation

apt update
apt install linux-headers-$(uname -r)
apt install nvidia-driver firmware-misc-nonfree

3. Version Alignment (Critical)

Check available versions:

apt-cache policy firmware-nvidia-gsp

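Example output (here the installed firmware is 535.183.06 from bookworm-backports, while the stable repository carries 535.183.01):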

firmware-nvidia-gsp:
  Installed: 535.183.06-1~bpo12+1
  Candidate: 535.183.06-1~bpo12+1
  Version table:
 *** 535.183.06-1~bpo12+1 100
        100 http://deb.debian.org/debian bookworm-backports/non-free-firmware amd64 Packages
     535.183.01-1~deb12u1 500
        500 http://deb.debian.org/debian bookworm/non-free-firmware amd64 Packages

4. Version-Specific Installation

apt install nvidia-driver=535.183.01-1~deb12u1 firmware-nvidia-gsp=535.183.01-1~deb12u1
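
To confirm the two packages ended up on the same upstream release, the installed versions can be compared with the Debian revision stripped off. This is a sketch; upstream_version and versions_aligned are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: compare the upstream part of two Debian package versions.
# upstream_version turns 535.183.01-1~deb12u1 into 535.183.01.
upstream_version() {
    v=${1#*:}                 # drop an epoch such as "1:" if present
    printf '%s\n' "${v%-*}"   # drop the trailing "-<debian revision>"
}

# Succeeds only when both versions share the same upstream release.
versions_aligned() {
    [ "$(upstream_version "$1")" = "$(upstream_version "$2")" ]
}

# Intended use inside the VM (left commented out on purpose):
#   drv=$(dpkg-query -W -f='${Version}' nvidia-driver)
#   fw=$(dpkg-query -W -f='${Version}' firmware-nvidia-gsp)
#   versions_aligned "$drv" "$fw" || echo "mismatch: driver $drv vs firmware $fw"
```

With the versions from the example output, 535.183.01-1~deb12u1 and 535.183.06-1~bpo12+1 are reported as a mismatch.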

Secure Boot Configuration

1. Key Generation

openssl req -new -x509 -newkey rsa:2048 -keyout /root/MOK.priv -outform DER -out /root/MOK.der -nodes -days 36500 -subj "/CN=NVIDIA_KEY/"
chmod 600 /root/MOK.priv /root/MOK.der

2. Key Enrollment

mokutil --import /root/MOK.der

3. UEFI Key Management

Step-by-step enrollment process:
1. Reboot system
2. At MOK management screen, select "Enroll key from disk"
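3. Navigate through filesystem (the MOK manager can only browse partitions the firmware can read, so MOK.der presumably has to be copied onto the EFI System Partition first):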
- Select first PciRoot option
- Select "debian"
- Navigate to EFI directory
- Select the MOK key file
4. Verify key details match:

[Serial Number]
35:1F:EC:11:12:28:CF:7B:2E:94:B4:C4:25:D< REDACTED >

[Issuer]
CN=NVIDIA_KEY

[Subject]
CN=NVIDIA_KEY
5. Select "Continue" and "Yes" to enroll
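
An enrolled key only helps once the NVIDIA modules are signed with it, and that step is not shown here. One possible way, sketched under assumptions, is the kernel's sign-file tool. print_sign_cmds is a hypothetical helper that only prints the commands for review, and the sign-file path in the usage note is an assumed location for Debian's kernel headers package.

```shell
#!/bin/sh
# Sketch: PRINT (not run) one sign-file command per NVIDIA module.
# $1 = directory holding the DKMS-built modules
# $2 = path to the kernel's sign-file tool (assumed, see usage note)
# The key and certificate paths are the ones from the Key Generation step.
print_sign_cmds() {
    dir=$1
    sign_file=$2
    find "$dir" -name 'nvidia*.ko' | sort | while IFS= read -r mod; do
        printf '%s sha256 /root/MOK.priv /root/MOK.der %s\n' "$sign_file" "$mod"
    done
}

# Intended use inside the VM as root (left commented out on purpose):
#   print_sign_cmds "/lib/modules/$(uname -r)/updates/dkms" \
#       "/usr/src/linux-headers-$(uname -r)/scripts/sign-file"
#   modinfo -F signer nvidia-current    # afterwards, should name the key's CN
```

Review the printed lines, then run them. If modinfo already reports a signer, DKMS may have signed the modules with a different key, and that key is the one that needs to be enrolled.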

Real-World Troubleshooting

Error 1: Key Rejection

modprobe: ERROR: could not insert 'nvidia_current': Key was rejected by service

Solution steps:
1. Verify key enrollment:

mokutil --list-enrolled
2. Check that the key is properly enrolled (the output should show two keys):
[key 1]
SHA1 Fingerprint: 53:61:0c:f8:1f:bd:7e:0c:eb:67:91:3c:9e:f3:e7:94:a9:63:3e:cb
[key 2]
SHA1 Fingerprint: 9b:89:1a:2e:13:e6:4c:69:3c:39:98:42:28:cb:b8:91:af:40:a2:50

Error 2: Version Mismatch

nvidia-kernel-dkms : Depends: firmware-nvidia-gsp (= 535.183.01) or
                             firmware-nvidia-gsp-535.183.01

Solution:
1. Remove all NVIDIA packages:

apt remove --purge '*nvidia*'
apt autoremove
2. Install specific versions:
apt install nvidia-driver=535.183.01-1~deb12u1 firmware-nvidia-gsp=535.183.01-1~deb12u1

Error 3: Module Loading Failure

If modules fail to load after all steps, verify module presence:

find /lib/modules/$(uname -r) -name "nvidia*.ko"

Expected output:

/lib/modules/6.1.0-28-amd64/kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-modeset.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-peermem.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-uvm.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current-drm.ko
/lib/modules/6.1.0-28-amd64/updates/dkms/nvidia-current.ko
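
Rather than comparing the list by eye, the find output can be checked for the four DKMS-built modules the driver relies on. This is a sketch; missing_nvidia_modules is a hypothetical helper name, and the module list is taken from the expected output above.

```shell
#!/bin/sh
# Sketch: read module paths on stdin and print the name of every expected
# NVIDIA module that is absent. No output means all four were found.
missing_nvidia_modules() {
    paths=$(cat)
    for m in nvidia-current nvidia-current-modeset nvidia-current-drm nvidia-current-uvm; do
        # Match the exact file name at the end of a path.
        printf '%s\n' "$paths" | grep -q "/$m\.ko\$" || echo "$m"
    done
}

# Intended use inside the VM (left commented out on purpose):
#   find "/lib/modules/$(uname -r)" -name 'nvidia*.ko' | missing_nvidia_modules
```

If any names are printed, dkms status shows whether the module was built for the running kernel at all.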

Verification

1. Check Driver Loading

dmesg | tail

Success indicators:

nvidia-nvlink: Nvlink Core is being initialized, major device number 243
nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.183.01
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
[drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1

2. Final Verification

nvidia-smi

Successful output:

Thu Nov 28 17:01:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050        On  | 00000000:01:00.0 Off |                  N/A |
| 53%   49C    P8              N/A / 115W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
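
For scripted checks, nvidia-smi can also emit the same facts as CSV through its --query-gpu and --format options. The sketch below validates the driver version from that output; check_driver_version is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read "name, driver_version, memory.total" CSV lines on stdin,
# print a one-line summary per GPU, and fail if any GPU reports a driver
# version other than the expected one given as $1.
check_driver_version() {
    awk -F', ' -v want="$1" '
        { printf "%s: driver %s, %s\n", $1, $2, $3; if ($2 != want) bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}

# Intended use inside the VM (left commented out on purpose):
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader \
#       | check_driver_version 535.183.01
```

The non-zero exit status on a mismatch makes this easy to drop into a post-upgrade check.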

Common Issues and Solutions

  1. No keyboard in UEFI: Enable VNC console in VM config
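  2. Missing MOK screen: Use mokutil --reset to force the MOK management screen (note that it requests deletion of all enrolled MOKs; re-running mokutil --import also queues the screen for the next boot)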
  3. Wrong driver version: Always match driver and firmware versions exactly
  4. Module signing fails: Regenerate keys and verify proper enrollment
  5. GPU not detected: Check IOMMU groups and VFIO binding
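
For the IOMMU group check in the last item, the groups can be read from sysfs on the Proxmox host. This is a sketch; list_iommu_groups is a hypothetical helper name, and the optional argument exists only so the function can be pointed at a different directory tree.

```shell
#!/bin/sh
# Sketch: print every PCI device together with its IOMMU group number.
# $1 = sysfs root to scan (defaults to the real /sys/kernel/iommu_groups).
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue      # glob matched nothing: IOMMU likely off
        group=${dev#"$root"/}          # strip the root prefix
        group=${group%%/*}             # keep only the group number
        printf 'group %s: %s\n' "$group" "${dev##*/}"
    done
}

# Intended use on the Proxmox host (left commented out on purpose):
#   list_iommu_groups | grep -E '04:00\.[01]'
```

Ideally the GPU and its audio function share a group with no unrelated devices. An empty listing usually means IOMMU is not active.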

Performance Optimization

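  1. CPU flags (add to VM config). These enable KVM paravirtualization features rather than pinning vCPUs to host cores. A VM config holds a single args: line, so merge this with the -vnc option if both are in use:
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,+kvm_asyncpf,+kvm_steal_time,+kvm_pv_tlb_flush'
  2. Memory settings:
memory: 32768
balloon: 0
  3. Hugepages (if needed). This is a separate option whose value is the page size in MiB, either 2 or 1024, and it requires NUMA to be enabled for the VM:
memory: 32768
hugepages: 1024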

Monitoring and Maintenance

  1. Monitor GPU status (in the VM):
watch -n 1 nvidia-smi
  2. Check driver logs (in the VM):
journalctl -fu nvidia-persistenced
  3. Monitor VFIO events (on the Proxmox host):
journalctl -k | grep -i vfio

Additional Resources

  1. Log files to check for issues:
- /var/log/syslog
- /var/log/dmesg
- /var/log/Xorg.0.log

  2. Important commands for debugging:

lspci -vvv
dmesg | grep -i nvidia
ls -l /dev/nvidia*

This guide represents real-world implementation experience and common issues encountered during setup. Each section has been tested and verified with actual hardware.
