A colleague reported that an AIGC application failed when it tried to use the GPU on a node. The key part of the error was:
nvml error: driver/library version mismatch: unknown"
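The colleague added that everything worked fine after the driver was reinstalled on Monday, so why would the driver suddenly stop matching?
Troubleshooting approach
1. Locate the detailed logs
Based on the node name worker-node-2, look through the kubelet logs from around 11:13.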
failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: driver/library version mismatch: unknown"
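If kubelet runs as a systemd unit, the relevant lines can be pulled from the journal and narrowed down to NVML errors. This is only a minimal sketch: the unit name kubelet, the time window, and the helper name filter_nvml_errors are my assumptions, not something taken from the node.

```shell
# Keep only the lines that mention an NVML error (case-insensitive).
# Reads log text on stdin, so it works with journalctl output or a saved log file.
filter_nvml_errors() {
  grep -i 'nvml error'
}

# Example usage on the node (assumes a systemd unit named "kubelet";
# the time window is illustrative):
#   journalctl -u kubelet --since "2024-09-06 11:00" --until "2024-09-06 11:30" --no-pager | filter_nvml_errors
```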
2. Check the nvidia-smi output
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.107
This matches the error in the kubelet log.
3. Check the NVRM information
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
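Comparing the two outputs above,
NVML library version: 550.107
and NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.54.14
do not match, which is what produces the error below. It really is a driver mismatch: the user-space library is newer than the loaded kernel module.
Failed to initialize NVML: Driver/library version mismatch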
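The comparison can be scripted so it is easy to repeat on other nodes. The sketch below parses the two text formats shown in this post; the function names are hypothetical, and it assumes the proprietary-module wording of the NVRM line, so other driver flavours may need a different pattern.

```shell
# Extract the loaded kernel module version from /proc/driver/nvidia/version text (stdin).
kernel_module_version() {
  sed -n 's/^NVRM version:.*Kernel Module[[:space:]]*\([0-9][0-9.]*\).*/\1/p'
}

# Extract the user-space library version from the nvidia-smi error text (stdin).
nvml_library_version() {
  sed -n 's/^NVML library version:[[:space:]]*\([0-9][0-9.]*\).*/\1/p'
}

# Succeed when kernel version $1 agrees with library version $2.
# nvidia-smi may print a shortened version (550.107 rather than 550.107.02),
# so a dot-aligned prefix counts as a match.
versions_match() {
  case "$1" in
    "$2" | "$2".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage on the node (the library line is only present in the failure output,
# so an empty value is skipped):
#   k=$(kernel_module_version < /proc/driver/nvidia/version)
#   l=$(nvidia-smi 2>&1 | nvml_library_version)
#   [ -z "$l" ] || versions_match "$k" "$l" || echo "mismatch: kernel=$k library=$l"
```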
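4. Confirm the physical card information
lspci | grep -i nvidia
13:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
5. Confirm the card model
Take the device ID 26b9 from the output above.
Open this site and search for it to confirm:
https://admin.pci-ids.ucw.cz/read/PC/ The result gives the card model AD102GL [L40S].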
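The same lookup can also be done offline against the pci.ids database that ships with pciutils. This is a rough sketch under a few assumptions: the file is at /usr/share/misc/pci.ids (the usual place on Ubuntu, but worth checking), 10de is NVIDIA's PCI vendor ID, and the function name is made up for illustration.

```shell
# Print the device name for a vendor/device ID pair from a pci.ids-format file.
# Usage: lookup_pci_device <vendor_id> <device_id> <pci_ids_file>
lookup_pci_device() {
  awk -v v="$1" -v d="$2" '
    # Skip comments and blank lines.
    /^#/ || /^$/ { next }
    # Top-level line, e.g. "10de  NVIDIA Corporation": remember whether it is our vendor.
    substr($0, 1, 1) != "\t" { in_vendor = (index($0, v "  ") == 1); next }
    # Device line under our vendor: one tab, the 4-digit ID, two spaces, then the name.
    in_vendor && index($0, "\t" d "  ") == 1 { print substr($0, 8); exit }
  ' "$3"
}

# Example usage:
#   lookup_pci_device 10de 26b9 /usr/share/misc/pci.ids
```

If nothing is printed, the local database may simply be older than the card; `sudo update-pciids` refreshes it, after which lspci should show the model name instead of "Device 26b9".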
6. Find the correct driver
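Using the card model L40S, look it up on the NVIDIA website at https://www.nvidia.cn/drivers/lookup/ . This turns out to be a data center card, not a regular consumer card.
The driver package that matches the Kernel Module 550.54.14 version seen earlier is the one to download.
Fix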
1. Uninstall the existing driver before reinstalling
sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove 'nvidia-*'
sudo apt-get purge 'nvidia*'
sudo apt-get purge 'libnvidia*'
Repeat until the following command prints nothing, which means the uninstall succeeded.
dpkg --list | grep -i nvidia
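2. Download and install the new driver
chmod +x NVIDIA-Linux-x86_64-550.54.14.run
sudo ./NVIDIA-Linux-x86_64-550.54.14.run
Follow the installer prompts through to the end; the installation succeeds.
Check again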
nvidia-smi
The output now shows the correct card information.
nvidia-smi
Fri Sep  6 13:38:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:03:00.0 Off |                    0 |
| N/A   26C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:03:01.0 Off |                    0 |
| N/A   27C    P8             33W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
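I asked the colleague to test the application. The GPU workload runs normally, so the problem is solved.
Postscript
Back to the colleague's question: the driver was fine on Monday, so why did it break on Friday?
I checked the dpkg log (unfortunately it is incomplete), and it still shows traces of an automatic NVIDIA driver update early on Friday morning.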
cat /var/log/dpkg.log|grep nvidia
2024-09-06 06:30:55 upgrade libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1 550.107.02-0ubuntu0.22.04.1
2024-09-06 06:30:55 status half-configured libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1
2024-09-06 06:30:55 status unpacked libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1
2024-09-06 06:30:55 status half-installed libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1
2024-09-06 06:30:56 status unpacked libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1
2024-09-06 06:30:56 configure libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1 <none>
2024-09-06 06:30:56 status unpacked libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1
2024-09-06 06:30:56 status half-configured libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1
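The status lines are noisy; only the "upgrade" entries say which version replaced which. A small sketch that keeps just those for NVIDIA packages follows. The function name is hypothetical and the field positions are based on the log lines shown in this post.

```shell
# Print one line per NVIDIA-related package upgrade found in dpkg.log text (stdin).
# An "upgrade" line has the form: <date> <time> upgrade <package> <old-version> <new-version>
nvidia_upgrades() {
  awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $2, $4, $5, "->", $6 }'
}

# Example usage:
#   nvidia_upgrades < /var/log/dpkg.log
#   zcat /var/log/dpkg.log.*.gz | nvidia_upgrades   # rotated logs, if any exist
```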
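The apt logs confirm it: at 06:30 on Friday morning an unattended upgrade updated the driver library :( So this automatic update needs to be switched off.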
cat /var/log/apt/history.log
Start-Date: 2024-09-06 06:30:55
Commandline: /usr/bin/unattended-upgrade
Upgrade: libnvidia-compute-550:amd64 (550.90.07-0ubuntu0.22.04.1, 550.107.02-0ubuntu0.22.04.1)
End-Date: 2024-09-06 06:30:56
The same upgrade also shows up here:
cat /var/log/apt/term.log
Log started: 2024-09-06 06:30:55
(Reading database ... 139996 files and directories currently installed.)
Preparing to unpack .../libnvidia-compute-550_550.107.02-0ubuntu0.22.04.1_amd64.deb ...
Unpacking libnvidia-compute-550:amd64 (550.107.02-0ubuntu0.22.04.1) over (550.90.07-0ubuntu0.22.04.1) ...
Setting up libnvidia-compute-550:amd64 (550.107.02-0ubuntu0.22.04.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...
Log ended: 2024-09-06 06:30:56
Disable or control the updates
systemctl status unattended-upgrades
● unattended-upgrades.service - Unattended Upgrades Shutdown
     Loaded: loaded (/lib/systemd/system/unattended-upgrades.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-08-30 10:23:25 CST; 1 week 1 day ago
       Docs: man:unattended-upgrade(8)
   Main PID: 1391 (unattended-upgr)
      Tasks: 2 (limit: 154429)
     Memory: 10.8M
        CPU: 50ms
     CGroup: /system.slice/unattended-upgrades.service
             └─1391 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal

Aug 30 10:23:25 worker-node-2 systemd[1]: Started Unattended Upgrades Shutdown.
Disable it for now
systemctl disable --now unattended-upgrades
Synchronizing state of unattended-upgrades.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable unattended-upgrades
Removed /etc/systemd/system/multi-user.target.wants/unattended-upgrades.service.
The file that controls which packages get updated automatically is /etc/apt/apt.conf.d/50unattended-upgrades
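A narrower option than turning everything off is to blacklist the NVIDIA packages for unattended-upgrades. The sketch below only prints a config fragment. The drop-in file name in the usage comment is hypothetical, and I am assuming a separate drop-in is merged with the stock 50unattended-upgrades file; editing the Package-Blacklist section of that file directly is the more conservative route. Also, as far as I can tell, the unattended-upgrades.service unit shown above is only the shutdown helper and the scheduled run is started by apt-daily-upgrade.timer, so that is worth verifying on the node before relying on the disable alone.

```shell
# Print an apt.conf fragment that tells unattended-upgrades to skip NVIDIA packages.
# Each entry is a regular expression matched against the start of the package name.
print_nvidia_blacklist() {
  cat <<'EOF'
Unattended-Upgrade::Package-Blacklist {
    "nvidia-";
    "libnvidia-";
};
EOF
}

# Example usage (the file name 51nvidia-blacklist is hypothetical):
#   print_nvidia_blacklist | sudo tee /etc/apt/apt.conf.d/51nvidia-blacklist
#   sudo unattended-upgrade --dry-run --debug   # check what would still be upgraded
```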