A pseudo-Linux fan's blog


How to solve NVML error: driver/library version mismatch

September 7, 2024

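Background

A colleague reported that an AIGC application failed when it tried to use a node's GPU. The key part of the error was:

nvml error: driver/library version mismatch: unknown
The colleague added that the GPU had worked fine after the driver was reinstalled on Monday, so why would the driver suddenly be mismatched?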

Troubleshooting

1. Locate the detailed logs

The affected node is worker-node-2, so check its kubelet logs around 11:13:

failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: nvml error: driver/library version mismatch: unknown"
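
To pull only the NVML message out of a long runtime error like the one above, a small filter can help. This is a minimal sketch: the function name extract_nvml_error is made up for illustration, and the commented journalctl call assumes kubelet runs as a systemd unit named kubelet, which the post does not state.

```shell
# Hypothetical helper: read log lines on stdin and print only the
# "nvml error: ..." part of each line that contains one.
extract_nvml_error() {
  # -i: case-insensitive; -o: print only the matching part.
  # The match stops at the first double quote.
  grep -io 'nvml error: [^"]*'
}

# Example (assumes a systemd-managed kubelet; adjust the time window):
#   journalctl -u kubelet --since "11:00" --until "11:30" | extract_nvml_error
```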

2. Check the nvidia-smi output

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.107
This matches the error in the kubelet log above.

3. Check the NVRM information

cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.54.14 Thu Feb 22 01:44:30 UTC 2024
GCC version: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
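Comparing the two outputs, NVML library version 550.107 and NVRM version NVIDIA UNIX x86_64 Kernel Module 550.54.14 do not match: the user-space library is newer than the loaded kernel module.
That produces the error below, so this really is a driver mismatch.
Failed to initialize NVML: Driver/library version mismatch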
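
The comparison can also be scripted. The following is a sketch under stated assumptions, not an official check: the function names are hypothetical, and the patterns assume the exact output formats shown in this post (other driver builds may print the version line differently).

```shell
# Hypothetical helpers; the patterns assume the output formats shown in this post.

# Read the text of /proc/driver/nvidia/version on stdin,
# print the kernel module version.
kernel_module_version() {
  sed -nE 's/.*Kernel Module +([0-9]+(\.[0-9]+)+).*/\1/p'
}

# Read nvidia-smi's failure output on stdin, print the NVML library version.
nvml_library_version() {
  sed -nE 's/^NVML library version: *([0-9]+(\.[0-9]+)+).*/\1/p'
}

# Succeed if the kernel module version ($1) equals the library version ($2)
# or extends it. A prefix match is used because nvidia-smi printed the
# two-part version 550.107 above.
versions_match() {
  case "$1" in
    "$2" | "$2".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   kmod=$(kernel_module_version < /proc/driver/nvidia/version)
#   lib=$(nvidia-smi 2>&1 | nvml_library_version)
#   versions_match "$kmod" "$lib" || echo "mismatch: module $kmod, library $lib"
```

With the values from this incident, versions_match 550.54.14 550.107 fails, which is the mismatch reported by NVML.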

4. Confirm the physical card

lspci |grep -i nvidia
13:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)

5. Identify the card model

Take the device ID 26b9 from the lspci output above and look it up at
https://admin.pci-ids.ucw.cz/read/PC/ . The result gives the card model: AD102GL [L40S]
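
Extracting the device ID can be sketched as follows. The helper name nvidia_device_id is hypothetical, and the pattern assumes lspci prints the unnamed form "NVIDIA Corporation Device 26b9" seen above; when the local PCI ID database already knows the card, lspci prints the model name instead and this pattern matches nothing.

```shell
# Hypothetical helper: read lspci lines on stdin and print the 4-digit hex
# device ID from entries of the form "NVIDIA Corporation Device 26b9".
nvidia_device_id() {
  sed -nE 's/.*NVIDIA Corporation Device ([0-9a-fA-F]{4}).*/\1/p'
}

# Example:
#   lspci | nvidia_device_id
# lspci can also print numeric IDs directly; 10de is NVIDIA's PCI vendor ID:
#   lspci -nn -d 10de:
```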

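6. Find the correct driver

Searching for the L40S on NVIDIA's driver site, https://www.nvidia.cn/drivers/lookup/ , shows that this is a data center card, not a consumer card.
The driver that matches the loaded kernel module version 550.54.14 is:
https://www.nvidia.cn/drivers/details/221555/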

Fix

1. Uninstall the existing driver before reinstalling

sudo /usr/bin/nvidia-uninstall
sudo apt-get --purge remove 'nvidia-*'
sudo apt-get purge 'nvidia*'
sudo apt-get purge 'libnvidia*'
Repeat until the following command prints nothing, which means the uninstall succeeded:
dpkg --list | grep -i nvidia
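
The "prints nothing" check can be made a little stricter by matching on the package name column only. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: read `dpkg --list` output on stdin and print the status
# and name of every package whose name contains "nvidia".
remaining_nvidia_packages() {
  # Package rows start with a status such as "ii" (installed) or "rc"
  # (removed, config files remain); the header rows do not, so they are skipped.
  awk '$1 ~ /^[a-z][a-zA-Z]/ && $2 ~ /nvidia/ { print $1, $2 }'
}

# Example:
#   dpkg --list | remaining_nvidia_packages
```

Rows with status rc are packages that were removed but not purged, so they are still worth cleaning up before the reinstall.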

2. Download and install the new driver

wget -c https://cn.download.nvidia.cn/tesla/550.54.14/NVIDIA-Linux-x86_64-550.54.14.run
chmod +x NVIDIA-Linux-x86_64-550.54.14.run
sudo ./NVIDIA-Linux-x86_64-550.54.14.run
Step through the installer until it finishes.
Check the nvidia-smi output again; it now returns the correct card information.
nvidia-smi
Fri Sep  6 13:38:00 2024      
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:03:00.0 Off |                    0 |
| N/A   26C    P8             32W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:03:01.0 Off |                    0 |
| N/A   27C    P8             33W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

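I asked the colleague to test the application. GPU workloads ran normally, so the problem was solved.

Postscript

Back to the colleague's question: the driver was fine on Monday, so why did it break on Friday?

I checked the dpkg log (unfortunately incomplete), and it still shows traces of an automatic NVIDIA driver update early on Friday morning.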

cat /var/log/dpkg.log|grep nvidia
2024-09-06 06:30:55 upgrade libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1 550.107.02-0ubuntu0.22.04.1
2024-09-06 06:30:55 status half-configured libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1
2024-09-06 06:30:55 status unpacked libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1
2024-09-06 06:30:55 status half-installed libnvidia-compute-550:amd64 550.90.07-0ubuntu0.22.04.1
2024-09-06 06:30:56 status unpacked libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1
2024-09-06 06:30:56 configure libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1 <none>
2024-09-06 06:30:56 status unpacked libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1
2024-09-06 06:30:56 status half-configured libnvidia-compute-550:amd64 550.107.02-0ubuntu0.22.04.1
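
To reduce a dpkg log like this to the version changes only, the upgrade lines can be filtered. A minimal sketch, assuming the dpkg.log line layout shown above (date, time, action, package, old version, new version); the function name is hypothetical.

```shell
# Hypothetical helper: read dpkg.log lines on stdin and print one summary line
# per upgraded package whose name contains "nvidia".
nvidia_upgrades() {
  # Fields: $1 date, $2 time, $3 action, $4 package, $5 old version, $6 new version
  awk '$3 == "upgrade" && $4 ~ /nvidia/ { print $1, $2, $4, $5 " -> " $6 }'
}

# Example:
#   nvidia_upgrades < /var/log/dpkg.log
```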

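The system logs show that at 06:30 on Friday morning an unattended upgrade replaced the NVIDIA user-space library (libnvidia-compute-550) :( So this automatic update needs to be turned off.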

cat /var/log/apt/history.log
 
Start-Date: 2024-09-06  06:30:55
Commandline: /usr/bin/unattended-upgrade
Upgrade: libnvidia-compute-550:amd64 (550.90.07-0ubuntu0.22.04.1, 550.107.02-0ubuntu0.22.04.1)
End-Date: 2024-09-06  06:30:56
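
To see which command triggered each NVIDIA-related transaction, the apt history can be read one blank-line-separated entry at a time. This is a sketch with a hypothetical function name; it assumes entries in history.log are separated by empty lines, as in the excerpt above.

```shell
# Hypothetical helper: read apt's history.log on stdin and, for every entry
# that mentions "nvidia", print its Start-Date and Commandline lines.
nvidia_apt_transactions() {
  # RS = "" puts awk in paragraph mode: each blank-line-separated entry is one
  # record, and FS = "\n" makes every line of the entry a field.
  awk 'BEGIN { RS = ""; FS = "\n" }
       /nvidia/ {
         for (i = 1; i <= NF; i++)
           if ($i ~ /^(Start-Date|Commandline):/) print $i
       }'
}

# Example:
#   nvidia_apt_transactions < /var/log/apt/history.log
```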

The same upgrade shows up here:

cat /var/log/apt/term.log
Log started: 2024-09-06  06:30:55
(Reading database ... 139996 files and directories currently installed.)
Preparing to unpack .../libnvidia-compute-550_550.107.02-0ubuntu0.22.04.1_amd64.deb ...
Unpacking libnvidia-compute-550:amd64 (550.107.02-0ubuntu0.22.04.1) over (550.90.07-0ubuntu0.22.04.1) ...
Setting up libnvidia-compute-550:amd64 (550.107.02-0ubuntu0.22.04.1) ...
Processing triggers for libc-bin (2.35-0ubuntu3.8) ...
Log ended: 2024-09-06  06:30:56

Disable or control updates

systemctl status unattended-upgrades
● unattended-upgrades.service - Unattended Upgrades Shutdown
     Loaded: loaded (/lib/systemd/system/unattended-upgrades.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2024-08-30 10:23:25 CST; 1 week 1 day ago
       Docs: man:unattended-upgrade(8)
   Main PID: 1391 (unattended-upgr)
      Tasks: 2 (limit: 154429)
     Memory: 10.8M
        CPU: 50ms
     CGroup: /system.slice/unattended-upgrades.service
             └─1391 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
 
Aug 30 10:23:25 worker-node-2 systemd[1]: Started Unattended Upgrades Shutdown.

Disable it for now:

systemctl disable --now unattended-upgrades
Synchronizing state of unattended-upgrades.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install disable unattended-upgrades
Removed /etc/systemd/system/multi-user.target.wants/unattended-upgrades.service.

 

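The file that controls which packages are upgraded automatically is /etc/apt/apt.conf.d/50unattended-upgrades. Note that the unit disabled above describes itself as "Unattended Upgrades Shutdown"; on Ubuntu the periodic runs themselves are typically governed by APT::Periodic::Unattended-Upgrade in /etc/apt/apt.conf.d/20auto-upgrades, so check that setting as well.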
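
A narrower alternative to disabling all automatic updates is to hold only the NVIDIA packages with apt-mark hold, so that apt leaves them alone. The sketch below is an assumption about how one might do that, not something the post tested; the helper name and its name pattern are hypothetical. The Unattended-Upgrade::Package-Blacklist list in the file above is another place to exclude packages.

```shell
# Hypothetical helper: read package names on stdin (one per line) and print
# the ones that look like NVIDIA driver packages.
nvidia_package_names() {
  # "|| true" keeps the exit status zero when nothing matches.
  grep -E '^(lib)?nvidia-' || true
}

# Example (marks the matching installed packages as held):
#   dpkg-query -W -f='${Package}\n' | nvidia_package_names | xargs -r sudo apt-mark hold
# Review the result with:
#   apt-mark showhold
```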

 

Tags: nvidia nvml unattended-upgrades
Last updated: September 8, 2024

wanjie


COPYRIGHT © 2008-2025 wanjie.info. ALL RIGHTS RESERVED.
