A Record of the Troubles Encountered During the Installation of the H20 Graphics Card Driver

February 23, 2025

Background:

After the Spring Festival holiday, work has mostly revolved around local deployments of the various DeepSeek-R1 variants, busier than in previous years. I have also been providing light support for ComfyUI and Janus_pro; as before, plenty of GPU memory and plenty of virtual memory are what really matter. Fortunately the experience gained over the holiday meant things went fairly smoothly overall. One leftover issue kept nagging at me, though: the GPUs on one physical node are H20s, but the driver previously installed through gpu-operator was 535-5.15.0-88, and the cards showed up as NVIDIA-Graphics-Device rather than H20. Applications still ran, but it never felt right. I figured an hour over the weekend would be enough to update the driver; instead I fell into a very deep pit.

Process:

The Problem Appears

This post mainly records the errors hit along the way and how they were resolved.

Round one: I went straight for installing the latest CUDA Toolkit.

The latest at the time, from https://developer.nvidia.com/cuda-toolkit-archive:
cuda_12.6.0_560.28.03_linux.run
The installer hit the classic warning:
WARNING: An NVIDIA kernel module 'nvidia-uvm' appears to be already loaded in your kernel.
Fixing this was straightforward: cordon the node, then stop containerd and kubelet.
Trying rmmod nvidia-uvm reported that the module was still held by a long list of processes,
so I force-killed the holders over a few rounds:
lsof -w /dev/nvidia* |awk '{print $2}'|xargs kill -9
After that there were still the remaining nvidia kernel modules and the like to unload; the whole cleanup sequence is sketched below.
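For reference, the whole cleanup looks roughly like this as a sketch (the node name is a placeholder, and the exact set of loaded nvidia modules varies from machine to machine):
kubectl cordon <node-name>
systemctl stop kubelet containerd
# kill whatever still holds the device files, then unload the modules in dependency order
lsof -w /dev/nvidia* | awk 'NR>1 {print $2}' | sort -u | xargs -r kill -9
rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia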
With the modules gone, I went back to installing cuda_12.6.0_560.28.03_linux.run. The install completed, and the cards were now identified correctly as H20:
nvidia-smi
Sat Feb 22 12:00:29 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H20                     Off |   00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0            113W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H20                     Off |   00000000:34:00.0 Off |                    0 |
| N/A   29C    P0            111W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H20                     Off |   00000000:48:00.0 Off |                    0 |
| N/A   32C    P0            116W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H20                     Off |   00000000:5A:00.0 Off |                    0 |
| N/A   30C    P0            111W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H20                     Off |   00000000:87:00.0 Off |                    0 |
| N/A   31C    P0            117W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H20                     Off |   00000000:AE:00.0 Off |                    0 |
| N/A   28C    P0            111W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H20                     Off |   00000000:C2:00.0 Off |                    0 |
| N/A   31C    P0            114W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H20                     Off |   00000000:D7:00.0 Off |                    0 |
| N/A   29C    P0            116W /  500W |       1MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
 
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Then the real problem showed up: the nvidia-cuda-validator-** pod under gpu-operator kept restarting.

The logs looked something like this:
Failed to allocate device vector A (error code system not yet initialized)!
[Vector addition of 50000 elements]
After that I tried several other CUDA toolkit releases one by one, but the pod kept failing. At that point I pulled in a colleague to troubleshoot together.
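To see what the validator was doing, something along these lines works (the namespace here is assumed to be the default gpu-operator one; substitute the real pod name from the get output):
kubectl -n gpu-operator get pods | grep nvidia-cuda-validator
kubectl -n gpu-operator logs nvidia-cuda-validator-xxxxx --all-containers
kubectl -n gpu-operator describe pod nvidia-cuda-validator-xxxxx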

Digging Deeper

We first checked compatibility against the official CUDA Toolkit release notes and found that the CUDA Toolkit version in use does not fully support the H20:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
We then compared the GPU architecture lists of several CUDA versions:
/usr/local/cuda-12.8/bin/nvcc --list-gpu-arch
/usr/local/cuda-12.6/bin/nvcc --list-gpu-arch
/usr/local/cuda-12.4/bin/nvcc --list-gpu-arch

Apart from 12.8, which adds compute_100 and above (compute_100, compute_101, compute_102), the other versions only list architectures up to compute_90. I took this to mean that the older nvcc releases do not support the H20 card.
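As a cross-check from the card side, newer nvidia-smi builds can report the compute capability directly (assuming the installed nvidia-smi supports the compute_cap query field):
nvidia-smi --query-gpu=name,compute_cap --format=csv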
Next we checked the drivers at https://www.nvidia.com/en-us/drivers/.
Under Recommended/Certified, only four driver versions are certified for the H20:
  • Driver Version: 550.144.03, CUDA Toolkit: 12.4, Release Date: Thu Jan 16, 2025
  • Driver Version: 535.230.02, CUDA Toolkit: 12.2, Release Date: Thu Jan 16, 2025
  • Driver Version: 550.127.08, CUDA Toolkit: 12.4, Release Date: Tue Nov 19, 2024
  • Driver Version: 535.216.03, CUDA Toolkit: 12.2, Release Date: Tue Nov 19, 2024
At this point I realized just how deep the pit was: the driver versions tried earlier were probably never right for the H20 in the first place.

Verification and Resolution

To verify this further, my colleague suggested compiling and running the vectorAdd sample from https://github.com/NVIDIA/cuda-samples directly on the node. We tried the 12.4 and 12.8 sample releases in turn, and both failed.
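Building the sample itself is quick; a sketch for the 12.4 samples tree (paths assumed here, and note that newer sample releases have moved to a CMake build):
cd cuda-samples-12.4/Samples/0_Introduction/vectorAdd
make CUDA_PATH=/usr/local/cuda-12.4
./vectorAdd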
The errors looked roughly like this:
~/124/cuda-samples-12.4/bin/x86_64/linux/release
./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
or: Failed to allocate device vector A (error code system not yet initialized)!

Normal output should look like this:
./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
This further confirmed that the problem was a mismatch between the driver version and the CUDA runtime version, and had nothing to do with the k8s environment.
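A quick way to put both sides of that mismatch next to each other, assuming the toolkit lives under /usr/local/cuda:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
/usr/local/cuda/bin/nvcc --version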
We also noticed the following errors in the kernel log:
[Sat Feb 22 15:25:54 2025] NVRM: _knvlinkCheckFabricCliqueId: GPU 0 failed to get fabric clique Id: 0x55
[Sat Feb 22 15:25:54 2025] NVRM: _knvlinkCheckFabricCliqueId: GPU 1 failed to get fabric clique Id: 0x55

A colleague at the data center pointed out that this happens when NVIDIA-fabricmanager is not installed.
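On an Ubuntu node, whether the fabric manager is installed and running can be checked roughly like this (package and service names as used later in this post):
dpkg -l 'nvidia-fabricmanager*'
systemctl status nvidia-fabricmanager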

Solution:

With these clues, the way forward became clear. We first uninstalled everything installed earlier, following the officially recommended method:

/usr/local/cuda-X.Y/bin/cuda-uninstaller
It turns out all of the installed components can be selected and removed in one pass:
CUDA Uninstaller                                                               │
│   [X] CUDA_Toolkit_12.5                                                      │
│   [X] CUDA_Demo_Suite_12.6                                                   │
│   [X] CUDA_Toolkit_12.4                                                      │
│   [X] CUDA_Demo_Suite_12.5                                                   │
│   [X] CUDA_Documentation_12.5                                                │
│   [X] CUDA_Demo_Suite_12.8                                                   │
│   [X] CUDA_Demo_Suite_12.4                                                   │
│   [X] CUDA_Documentation_12.8                                                │
│   [X] CUDA_Documentation_12.6                                                │
│   [X] CUDA_Documentation_12.4                                                │
│   [X] CUDA_Toolkit_12.6                                                      │
│   [X] CUDA_Toolkit_12.8
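For a driver that was itself installed from a .run file, the matching uninstaller can be used as well; a sketch, assuming a runfile-installed driver:
nvidia-uninstall   # shipped by the .run driver installer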

Then, based on the NVIDIA driver download page and the CUDA Toolkit documentation, we pinned down which CUDA and driver versions actually support the H20. In the end we chose cuda_12.4.1_550.54.15_linux.run, paired with the certified NVIDIA-Linux-x86_64-550.144.03.run driver.
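The installs themselves, as a sketch (runfiles assumed to be in the current directory; the flags are the usual runfile options rather than a verbatim record):
sh NVIDIA-Linux-x86_64-550.144.03.run --silent --dkms
sh cuda_12.4.1_550.54.15_linux.run --silent --toolkit   # toolkit only, keeping the separately installed 550.144.03 driver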

After the installation we also set up nvidia-fabricmanager and nvidia-compute-utils; their versions must match the driver.

apt install nvidia-fabricmanager-550 nvidia-fabricmanager-dev-550
apt install nvidia-compute-utils-550  -y
systemctl enable --now nvidia-fabricmanager   # this will fail to start if the versions do not match

https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
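To confirm the driver and fabric manager versions really line up, a rough check:
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
dpkg -l 'nvidia-fabricmanager*' | grep ^ii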

Finally, we started the nvidia-persistenced service temporarily: nvidia-persistenced --user root

or, alternatively: nvidia-smi -pm 1
To make this survive a reboot, the nvidia-persistenced.service unit still needs to be adjusted, but that can wait until next time.
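For when we do get to it, one possible route is the systemd unit shipped with the apt packages, if present:
systemctl cat nvidia-persistenced       # inspect ExecStart, e.g. which --user it runs as
systemctl enable --now nvidia-persistenced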
With all of that done, we ran vectorAdd again and the output was normal. After re-enabling scheduling on the node, the pods under gpu-operator came back to healthy as well. The problem was finally solved, and I could breathe again.
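Re-enabling scheduling and rechecking the operator pods was roughly (node name is a placeholder):
kubectl uncordon <node-name>
kubectl -n gpu-operator get pods -o wide | grep <node-name>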
The takeaway: installing GPU drivers is not as simple as it looks, and even the official bundles can hide pitfalls, so always cross-check compatibility from several angles.
