biren110e with DeepSeek-R1-Distill

April 5, 2025

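Background:

As mentioned before, I briefly experimented with three kinds of domestic (Chinese-made) accelerator cards in March. This test of the Biren 110E was done at the same time; yesterday I added some simple stress-test data, so I am publishing it after all.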

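Preparation

DaoCloud k8s environment
8 or 4 Biren110E cards (32 GB version), with the driver already installed
1. Download the Biren image archive birensupa-vllm-25.02.07-C026S001T001B12997.tar, about 12 GB
2. Download the DeepSeek-R1-Distill-Llama-8B model files (about 15 GB) and the DeepSeek-R1-Distill-Qwen-32B model files (about 71 GB); either ModelScope or Hugging Face works
3. Test targets: 8B on a single card, 32B on 4 cards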
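
The single-card 8B and four-card 32B targets can be sanity-checked with rough arithmetic. The sketch below is only an estimate under stated assumptions: nominal parameter counts of 8e9 and 32e9, bfloat16 weights at 2 bytes per parameter, and an even split of weights across cards. It ignores KV cache and activations, so the real footprint is higher.

```python
BYTES_PER_PARAM_BF16 = 2   # bfloat16 stores each weight in 2 bytes
CARD_MEMORY_GB = 32        # Biren110E, 32 GB version


def weight_gb_per_card(num_params, num_cards=1):
    """Approximate weight memory per card in GB (1e9 bytes), assuming an even split."""
    return num_params * BYTES_PER_PARAM_BF16 / num_cards / 1e9


# Nominal parameter counts (assumptions, not exact model sizes).
for label, params, cards in [
    ("8B on 1 card", 8e9, 1),
    ("32B on 1 card", 32e9, 1),
    ("32B on 4 cards", 32e9, 4),
]:
    need = weight_gb_per_card(params, cards)
    verdict = "fits" if need < CARD_MEMORY_GB else "does not fit"
    print(f"{label}: ~{need:.0f} GB of weights per card ({verdict} in {CARD_MEMORY_GB} GB)")
```

Under these assumptions the 32B model cannot fit on one card, while splitting it over four cards brings the per-card weight load down to about the same level as the 8B model on a single card.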

Deployment

Script

Adjust the parameters in it according to the card type:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: deepseek-r1-8b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-8b
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: deepseek-r1-8b
    spec:
      volumes:
        - name: volume-1741680483375
          hostPath:
            path: /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B
            type: ''
        - name: volume-1741684397558
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      containers:
        - name: container-1
          image: birensupa-vllm:25.02.07-c026s001t001b12997
          command:
            - /bin/bash
            - '-c'
          args:
            - sleep infinity & wait
          env:
            - name: BRTB_DISABLE_ZERO_WS
              value: '1'
            - name: BRTB_DISABLE_ZERO_OUTPUT_UMA
              value: '1'
            - name: BRTB_DISABLE_ZERO_OUTPUT_NUMA
              value: '1'
            - name: BRTB_DISABLE_ZERO_REORDER
              value: '1'
            - name: BR_UMD_DEBUG_P2P_ACCESS_CHECK
              value: '1'
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: spawn
          resources:
            limits:
              birentech.com/gpu: '1'
          volumeMounts:
            - name: volume-1741680483375
              mountPath: /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B
            - name: volume-1741684397558
              mountPath: /dev/shm
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      affinity: {}
      schedulerName: default-scheduler
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
      dnsConfig: {}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

 

Testing

Simple stress test

Test script, run inside the container

python  /workspace/vllm_tools/vllm/benchmarks/benchmark_serving.py \
    --model DeepSeek-R1-Distill-Qwen-32B \
    --tokenizer /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B \
    --dataset_name random --random_input_len 4096 --random_output_len 1024   --num-prompts 64  \
    --trust-remote-code  --port 8000

 

First round of testing: the results were not ideal

============ Serving Benchmark Result ============
Successful requests:                     64
Benchmark duration (s):                  1044.75
Total input tokens:                      262144
Total generated tokens:                  49985
Request throughput (req/s):              0.06
Output token throughput (tok/s):         47.84
Total Token throughput (tok/s):          298.76
---------------Time to First Token----------------
Mean TTFT (ms):                          438997.17
Median TTFT (ms):                        419538.92
P99 TTFT (ms):                           921304.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          155.02
Median TPOT (ms):                        153.33
P99 TPOT (ms):                           226.29
---------------Inter-token Latency----------------
Mean ITL (ms):                           155.07
Median ITL (ms):                         124.17
P99 ITL (ms):                            255.33
==================================================
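
The throughput lines in this report follow directly from the raw counts and the duration. The short check below recomputes them from the numbers printed above; the variable names are my own, and the formulas are the obvious ratios rather than anything taken from the benchmark script's source.

```python
# Raw figures copied from the benchmark report above.
successful_requests = 64
duration_s = 1044.75
input_tokens = 262144
generated_tokens = 49985

request_throughput = successful_requests / duration_s
output_throughput = generated_tokens / duration_s
total_throughput = (input_tokens + generated_tokens) / duration_s
mean_output_len = generated_tokens / successful_requests

print(f"input tokens per request: {input_tokens // successful_requests}")
print(f"request throughput: {request_throughput:.2f} req/s")
print(f"output throughput:  {output_throughput:.2f} tok/s")
print(f"total throughput:   {total_throughput:.2f} tok/s")
print(f"mean output length: {mean_output_len:.0f} tokens per request")
```

The input side matches 64 prompts of 4096 tokens each, while the average generated length comes out below the requested 1024 output tokens.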

 

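Single-card test
Temporarily start the 8B model by hand (this can be tuned later and changed to start automatically). Note that the command below loads the DeepSeek-R1-Distill-Llama-8B weights but serves them under the name DeepSeek-R1-Distill-Qwen-32B, so that is the model name the requests use.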

......
Startup command, run inside the temporary container
python3 -m vllm.entrypoints.openai.api_server --served-model-name DeepSeek-R1-Distill-Qwen-32B  --model /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B  --gpu_memory_utilization=0.8 --block_size=128 --port 8000 --dtype=bfloat16 --tensor-parallel-size 1 --max-model-len 32768 --max-num-seqs 8 --max_num_batched_tokens 8192 --enable_chunked_prefill  --trust-remote-code --device=supa

 

Simple request

Request command

curl 'http://ip:8000/v1/chat/completions'   \
-H "Content-Type: application/json" \
-H "Authorization: Bearer hello" \
-d '{
    "model": "DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "你是谁,听说能边写诗,边做数学题,那么请顺便证明1+1=2"}],
    "temperature": 0.9
}'
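
For scripting the same request, here is a minimal sketch using only the Python standard library. It builds the same JSON body and headers as the curl command; `build_chat_request` and `send` are hypothetical helper names of my own, and the base URL passed in is a placeholder that has to point at the actual server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="hello", temperature=0.9):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and decode the JSON response (needs a reachable server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Placeholder address; replace with the real host and port.
req = build_chat_request("http://127.0.0.1:8000", "DeepSeek-R1-Distill-Qwen-32B", "hello")
print(req.get_method(), req.full_url)
```

Calling `send(req)` would return the parsed response dictionary; it is left out of the example run because it requires the server to be up.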

 

Response

{"id":"chat-38d9d3904fe1462396781e3d17c4f632","object":"chat.completion","created":1743581032,"model":"DeepSeek-R1-Distill-Qwen-32B","choices":[{"index":0,"message":{"role":"assistant","content":"嗯,用户问我“你是谁,听说能边写诗,边做数学题,那么请顺便证明1+1=2”。首先,我需要理解用户的需求。看起来用户对我的能力有些好奇,想看看我能不能同时进行诗歌创作和数学证明。所以,我应该展示出我能够在不同领域都有能力表现。\n\n接下来,我要考虑如何回应。用户可能不仅仅是想看到诗和数学证明,而是想看看我是否有广泛的知识和应用能力。因此,我应该先写一首简短的诗,然后再进行数学证明,这样结构清晰,容易理解。\n\n在写诗的时候,我需要选择一个主题,比如自然或者日常生活,这样容易引起共鸣。比如,风、树、叶子这些元素很常见,容易描绘。然后,我要确保诗的结构流畅,有节奏感,同时传达出一种宁静或美好的感觉。\n\n关于数学证明,用户特别要求证明1+1=2,这是一个基本的数学事实,但我需要用一种简洁明了的方式来展示,可能用代数的方法或者逻辑推理。这样不仅证明了结果,还展示了思考过程。\n\n最后,我要总结一下,说明我不仅能写诗,还能进行数学证明,展示我的多功能性。这样用户就能看到我在不同领域的能力,满足他们的好奇心。\n\n在整个思考过程中,我需要确保语言自然,避免过于正式或生硬,让用户感觉到我的亲和力和专业性。同时,保持回答的结构清晰,先展示诗,再进行证明,最后总结,这样逻辑清晰,用户容易理解和接受。\n</think>\n\n好的,我可以边写诗边做数学题。首先,我来写一首诗:\n\n**风的韵律**\n\n风在树梢游荡\n带走了所有的叹息\n叶子轻轻摇晃\n像是在跳一支古老的舞\n\n现在,我来证明1+1=2。\n\n**数学证明:**\n\n我们知道,在数学中,数字是由抽象符号代表的概念。1通常代表一个单位,另一个1同样代表一个单位。将它们相加意味着将一个单位加上另一个单位。\n\n1 + 1 = 2\n\n这是因为当你将一个对象与另一个对象结合时,总数增加了一个单位。因此,1 + 1 = 2。\n\n总结来说,我不仅能够写诗,还能够进行数学证明。希望这对你有所帮助!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":31,"total_tokens":608,"completion_tokens":577},"prompt_logprobs":null}
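
In the response above, the `content` field holds the model's reasoning first, then a `</think>` marker, then the final answer. A small sketch for separating the two and reading the token usage follows. The sample dictionary is a heavily abbreviated stand-in for the real response (only the `usage` numbers are copied from it), and `split_reasoning` is a hypothetical helper name.

```python
def split_reasoning(content, marker="</think>"):
    """Split message content into (reasoning, answer) at the first marker."""
    reasoning, sep, answer = content.partition(marker)
    if not sep:
        # No marker found: treat everything as the answer.
        return "", content.strip()
    return reasoning.strip(), answer.strip()


# Abbreviated stand-in for the response shown above; the text is shortened.
response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "(reasoning text)\n</think>\n\n(final answer)"}}
    ],
    "usage": {"prompt_tokens": 31, "total_tokens": 608, "completion_tokens": 577},
}

reasoning, answer = split_reasoning(response["choices"][0]["message"]["content"])
usage = response["usage"]
print("reasoning:", reasoning)
print("answer:", answer)
print("completion tokens:", usage["completion_tokens"])
```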

 

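During the test, I watched the card status output inside container 0.
With a single card running the 8B model, I saw roughly 10 tokens/s and GPU utilization of around 94%.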
Every 5.0s: brsmi                                                                             mgt1-10-10-2-12: Thu Apr  3 07:29:08 2025
 
Thu Apr  3 07:29:08 2025
+------------------------------------------------------------------------------+
| BR-SMI 1.6.6         Driver Version: 1.6.5       SUPA Version: N/A           |
+----------------------------+-----------------------+-------------------------+
|   GPU                 Name |                Bus-Id |    Volatile Uncorr. ECC |
|  Temp  Perf  Pwr:Usage/Cap |          Memory-Usage |    GPU-Util  Compute M. |
|              Persistence-M |                       |                  SVI M. |
+============================+=======================+=========================+
|     0            Biren110E |      00000000:2A:00.0 |                       0 |
|    51    P0      27W / 66W |   27032MiB / 32512MiB |         94%     Default |
|                        Off |                       |                Disabled |
+----------------------------+-----------------------+-------------------------+
|     1            Biren110E |      00000000:2B:00.0 |                       0 |
|    41    P0      13W / 66W |       0MiB / 32512MiB |          0%     Default |
|                        Off |                       |                Disabled |
+----------------------------+-----------------------+-------------------------+
|     2            Biren110E |      00000000:3D:00.0 |                       0 |
|    45    P0      17W / 66W |       0MiB / 32512MiB |          0%     Default |
|                        Off |                       |                Disabled |
+----------------------------+-----------------------+-------------------------+
|     3            Biren110E |      00000000:99:00.0 |                       0 |
|    38    P0      13W / 66W |       0MiB / 32512MiB |          0%     Default |
|                        Off |                       |                Disabled |
+----------------------------+-----------------------+-------------------------+
|     4            Biren110E |      00000000:9A:00.0 |                       0 |
Every 5.0s: brsmi                                                                             mgt1-10-10-2-12: Thu Apr  3 07:29:19 2025
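
To track these figures over time instead of reading them off the screen, the usage row for a card can be parsed with a regular expression. This is a sketch that assumes the row layout stays exactly as in the capture above; `parse_usage_row` is a hypothetical helper, not part of any Biren tooling.

```python
import re

# Usage row for GPU 0, copied from the brsmi capture above.
ROW = "|    51    P0      27W / 66W |   27032MiB / 32512MiB |         94%     Default |"


def parse_usage_row(row):
    """Return (used_mib, total_mib, util_percent), or None if the row does not match."""
    mem = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    util = re.search(r"(\d+)%", row)
    if not mem or not util:
        return None
    return int(mem.group(1)), int(mem.group(2)), int(util.group(1))


used, total, util = parse_usage_row(ROW)
print(f"memory: {used}/{total} MiB ({used / total:.1%}), utilization: {util}%")
```

The memory fraction works out to a little over 83%, which is close to the 0.8 passed as `--gpu_memory_utilization` in the startup command, although this post does not establish how the SUPA backend accounts for memory.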

Another stress-test script

Single request
nerdctl run --rm registry.cn-shanghai.aliyuncs.com/jamesxiong/model-performance:amd64-v0.1.2  python run.py \
--api_key "asd" \
--model_name "DeepSeek-R1-Distill-Qwen-32B" \
--base_url "http://10.10.2.12:32578/v1" \
--system_prompt "" \
--history '[{"role": "user", "content": "kubernetes是什么?"}]' \
--gen_conf '{"temperature": 0.01}' \
--num_requests 1 \
--print_answer "no" \
--stream "yes"

 

Test result

Output is roughly 12 tokens/s
[Index]: 0, Start Time: 2025-04-02 07:49:50, End Time: 2025-04-02 07:51:32, First Token Time: 3.29s, Elapsed Time: 101.91s, Think Tokens: 443, Answer Tokens: 862, Total Tokens: 1304, Tokens per second: 12.80

20 concurrent requests, single-card 8B

nerdctl run --rm registry.cn-shanghai.aliyuncs.com/jamesxiong/model-performance:amd64-v0.1.2  python run.py \
--api_key "asd" \
--model_name "DeepSeek-R1-Distill-Qwen-32B" \
--base_url "http://10.10.2.12:32578/v1" \
--system_prompt "" \
--history '[{"role": "user", "content": "kubernetes是什么?你希望上海还是北京,你能写诗么,请写一首4言绝句"}]' \
--gen_conf '{"temperature": 0.01}' \
--num_requests 20 \
--print_answer "no" \
--stream "yes"

 

Results of 20 concurrent requests, single-card 8B

Per-request rates vary widely under concurrency, from about 4 to about 12.8 tokens/s
[Index]: 3, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:52:57, First Token Time: 0.43s, Elapsed Time: 43.00s, Think Tokens: 380, Answer Tokens: 167, Total Tokens: 546, Tokens per second: 12.70
[Index]: 1, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:09, First Token Time: 0.44s, Elapsed Time: 55.11s, Think Tokens: 533, Answer Tokens: 172, Total Tokens: 704, Tokens per second: 12.77
[Index]: 7, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:15, First Token Time: 0.39s, Elapsed Time: 61.22s, Think Tokens: 563, Answer Tokens: 217, Total Tokens: 779, Tokens per second: 12.73
[Index]: 5, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:17, First Token Time: 0.41s, Elapsed Time: 63.03s, Think Tokens: 554, Answer Tokens: 247, Total Tokens: 800, Tokens per second: 12.69
[Index]: 4, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:18, First Token Time: 0.42s, Elapsed Time: 64.28s, Think Tokens: 565, Answer Tokens: 252, Total Tokens: 816, Tokens per second: 12.69
[Index]: 2, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:19, First Token Time: 0.44s, Elapsed Time: 64.99s, Think Tokens: 563, Answer Tokens: 270, Total Tokens: 832, Tokens per second: 12.80
[Index]: 6, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:19, First Token Time: 0.39s, Elapsed Time: 65.18s, Think Tokens: 560, Answer Tokens: 266, Total Tokens: 825, Tokens per second: 12.66
[Index]: 0, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:24, First Token Time: 0.23s, Elapsed Time: 70.85s, Think Tokens: 687, Answer Tokens: 202, Total Tokens: 888, Tokens per second: 12.53
[Index]: 9, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:01, First Token Time: 43.05s, Elapsed Time: 107.38s, Think Tokens: 479, Answer Tokens: 341, Total Tokens: 819, Tokens per second: 7.63
[Index]: 12, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:12, First Token Time: 64.98s, Elapsed Time: 118.46s, Think Tokens: 526, Answer Tokens: 165, Total Tokens: 690, Tokens per second: 5.82
[Index]: 8, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:13, First Token Time: 55.16s, Elapsed Time: 119.65s, Think Tokens: 566, Answer Tokens: 262, Total Tokens: 827, Tokens per second: 6.91
[Index]: 10, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:15, First Token Time: 61.30s, Elapsed Time: 121.09s, Think Tokens: 538, Answer Tokens: 227, Total Tokens: 764, Tokens per second: 6.31
[Index]: 15, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:18, First Token Time: 65.18s, Elapsed Time: 124.47s, Think Tokens: 541, Answer Tokens: 214, Total Tokens: 754, Tokens per second: 6.06
[Index]: 13, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:20, First Token Time: 64.29s, Elapsed Time: 126.23s, Think Tokens: 550, Answer Tokens: 236, Total Tokens: 785, Tokens per second: 6.22
[Index]: 11, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:21, First Token Time: 63.08s, Elapsed Time: 127.33s, Think Tokens: 565, Answer Tokens: 262, Total Tokens: 826, Tokens per second: 6.49
[Index]: 14, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:29, First Token Time: 70.80s, Elapsed Time: 134.98s, Think Tokens: 552, Answer Tokens: 278, Total Tokens: 829, Tokens per second: 6.14
[Index]: 19, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:04, First Token Time: 119.62s, Elapsed Time: 170.55s, Think Tokens: 497, Answer Tokens: 180, Total Tokens: 676, Tokens per second: 3.96
[Index]: 17, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:09, First Token Time: 121.12s, Elapsed Time: 174.78s, Think Tokens: 508, Answer Tokens: 242, Total Tokens: 749, Tokens per second: 4.29
[Index]: 16, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:10, First Token Time: 107.41s, Elapsed Time: 176.52s, Think Tokens: 608, Answer Tokens: 298, Total Tokens: 905, Tokens per second: 5.13
[Index]: 18, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:14, First Token Time: 118.51s, Elapsed Time: 179.85s, Think Tokens: 565, Answer Tokens: 267, Total Tokens: 831, Tokens per second: 4.62
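
The reported "Tokens per second" matches Total Tokens divided by Elapsed Time, so it includes the time a request spent waiting for its first token. The sketch below parses three records copied from the log above and also computes a rate that excludes that wait. The regular expression and the `decode_rate` helper are my own and assume the log format stays exactly as shown.

```python
import re

# Three records copied verbatim from the 20-request run above.
LOG = """\
[Index]: 3, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:52:57, First Token Time: 0.43s, Elapsed Time: 43.00s, Think Tokens: 380, Answer Tokens: 167, Total Tokens: 546, Tokens per second: 12.70
[Index]: 9, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:01, First Token Time: 43.05s, Elapsed Time: 107.38s, Think Tokens: 479, Answer Tokens: 341, Total Tokens: 819, Tokens per second: 7.63
[Index]: 18, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:14, First Token Time: 118.51s, Elapsed Time: 179.85s, Think Tokens: 565, Answer Tokens: 267, Total Tokens: 831, Tokens per second: 4.62
"""

PATTERN = re.compile(
    r"\[Index\]: (?P<index>\d+),.*?"
    r"First Token Time: (?P<ttft>[\d.]+)s, "
    r"Elapsed Time: (?P<elapsed>[\d.]+)s,.*?"
    r"Total Tokens: (?P<total>\d+), "
    r"Tokens per second: (?P<tps>[\d.]+)"
)


def parse(log):
    """Parse log lines into dictionaries of numeric fields."""
    rows = []
    for line in log.splitlines():
        m = PATTERN.search(line)
        if m:
            rows.append({
                "index": int(m["index"]),
                "ttft": float(m["ttft"]),
                "elapsed": float(m["elapsed"]),
                "total": int(m["total"]),
                "tps": float(m["tps"]),
            })
    return rows


def decode_rate(row):
    """Tokens per second counted only after the first token arrived."""
    return row["total"] / (row["elapsed"] - row["ttft"])


rows = parse(LOG)
for row in rows:
    print(f"index {row['index']:>2}: reported {row['tps']:.2f} tok/s, "
          f"after first token {decode_rate(row):.2f} tok/s")
```

For these three records the rate after the first token stays between 12 and 14 tokens/s, so the lower reported figures mostly reflect waiting time. In the full log the first eight requests get their first token within half a second while the rest wait 43 s or more, which looks consistent with the `--max-num-seqs 8` limit in the startup command, though I have not verified that as the cause.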

I will tune things and test again when I have time.

Tags: biren110e, domestic accelerator cards, Biren, distilled models
Last updated: April 5, 2025

wanjie
