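Background:
As mentioned before, in March I briefly experimented with three Chinese-made accelerator cards. This test of the Biren Biren110E was done at the same time; yesterday I added some simple load-test data, so I am publishing it after all.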
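Preparation
DaoCloud k8s environment
8 or 4 Biren Biren110E cards (32 GB version), with the driver installed and working
1. Download Biren's birensupa-vllm-25.02.07-C026S001T001B12997.tar (about 12 GB)
2. Download the DeepSeek-R1-Distill-Llama-8B model files (about 15 GB) and the DeepSeek-R1-Distill-Qwen-32B model files (about 71 GB); either ModelScope or Hugging Face works
3. Test targets: 8B on a single card, 32B on 4 cards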
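As a rough sanity check on these targets, the sketch below compares the download sizes quoted above with the memory available per card. It assumes the weights are split evenly across cards, uses the 0.8 memory fraction from the launch command later in this post, and ignores KV cache and activations, so it is a lower bound and not a sizing tool. The function name `fits_per_card` is made up for illustration, and GB and GiB are treated loosely as the same unit.

```python
def fits_per_card(weights_gb, cards, card_gb=32.0, mem_fraction=0.8):
    """Return (weights per card in GB, whether that share fits the per-card budget)."""
    share = weights_gb / cards
    budget = card_gb * mem_fraction  # memory the server is allowed to use on each card
    return share, share <= budget


# Sizes are the approximate download sizes quoted in the preparation list.
for name, size_gb, cards in [
    ("DeepSeek-R1-Distill-Llama-8B", 15, 1),
    ("DeepSeek-R1-Distill-Qwen-32B", 71, 1),
    ("DeepSeek-R1-Distill-Qwen-32B", 71, 4),
]:
    share, ok = fits_per_card(size_gb, cards)
    print(f"{name} on {cards} card(s): {share:.2f} GB of weights per card, fits={ok}")
```

Under these assumptions the 8B weights fit on one card, the 32B weights do not, and four cards bring the 32B share down to under 18 GB each, which is in line with the single-card 8B and 4-card 32B targets.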
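Deployment
Script
Adjust the parameters in it to match the card setup (for example, the model path and the birentech.com/gpu limit)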
kind: Deployment
apiVersion: apps/v1
metadata:
  name: deepseek-r1-8b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1-8b
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: deepseek-r1-8b
    spec:
      volumes:
        - name: volume-1741680483375
          hostPath:
            path: /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B
            type: ''
        - name: volume-1741684397558
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
      containers:
        - name: container-1
          image: birensupa-vllm:25.02.07-c026s001t001b12997
          command:
            - /bin/bash
            - '-c'
          args:
            - sleep infinity & wait
          env:
            - name: BRTB_DISABLE_ZERO_WS
              value: '1'
            - name: BRTB_DISABLE_ZERO_OUTPUT_UMA
              value: '1'
            - name: BRTB_DISABLE_ZERO_OUTPUT_NUMA
              value: '1'
            - name: BRTB_DISABLE_ZERO_REORDER
              value: '1'
            - name: BR_UMD_DEBUG_P2P_ACCESS_CHECK
              value: '1'
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: spawn
          resources:
            limits:
              birentech.com/gpu: '1'
          volumeMounts:
            - name: volume-1741680483375
              mountPath: /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B
            - name: volume-1741684397558
              mountPath: /dev/shm
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      affinity: {}
      schedulerName: default-scheduler
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
      dnsConfig: {}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
Testing
Simple load test
Test script, run inside the container
python /workspace/vllm_tools/vllm/benchmarks/benchmark_serving.py \
  --model DeepSeek-R1-Distill-Qwen-32B \
  --tokenizer /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B \
  --dataset_name random --random_input_len 4096 --random_output_len 1024 --num-prompts 64 \
  --trust-remote-code --port 8000
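First round of testing. The results were poor: with 64 prompts of 4096 input tokens each, mean time to first token was about 439 s and output throughput about 48 tokens/s.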
============ Serving Benchmark Result ============
Successful requests:                     64
Benchmark duration (s):                  1044.75
Total input tokens:                      262144
Total generated tokens:                  49985
Request throughput (req/s):              0.06
Output token throughput (tok/s):         47.84
Total Token throughput (tok/s):          298.76
---------------Time to First Token----------------
Mean TTFT (ms):                          438997.17
Median TTFT (ms):                        419538.92
P99 TTFT (ms):                           921304.29
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          155.02
Median TPOT (ms):                        153.33
P99 TPOT (ms):                           226.29
---------------Inter-token Latency----------------
Mean ITL (ms):                           155.07
Median ITL (ms):                         124.17
P99 ITL (ms):                            255.33
==================================================
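The throughput figures in this report follow directly from the token counts and the benchmark duration. The sketch below recomputes them from the numbers printed above; the helper name `rates` is just for illustration, and the formulas are the plain ratios implied by the metric names, not code taken from benchmark_serving.py.

```python
def rates(duration_s, input_tokens, output_tokens, num_requests):
    """Recompute the headline throughput figures from raw counts."""
    return {
        "req_per_s": num_requests / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "total_tok_per_s": (input_tokens + output_tokens) / duration_s,
    }


# Values copied from the benchmark report above.
r = rates(duration_s=1044.75, input_tokens=262144, output_tokens=49985, num_requests=64)
for key, value in r.items():
    print(f"{key}: {value:.2f}")

# 64 prompts of 4096 random input tokens account for the whole input count.
print(64 * 4096 == 262144)
```

Most of the 1044.75 s went into prefilling 4096-token prompts: the mean TTFT of roughly 439 s is over 40% of the whole run, while the per-token decode latency (about 155 ms) stays moderate.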
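Single-card test
Start the 8B model by hand for now (this can be tuned later and changed to start automatically). Note that the command serves the 8B Llama weights under the name DeepSeek-R1-Distill-Qwen-32B, which is why the requests in this post use that model name.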
......
# Temporary launch command, run inside the container
python3 -m vllm.entrypoints.openai.api_server --served-model-name DeepSeek-R1-Distill-Qwen-32B --model /br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B --gpu_memory_utilization=0.8 --block_size=128 --port 8000 --dtype=bfloat16 --tensor-parallel-size 1 --max-model-len 32768 --max-num-seqs 8 --max_num_batched_tokens 8192 --enable_chunked_prefill --trust-remote-code --device=supa
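One way to move from a hand-typed command toward an automatic start is to build the argument list in one place, so it can be reused as the container command. The sketch below does that using only the flags from the command above; `build_vllm_args` is a hypothetical helper, and whether the same values suit a 4-card 32B run was not tested in this post.

```python
import shlex


def build_vllm_args(model_path, served_name, tp_size=1):
    """Build the api_server argv from the flags used in the manual launch command."""
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--served-model-name", served_name,
        "--model", model_path,
        "--gpu_memory_utilization=0.8",
        "--block_size=128",
        "--port", "8000",
        "--dtype=bfloat16",
        "--tensor-parallel-size", str(tp_size),  # number of cards to shard across
        "--max-model-len", "32768",
        "--max-num-seqs", "8",
        "--max_num_batched_tokens", "8192",
        "--enable_chunked_prefill",
        "--trust-remote-code",
        "--device=supa",
    ]


args = build_vllm_args(
    "/br_data/model/model_weitht/DeepSeek-R1-Distill-Llama-8B",
    "DeepSeek-R1-Distill-Qwen-32B",
)
# Print a shell-safe one-liner; the list itself could replace `sleep infinity` in the pod spec.
print(shlex.join(args))
```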
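Simple request
Request command (the prompt, in Chinese, asks: "Who are you? I hear you can write poems while doing math problems, so please also prove 1+1=2")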
curl 'https://ip:8000/v1/chat/completions' \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer hello" \
  -d '{
    "model": "DeepSeek-R1-Distill-Qwen-32B",
    "messages": [{"role": "user", "content": "你是谁,听说能边写诗,边做数学题,那么请顺便证明1+1=2"}],
    "temperature": 0.9
  }'
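Response (in Chinese: the model reasons about the request, writes a short poem titled "风的韵律", then gives an informal argument that 1+1=2; it used 31 prompt tokens and 577 completion tokens)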
{"id":"chat-38d9d3904fe1462396781e3d17c4f632","object":"chat.completion","created":1743581032,"model":"DeepSeek-R1-Distill-Qwen-32B","choices":[{"index":0,"message":{"role":"assistant","content":"嗯,用户问我“你是谁,听说能边写诗,边做数学题,那么请顺便证明1+1=2”。首先,我需要理解用户的需求。看起来用户对我的能力有些好奇,想看看我能不能同时进行诗歌创作和数学证明。所以,我应该展示出我能够在不同领域都有能力表现。\n\n接下来,我要考虑如何回应。用户可能不仅仅是想看到诗和数学证明,而是想看看我是否有广泛的知识和应用能力。因此,我应该先写一首简短的诗,然后再进行数学证明,这样结构清晰,容易理解。\n\n在写诗的时候,我需要选择一个主题,比如自然或者日常生活,这样容易引起共鸣。比如,风、树、叶子这些元素很常见,容易描绘。然后,我要确保诗的结构流畅,有节奏感,同时传达出一种宁静或美好的感觉。\n\n关于数学证明,用户特别要求证明1+1=2,这是一个基本的数学事实,但我需要用一种简洁明了的方式来展示,可能用代数的方法或者逻辑推理。这样不仅证明了结果,还展示了思考过程。\n\n最后,我要总结一下,说明我不仅能写诗,还能进行数学证明,展示我的多功能性。这样用户就能看到我在不同领域的能力,满足他们的好奇心。\n\n在整个思考过程中,我需要确保语言自然,避免过于正式或生硬,让用户感觉到我的亲和力和专业性。同时,保持回答的结构清晰,先展示诗,再进行证明,最后总结,这样逻辑清晰,用户容易理解和接受。\n</think>\n\n好的,我可以边写诗边做数学题。首先,我来写一首诗:\n\n**风的韵律**\n\n风在树梢游荡\n带走了所有的叹息\n叶子轻轻摇晃\n像是在跳一支古老的舞\n\n现在,我来证明1+1=2。\n\n**数学证明:**\n\n我们知道,在数学中,数字是由抽象符号代表的概念。1通常代表一个单位,另一个1同样代表一个单位。将它们相加意味着将一个单位加上另一个单位。\n\n1 + 1 = 2\n\n这是因为当你将一个对象与另一个对象结合时,总数增加了一个单位。因此,1 + 1 = 2。\n\n总结来说,我不仅能够写诗,还能够进行数学证明。希望这对你有所帮助!","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":31,"total_tokens":608,"completion_tokens":577},"prompt_logprobs":null}
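In the reply above, the model's reasoning and its final answer arrive in a single `content` string, separated by a closing `</think>` tag (no opening tag appears in this response). A client that only wants the answer has to split on that tag itself. The sketch below is one minimal way to do it; `split_reasoning` and the English sample string are illustrative, not part of vLLM or of the original test.

```python
def split_reasoning(content, close_tag="</think>", open_tag="<think>"):
    """Split a reply into (reasoning, answer) around the closing think tag."""
    reasoning, sep, answer = content.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole reply as the answer.
        return "", content.strip()
    reasoning = reasoning.strip()
    if reasoning.startswith(open_tag):
        # Some servers include the opening tag; drop it if present.
        reasoning = reasoning[len(open_tag):].strip()
    return reasoning, answer.strip()


# Made-up sample shaped like the response above.
sample = "First I should write a poem, then the proof.\n</think>\n\nHere is a poem and a proof."
reasoning, answer = split_reasoning(sample)
print(len(reasoning), "characters of reasoning")
print(answer)
```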
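During the test, I watched the card status output (brsmi) from inside container 0.
Running the 8B model on a single card gives roughly 10 tokens/s, with GPU utilization at about 94%.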
Every 5.0s: brsmi    mgt1-10-10-2-12: Thu Apr 3 07:29:08 2025

Thu Apr 3 07:29:08 2025
+------------------------------------------------------------------------------+
| BR-SMI 1.6.6       Driver Version: 1.6.5       SUPA Version: N/A             |
+----------------------------+-----------------------+-------------------------+
| GPU  Name                  | Bus-Id                | Volatile Uncorr. ECC    |
| Temp  Perf  Pwr:Usage/Cap  | Memory-Usage          | GPU-Util  Compute M.    |
| Persistence-M              |                       | SVI M.                  |
+============================+=======================+=========================+
| 0    Biren110E             | 00000000:2A:00.0      | 0                       |
| 51    P0    27W / 66W      | 27032MiB / 32512MiB   | 94%       Default       |
| Off                        |                       | Disabled                |
+----------------------------+-----------------------+-------------------------+
| 1    Biren110E             | 00000000:2B:00.0      | 0                       |
| 41    P0    13W / 66W      |     0MiB / 32512MiB   | 0%        Default       |
| Off                        |                       | Disabled                |
+----------------------------+-----------------------+-------------------------+
| 2    Biren110E             | 00000000:3D:00.0      | 0                       |
| 45    P0    17W / 66W      |     0MiB / 32512MiB   | 0%        Default       |
| Off                        |                       | Disabled                |
+----------------------------+-----------------------+-------------------------+
| 3    Biren110E             | 00000000:99:00.0      | 0                       |
| 38    P0    13W / 66W      |     0MiB / 32512MiB   | 0%        Default       |
| Off                        |                       | Disabled                |
+----------------------------+-----------------------+-------------------------+
| 4    Biren110E             | 00000000:9A:00.0      | 0                       |
Every 5.0s: brsmi    mgt1-10-10-2-12: Thu Apr 3 07:29:19 2025
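The memory column for GPU 0 can be related to the `--gpu_memory_utilization=0.8` setting in the launch command. The sketch below pulls the two MiB figures out of a row copied from the output above and compares them with that budget; the regular expression is an assumption about the row layout seen here, not a documented brsmi format.

```python
import re


def memory_usage(row):
    """Extract (used_mib, total_mib) from a row containing e.g. '27032MiB / 32512MiB'."""
    m = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", row)
    if m is None:
        raise ValueError("no memory field found in row")
    return int(m.group(1)), int(m.group(2))


# Row for GPU 0, copied from the brsmi output above.
row = "| 51 P0 27W / 66W | 27032MiB / 32512MiB | 94% Default |"
used, total = memory_usage(row)
print(f"in use: {used / total:.1%} of card memory")
print(f"0.8 budget: {0.8 * total:.0f} MiB")
```

The measured usage sits somewhat above the 0.8 budget; the post does not say what accounts for the difference.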

Another load-test script
Single request (the prompt asks, in Chinese, "What is Kubernetes?")
nerdctl run --rm registry.cn-shanghai.aliyuncs.com/jamesxiong/model-performance:amd64-v0.1.2 python run.py \
  --api_key "asd" \
  --model_name "DeepSeek-R1-Distill-Qwen-32B" \
  --base_url "http://10.10.2.12:32578/v1" \
  --system_prompt "" \
  --history '[{"role": "user", "content": "kubernetes是什么?"}]' \
  --gen_conf '{"temperature": 0.01}' \
  --num_requests 1 \
  --print_answer "no" \
  --stream "yes"
Test result
Output is about 12.8 tokens/s for a single request
[Index]: 0, Start Time: 2025-04-02 07:49:50, End Time: 2025-04-02 07:51:32, First Token Time: 3.29s, Elapsed Time: 101.91s, Think Tokens: 443, Answer Tokens: 862, Total Tokens: 1304, Tokens per second: 12.80
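20 concurrent requests, single card, 8B (the prompt asks, in Chinese, what Kubernetes is, whether the model would pick Shanghai or Beijing, and for a short four-character-line poem)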
nerdctl run --rm registry.cn-shanghai.aliyuncs.com/jamesxiong/model-performance:amd64-v0.1.2 python run.py \
  --api_key "asd" \
  --model_name "DeepSeek-R1-Distill-Qwen-32B" \
  --base_url "http://10.10.2.12:32578/v1" \
  --system_prompt "" \
  --history '[{"role": "user", "content": "kubernetes是什么?你希望上海还是北京,你能写诗么,请写一首4言绝句"}]' \
  --gen_conf '{"temperature": 0.01}' \
  --num_requests 20 \
  --print_answer "no" \
  --stream "yes"
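Results with 20 concurrent requests, single card, 8B
Per-request throughput varies widely, from about 4 to 12.8 tokens/s. The first eight requests to finish each ran at about 12.7 tokens/s with sub-second first-token times; the remaining twelve show first-token times between 43 s and 121 s, which suggests they were queued. That would be consistent with --max-num-seqs 8 in the launch command, since the reported tokens per second is total tokens divided by elapsed time and so includes the wait.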
[Index]: 3, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:52:57, First Token Time: 0.43s, Elapsed Time: 43.00s, Think Tokens: 380, Answer Tokens: 167, Total Tokens: 546, Tokens per second: 12.70
[Index]: 1, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:09, First Token Time: 0.44s, Elapsed Time: 55.11s, Think Tokens: 533, Answer Tokens: 172, Total Tokens: 704, Tokens per second: 12.77
[Index]: 7, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:15, First Token Time: 0.39s, Elapsed Time: 61.22s, Think Tokens: 563, Answer Tokens: 217, Total Tokens: 779, Tokens per second: 12.73
[Index]: 5, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:17, First Token Time: 0.41s, Elapsed Time: 63.03s, Think Tokens: 554, Answer Tokens: 247, Total Tokens: 800, Tokens per second: 12.69
[Index]: 4, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:18, First Token Time: 0.42s, Elapsed Time: 64.28s, Think Tokens: 565, Answer Tokens: 252, Total Tokens: 816, Tokens per second: 12.69
[Index]: 2, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:19, First Token Time: 0.44s, Elapsed Time: 64.99s, Think Tokens: 563, Answer Tokens: 270, Total Tokens: 832, Tokens per second: 12.80
[Index]: 6, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:19, First Token Time: 0.39s, Elapsed Time: 65.18s, Think Tokens: 560, Answer Tokens: 266, Total Tokens: 825, Tokens per second: 12.66
[Index]: 0, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:53:24, First Token Time: 0.23s, Elapsed Time: 70.85s, Think Tokens: 687, Answer Tokens: 202, Total Tokens: 888, Tokens per second: 12.53
[Index]: 9, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:01, First Token Time: 43.05s, Elapsed Time: 107.38s, Think Tokens: 479, Answer Tokens: 341, Total Tokens: 819, Tokens per second: 7.63
[Index]: 12, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:12, First Token Time: 64.98s, Elapsed Time: 118.46s, Think Tokens: 526, Answer Tokens: 165, Total Tokens: 690, Tokens per second: 5.82
[Index]: 8, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:13, First Token Time: 55.16s, Elapsed Time: 119.65s, Think Tokens: 566, Answer Tokens: 262, Total Tokens: 827, Tokens per second: 6.91
[Index]: 10, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:15, First Token Time: 61.30s, Elapsed Time: 121.09s, Think Tokens: 538, Answer Tokens: 227, Total Tokens: 764, Tokens per second: 6.31
[Index]: 15, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:18, First Token Time: 65.18s, Elapsed Time: 124.47s, Think Tokens: 541, Answer Tokens: 214, Total Tokens: 754, Tokens per second: 6.06
[Index]: 13, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:20, First Token Time: 64.29s, Elapsed Time: 126.23s, Think Tokens: 550, Answer Tokens: 236, Total Tokens: 785, Tokens per second: 6.22
[Index]: 11, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:21, First Token Time: 63.08s, Elapsed Time: 127.33s, Think Tokens: 565, Answer Tokens: 262, Total Tokens: 826, Tokens per second: 6.49
[Index]: 14, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:29, First Token Time: 70.80s, Elapsed Time: 134.98s, Think Tokens: 552, Answer Tokens: 278, Total Tokens: 829, Tokens per second: 6.14
[Index]: 19, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:04, First Token Time: 119.62s, Elapsed Time: 170.55s, Think Tokens: 497, Answer Tokens: 180, Total Tokens: 676, Tokens per second: 3.96
[Index]: 17, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:09, First Token Time: 121.12s, Elapsed Time: 174.78s, Think Tokens: 508, Answer Tokens: 242, Total Tokens: 749, Tokens per second: 4.29
[Index]: 16, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:10, First Token Time: 107.41s, Elapsed Time: 176.52s, Think Tokens: 608, Answer Tokens: 298, Total Tokens: 905, Tokens per second: 5.13
[Index]: 18, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:14, First Token Time: 118.51s, Elapsed Time: 179.85s, Think Tokens: 565, Answer Tokens: 267, Total Tokens: 831, Tokens per second: 4.62
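The "Tokens per second" column in these lines equals total tokens divided by elapsed time, so it includes the time each request waited for its first token. To separate waiting from generation, the sketch below parses result lines and also computes a rate over the time after the first token. The regular expression is written against the line format shown above, the three sample lines are copied from the results, and `parse_results` and the `decode_tps` field are names made up for this illustration.

```python
import re

LINE_RE = re.compile(
    r"\[Index\]: (?P<index>\d+), .*?"
    r"First Token Time: (?P<ttft>[\d.]+)s, "
    r"Elapsed Time: (?P<elapsed>[\d.]+)s, .*?"
    r"Total Tokens: (?P<total>\d+), "
    r"Tokens per second: (?P<tps>[\d.]+)"
)

SAMPLE = """\
[Index]: 3, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:52:57, First Token Time: 0.43s, Elapsed Time: 43.00s, Think Tokens: 380, Answer Tokens: 167, Total Tokens: 546, Tokens per second: 12.70
[Index]: 9, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:54:01, First Token Time: 43.05s, Elapsed Time: 107.38s, Think Tokens: 479, Answer Tokens: 341, Total Tokens: 819, Tokens per second: 7.63
[Index]: 19, Start Time: 2025-04-02 07:52:14, End Time: 2025-04-02 07:55:04, First Token Time: 119.62s, Elapsed Time: 170.55s, Think Tokens: 497, Answer Tokens: 180, Total Tokens: 676, Tokens per second: 3.96
"""


def parse_results(text):
    """Parse result lines into dicts with end-to-end and post-first-token rates."""
    rows = []
    for m in LINE_RE.finditer(text):
        ttft = float(m["ttft"])
        elapsed = float(m["elapsed"])
        total = int(m["total"])
        rows.append({
            "index": int(m["index"]),
            "ttft_s": ttft,
            "end_to_end_tps": total / elapsed,        # what the tool reports
            "decode_tps": total / (elapsed - ttft),   # excludes the wait for the first token
        })
    return rows


rows = parse_results(SAMPLE)
for row in rows:
    print(f"index {row['index']}: end-to-end {row['end_to_end_tps']:.2f} tok/s, "
          f"after first token {row['decode_tps']:.2f} tok/s")
```

Once the wait is excluded, the late requests generate at roughly the same rate as the early ones, so the spread in the reported figures comes mainly from queueing. Summed over all 20 lines, the run produced about 15,645 tokens in about 180 s, or roughly 87 tokens/s in aggregate.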
I'll tune this and test again when I have time.