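I have recently worked with quite a few GPU machines: H100, H800, H20, A800 and RTX 4090, plus the Chinese-made MetaX (沐曦) C500, Biren (壁仞) 110E and Enflame (燧原) S60. Today I run a quick test of deploying the DeepSeek-R1-Distill-Llama-70B model on the MetaX C500.
The official document, "MetaX C500 DeepSeek R1 Distilled Model Inference Deployment Manual", deploys with Docker. Here I switched to a Kubernetes StatefulSet (sts) deployment and ran a few simple tests. The results look good.
The machine used here is a physical server with 8 cards; in theory 4 cards are enough to serve the 70B model. MetaX's official explanation is as follows:
For inference, the required GPU memory can be estimated with the formulas below.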
1. FP16/BF16 (half precision): memory (GB) = parameters (billions) × 2 × 1.2
2. INT8 (integer): memory (GB) = parameters (billions) × 1.1
From the formulas above: a 70B model is usually run in BF16 by default, so inference needs roughly 70 × 2 × 1.2 = 168 GB. The KV cache also takes GPU memory, which can be estimated as follows:
num_hidden_layers * head_dim * seq_len * bs * 2 * precision  # precision is bytes per value: bfloat16 takes two bytes, so multiply by 2
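The KV cache formula can be evaluated with a short Python sketch. The model dimensions used below (80 layers, head_dim 128, 8 KV heads) are my assumptions for a Llama-70B-class model and should be checked against the model's config.json. The `num_kv_heads` parameter is also my addition: the formula as quoted has no head-count term, so with the default of 1 the function reproduces it exactly, while the usual full-model accounting multiplies by the number of KV heads.

```python
def kv_cache_bytes(num_hidden_layers, head_dim, seq_len, batch_size,
                   bytes_per_value=2, num_kv_heads=1):
    """KV cache size in bytes.

    With num_kv_heads=1 this is exactly the quoted formula:
    num_hidden_layers * head_dim * seq_len * bs * 2 * precision.
    num_kv_heads is a hypothetical extra factor added for illustration.
    """
    kv_tensors = 2  # one K tensor and one V tensor per layer
    return (num_hidden_layers * head_dim * seq_len * batch_size
            * kv_tensors * bytes_per_value * num_kv_heads)


GIB = 1024 ** 3

# Assumed dimensions: 80 layers, head_dim 128, one 65536-token sequence, bfloat16.
per_head = kv_cache_bytes(80, 128, 65536, 1)
all_heads = kv_cache_bytes(80, 128, 65536, 1, num_kv_heads=8)

print(f"{per_head / GIB:.1f} GiB per KV head, "
      f"{all_heads / GIB:.1f} GiB for 8 KV heads")
# prints "2.5 GiB per KV head, 20.0 GiB for 8 KV heads"
```

Under these assumptions a single full-length 65536-token sequence already needs tens of GiB of cache across the cards, which is why the context length matters as much as the weights.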
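In addition, vLLM computes how many blocks to reserve from the preset GPU memory pool size, and a longer context usually needs more blocks. Taking all of this into account, four or more cards are recommended for 70B inference on the C500.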
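As a rough check of the "four cards" recommendation, the weight-memory rule of thumb can be combined with the per-card memory. This is only a sketch under assumptions: 64 GB per card is taken from the 65536 MiB that mx-smi reports (ignoring the GB/GiB difference), 64 attention heads is my assumption for a Llama-70B-class model, and the requirement that the card count divide the head count reflects how tensor parallelism is normally constrained in vLLM rather than anything stated in the MetaX document.

```python
import math


def weight_memory_gb(params_billion, bytes_per_param, overhead):
    """Rule-of-thumb memory for weights: parameters x bytes x overhead factor."""
    return params_billion * bytes_per_param * overhead


def min_cards(required_gb, card_gb, num_attention_heads):
    """Smallest card count that fits required_gb and divides the head count."""
    n = math.ceil(required_gb / card_gb)
    while num_attention_heads % n != 0:
        n += 1
    return n


bf16_gb = weight_memory_gb(70, 2, 1.2)   # FP16/BF16 formula
int8_gb = weight_memory_gb(70, 1, 1.1)   # INT8 formula

print(f"BF16: {bf16_gb:.0f} GB -> {min_cards(bf16_gb, 64, 64)} cards")
print(f"INT8: {int8_gb:.0f} GB -> {min_cards(int8_gb, 64, 64)} cards")
```

For BF16 the weights alone need three 64 GB cards, and the divisibility constraint rounds that up to four. The KV cache is not included here, so it has to fit in whatever is left on those cards.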
Download the container image from https://sw-download.metax-tech.com/index
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: deepseek-r1-sts
  namespace: deepseek
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deepseek-r1-sts
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: deepseek-r1-sts
    spec:
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
        - name: data
          hostPath:
            path: /root/vllm/DeepSeek-R1-Distill-Llama-70B
            type: DirectoryOrCreate
        - name: tokenizer
          hostPath:
            path: /root/vllm/DeepSeek-R1-Distill-Llama-70B
            type: DirectoryOrCreate
      containers:
        - name: container-1
          image: inference-llm-deepseekr1_opt:v2
          command:
            - /bin/bash
            - '-c'
          args:
            - |
              export PATH="/opt/conda/bin:$PATH"
              if [ "$POD_INDEX" = "0" ]; then
                ray start --head --port=5000 --num-gpus=4
                vllm serve /root/vllm/DeepSeek-R1-Distill-Llama-70B --served-model-name deepseek-r1-70b --trust-remote-code --tensor-parallel-size=4 --pipeline-parallel-size=1 --max-model-len=65536 --enable-reasoning --reasoning-parser deepseek_r1 --enable-chunked-prefill # --enforce-eager
              else
                ray start --block --address=deepseek-r1-sts-master:5000 --num-gpus=4
              fi
          env:
            - name: NCCL_CUMEM_ENABLE
              value: '0'
            - name: NCCL_DEBUG
              value: INFO
            - name: VLLM_LOGGING_LEVEL
              value: DEBUG
            - name: NCCL_IB_DISABLE
              value: '1'
            - name: POD_INDEX
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.labels['apps.kubernetes.io/pod-index']
          resources:
            limits:
              metax-tech.com/gpu: '4'
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
            - name: data
              mountPath: /root/vllm/DeepSeek-R1-Distill-Llama-70B
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: false
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      affinity: {}
      schedulerName: default-scheduler
      tolerations:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
        - key: node.kubernetes.io/unreachable
          operator: Exists
          effect: NoExecute
          tolerationSeconds: 300
      dnsConfig: {}
  serviceName: ''
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
---
kind: Service
apiVersion: v1
metadata:
  name: deepseek-r1-sts-master
  namespace: deepseek
spec:
  ports:
    - name: tcp-5000
      protocol: TCP
      port: 5000
      targetPort: 5000
    - name: tcp-8000
      protocol: TCP
      port: 8000
      targetPort: 8000
  selector:
    app: deepseek-r1-sts
    apps.kubernetes.io/pod-index: '0'
  clusterIP: None
  clusterIPs:
    - None
  type: ClusterIP
  sessionAffinity: None
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  internalTrafficPolicy: Cluster
---
kind: Service
apiVersion: v1
metadata:
  name: deepseek-r1-sts-master-external
  namespace: deepseek
spec:
  ports:
    - name: tcp-5000
      protocol: TCP
      port: 5000
      targetPort: 5000
      nodePort:
    - name: tcp-8000
      protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort:
  selector:
    app: deepseek-r1-sts
    apps.kubernetes.io/pod-index: '0'
  clusterIP:
  clusterIPs:
  type: NodePort
  sessionAffinity: None
  externalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  internalTrafficPolicy: Cluster
curl -H "Content-Type: application/json" -d '{
  "model": "deepseek-r1-70b",
  "prompt": "你是谁,简单用中英日各100字介绍一下你自己",
  "max_tokens": 2048,
  "temperature": 0.7
}'
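The same request can be sent from Python with only the standard library. This is a minimal sketch: the base URL `http://localhost:8000` is a hypothetical placeholder (the real address depends on how the NodePort service is exposed), and the `/v1/completions` path assumes the image serves vLLM's usual OpenAI-compatible API. `build_completion_request` and `send` are illustrative helper names.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, model="deepseek-r1-70b",
                             max_tokens=2048, temperature=0.7):
    """Build (but do not send) a POST request mirroring the curl payload."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical base URL; replace it with the address of your own service.
req = build_completion_request(
    "http://localhost:8000",
    "你是谁,简单用中英日各100字介绍一下你自己",
)
print(req.get_method(), req.full_url)
```

Calling `send(req)` against a running server should return a JSON object whose generated text is typically under `choices[0]["text"]`.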

python /workspace/deepseek/test/benchmark_serving.py \
  --model deepseek-r1-70b \
  --tokenizer /root/vllm/DeepSeek-R1-Distill-Llama-70B \
  --dataset_name random --random_input_len 3072 --random_output_len 1024 --num-prompts 64 \
  --trust-remote-code --ignore-eos --port 8000
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 64/64 [01:57<00:00, 1.84s/it]
============ Serving Benchmark Result ============
Successful requests:                     64
Benchmark duration (s):                  117.79
Total input tokens:                      196608
Total generated tokens:                  65536
Request throughput (req/s):              0.54
Output token throughput (tok/s):         556.37
Total Token throughput (tok/s):          2225.50
---------------Time to First Token----------------
Mean TTFT (ms):                          25564.03
Median TTFT (ms):                        25332.88
P99 TTFT (ms):                           49867.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          87.69
Median TPOT (ms):                        88.07
P99 TPOT (ms):                           108.19
---------------Inter-token Latency----------------
Mean ITL (ms):                           87.69
Median ITL (ms):                         66.53
P99 ITL (ms):                            525.14
==================================================
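The throughput lines in this report follow directly from the totals and the benchmark duration. The sketch below recomputes them as a sanity check; `derived_metrics` is a hypothetical helper written for this post, not part of benchmark_serving.py, and small differences in the last digit come from the duration being rounded to two decimals in the report.

```python
def derived_metrics(num_requests, input_tokens, output_tokens, duration_s):
    """Recompute the throughput figures from the raw totals."""
    return {
        "request_throughput": num_requests / duration_s,
        "output_token_throughput": output_tokens / duration_s,
        "total_token_throughput": (input_tokens + output_tokens) / duration_s,
    }


# Totals copied from the benchmark report.
m = derived_metrics(num_requests=64, input_tokens=196608,
                    output_tokens=65536, duration_s=117.79)

for name, value in m.items():
    print(f"{name}: {value:.2f}")

# The token totals are simply 64 prompts x 3072 input and x 1024 output tokens.
print(64 * 3072, 64 * 1024)
```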
mx-smi output during the test:
Every 2.0s: mx-smi          deepseek-r1-sts-1: Sun Mar 16 10:14:45 2025

mx-smi  version: 2.1.9

=================== MetaX System Management Interface Log ===================
Timestamp                                         : Sun Mar 16 10:14:45 2025

Attached GPUs                                     : 4
+---------------------------------------------------------------------------------+
| MX-SMI 2.1.9                        Kernel Mode Driver Version: 2.9.8           |
| MACA Version:                       BIOS Version:                               |
|------------------------------------+---------------------+----------------------+
| GPU         NAME                   | Bus-id              | GPU-Util             |
| Temp        Power                  | Memory-Usage        |                      |
|====================================+=====================+======================|
| 0           MXC500                 | 0000:08:00.0        | 21%                  |
| 44C         125W                   | 59723/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 1           MXC500                 | 0000:09:00.0        | 21%                  |
| 43C         125W                   | 59243/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 2           MXC500                 | 0000:0e:00.0        | 21%                  |
| 43C         129W                   | 59243/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+
| 3           MXC500                 | 0000:11:00.0        | 21%                  |
| 44C         130W                   | 59243/65536 MiB     |                      |
+------------------------------------+---------------------+----------------------+

+---------------------------------------------------------------------------------+
| Process:                                                                        |
|  GPU                    PID         Process Name                 GPU Memory     |
|                                                                  Usage(MiB)     |
|=================================================================================|
|  0                      326         ray::RayWorkerW              58804          |
|  1                      327         ray::RayWorkerW              58324          |
|  2                      328         ray::RayWorkerW              58324          |
|  3                      329         ray::RayWorkerW              58324          |
+---------------------------------------------------------------------------------+
nerdctl run --rm registry.cn-shanghai.aliyuncs.com/jamesxiong/model-performance:amd64-v0.1.2 \
  python run.py \
  --api_key "hello-world" \
  --model_name "deepseek-r1-70b" \
  --base_url "" \
  --system_prompt "" \
  --history '[{"role": "user", "content": "你怎么看待上海在未来10年AI领域的发展"}]' \
  --gen_conf '{"temperature": 0.01}' \
  --num_requests 100 \
  --print_answer "no" \
  --stream "yes"
The average is a little over 22 tokens/s, which is acceptable.
vllm serve /root/vllm/DeepSeek-R1-Distill-Llama-70B --served-model-name deepseek-r1-70b --trust-remote-code --tensor-parallel-size=2 --pipeline-parallel-size=2 --max-model-len=4096 --enable-reasoning --reasoning-parser deepseek_r1 --enable-chunked-prefill --dtype bfloat16 # --device cuda # --enforce-eager
Two other places also change from 4 to 2: --num-gpus=2 and metax-tech.com/gpu: '2'.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 64/64 [04:17<00:00, 4.02s/it]
============ Serving Benchmark Result ============
Successful requests:                     64
Benchmark duration (s):                  257.21
Total input tokens:                      196608
Total generated tokens:                  65536
Request throughput (req/s):              0.25
Output token throughput (tok/s):         254.80
Total Token throughput (tok/s):          1019.19
---------------Time to First Token----------------
Mean TTFT (ms):                          24195.59
Median TTFT (ms):                        24104.36
P99 TTFT (ms):                           46168.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          223.88
Median TPOT (ms):                        224.07
P99 TPOT (ms):                           241.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           223.88
Median ITL (ms):                         205.77
P99 ITL (ms):                            962.69
==================================================
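To put the two benchmark reports side by side, the headline numbers can be turned into ratios. The sketch below only uses figures copied from the two reports; the run labels "tp4_pp1" and "tp2_pp2" are my own shorthand for the two vllm serve configurations. Note that the two runs also differ in --max-model-len (65536 vs 4096), so the comparison is indicative rather than a controlled experiment.

```python
# Figures copied from the two benchmark reports in this post.
runs = {
    "tp4_pp1": {"duration_s": 117.79, "output_tps": 556.37,
                "mean_ttft_ms": 25564.03, "mean_tpot_ms": 87.69},
    "tp2_pp2": {"duration_s": 257.21, "output_tps": 254.80,
                "mean_ttft_ms": 24195.59, "mean_tpot_ms": 223.88},
}

a, b = runs["tp4_pp1"], runs["tp2_pp2"]

throughput_ratio = a["output_tps"] / b["output_tps"]    # higher is better for tp4_pp1
tpot_ratio = b["mean_tpot_ms"] / a["mean_tpot_ms"]      # how much slower each token is
ttft_ratio = b["mean_ttft_ms"] / a["mean_ttft_ms"]      # close to 1 means similar TTFT

print(f"output throughput: {throughput_ratio:.2f}x in favour of tp4_pp1")
print(f"mean TPOT: {tpot_ratio:.2f}x slower with tp2_pp2")
print(f"mean TTFT ratio (tp2_pp2 / tp4_pp1): {ttft_ratio:.2f}")
```

In this test the pure tensor-parallel layout delivers a bit more than twice the output throughput, while time to first token is roughly the same in both layouts.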