DGX-Spark LLM模型選擇 (2026Q2)

自從去年十月公司購入兩臺DGX-Spark以來，一直都在找一個速度不會太慢，同時又具備複雜解題能力的Local LLM模型

這件事在DGX-Spark上面可能是空集合…至少目前為止是

目前的主力模型是Qwen3.6-35B-A3B，兩個禮拜前剛從Qwen3.5-35B-A3B升級上來的，這兩個型號的表現可以說是令人驚艷，僅35B參數量3B活躍參數就可以表現得比GPT-OSS-120B還要好上許多。在使用體感上，Agent性能與Tool Call方面也是大幅領先GPT-OSS-120B

但是
DGX-Spark有128GB統一記憶體誒?
裝得下120B的機器只拿來裝35B是不是太浪費了一點
於是就開始了不斷調校模型參數、微調選想
想盡辦法讓他跑的又快又好的測試之旅

模型設定
#

下面的模型全部用的都是vllm/vllm-openai的容器跑起來的服務，vllm參數大部分是網羅網路上各家的優點，自己不斷排列組合，試錯試出來相對比較好的參數，可能不會是社群上的最佳解答，如果要抄我的作業還請注意 XD

另外，模型的本體檔案 .safetensors 我喜歡自己下載完，另外存到自己指定的路徑，而不是帶入 HF_TOKEN 參數在啟動時下載

Qwen3.5 35B-A3B
#

HuggingFace: Qwen/Qwen3.5-35B-A3B-FP8

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
---
services:
  qwen3.5:
    container_name: vllm-qwen3.5
    image: vllm/vllm-openai:v0.18.0-cu130
    restart: unless-stopped
    network_mode: "host"
    ipc: host
    volumes:
      - ~/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    command: >
      /models/Qwen/Qwen3.5-35B-A3B-FP8
      --host 0.0.0.0
      --port 18001
      --max-model-len 524288
      --gpu-memory-utilization 0.90
      --served-model-name qwen3.5
      --reasoning-parser qwen3
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 2.0, "original_max_position_embeddings": 262144}}}'
      --api-key sk-VERY_SECRET_KEY

Qwen3.5 如果加入 --kv-cache-dtype fp8 參數會無法啟動

Qwen3.6 35B-A3B
#

與 Qwen 3.5 的差異

vllm 版本升級到 0.20.0
KV-cache 指定用 fp8 格式 --kv-cache-dtype fp8
開啟MTP=2

HuggingFace: Qwen/Qwen3.6-35B-A3B-FP8

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
---
services:
  qwen3.6:
    container_name: vllm-qwen3.6
    image: vllm/vllm-openai:v0.20.0-cu130
    restart: unless-stopped
    network_mode: "host"
    ipc: host
    volumes:
      - ~/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    command: >
      /models/Qwen/Qwen3.6-35B-A3B-FP8
      --host 0.0.0.0
      --port 18001
      --max-model-len 524288
      --gpu-memory-utilization 0.90
      --served-model-name qwen3.6
      --reasoning-parser qwen3
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --kv-cache-dtype fp8
      --max-num-batched-tokens 4096
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 2.0, "original_max_position_embeddings": 262144}}}'
      --api-key sk-VERY_SECRET_KEY

Qwen3.5 122B-A10B
#

要放得下122B模型只能找4Bit量化的模型，找了找最後選了RedHatAI做出來的NVFP4版本，理論上DGX-Spark對NVFP4的支援度很高

參數基本上是抄Model Card的，但是我記得Qwen3.5對MTP的支援不好，所以這邊我有分別測關MTP跟MTP=2的狀況

HuggingFace: RedHatAI/Qwen3.5-122B-A10B-NVFP4

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
services:
  qwen3.5:
    container_name: vllm-qwen3.5
    image: vllm/vllm-openai:v0.20.0-cu130
    restart: unless-stopped
    network_mode: "host"
    ipc: host
    volumes:
      - ~/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - VLLM_USE_FLASHINFER_MOE_FP4=0
    command: >
      /models/RedHatAI/Qwen3.5-122B-A10B-NVFP4
      --served-model-name qwen3.5
      --host 0.0.0.0
      --port 8000
      --async-scheduling
      --dtype auto
      --kv-cache-dtype fp8
      --tensor-parallel-size 1
      --pipeline-parallel-size 1
      --data-parallel-size 1
      --trust-remote-code
      --gpu-memory-utilization 0.90
      --enable-chunked-prefill
      --max-num-seqs 4
      --max-model-len 262144
      --max-num-batched-tokens 8192
      --reasoning-parser qwen3
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder

Nemotron-3-Super 120B
#

Nemotron-3-Super畢竟是NVIDIA的親兒子，HuggingFace上面就直接有完整的vllm啟動指令可以抄，甚至docker版本都準備好了

基本上下面這一份完全是Model Card上面的內容，我只有改模型參數跟改成docker compose方式啟動container

HuggingFace: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
services:
  nemotron3-super:
    container_name: vllm-nemotron3
    image: vllm/vllm-openai:v0.20.0-cu130
    restart: unless-stopped
    network_mode: "host"
    ipc: host
    volumes:
      - ~/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm
      - VLLM_USE_FLASHINFER_MOE_FP4=0
    command: >
      /models/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
      --served-model-name nvidia/nemotron-3-super
      --host 0.0.0.0
      --port 8000
      --async-scheduling
      --dtype auto
      --kv-cache-dtype fp8
      --tensor-parallel-size 1
      --pipeline-parallel-size 1
      --data-parallel-size 1
      --trust-remote-code
      --gpu-memory-utilization 0.90
      --enable-chunked-prefill
      --max-num-seqs 4
      --max-model-len 1000000
      --moe-backend marlin
      --mamba_ssm_cache_dtype float16
      --quantization fp4
      --speculative_config '{"method":"mtp","num_speculative_tokens":3,"moe_backend":"triton"}'
      --reasoning-parser-plugin /models/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/super_v3_reasoning_parser.py
      --reasoning-parser super_v3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder

Benchmark測試
#

由於是放在公司內供團隊使用，考量到會同時有Coding Agent與AI Agent跟大量併發請求
所以我個人比較關注於速度方面的指標，特別是併發場景下的速度單一請求的速度
因此只比較速度方面的指標，模型的能力與智商方面參考官方的說法就好

測試使用的是llama-benchy，這是一個可適用於所有LLM後端的Benchmark工具
只要LLM後端有提供OpenAI API相容的 /v1/chat/completions 端點就可以用

測試時llama-benchy版本0.3.7

測試參數
#

llama-benchy --base-url http://127.0.0.1:18001/v1 \
  --model qwen3.5 \
  --api-key sk-VERY_SECRET_KEY \
  --tokenizer /path/to/Qwen/Qwen3.5-35B-A3B-FP8/ \
  --concurrency 1 2 4 8 --pp 2048 --tg 128

測試參數的部份基本上都一樣，僅各個模型的tokenizer不同而已

測試結果
#

Qwen3.5-35B-A3B-FP8 (no-MTP)

model	test	t/s (total)	t/s (req)	ttfr (ms)	e2e_ttft (ms)
qwen3.5	pp2048 (c1)	4009.02 ± 80.90	4009.02 ± 80.90	513.13 ± 10.36	513.13 ± 10.36
qwen3.5	tg128 (c1)	49.77 ± 0.10	49.77 ± 0.10
qwen3.5	pp2048 (c2)	4138.44 ± 17.21	2695.37 ± 621.94	804.91 ± 185.38	804.91 ± 185.38
qwen3.5	tg128 (c2)	72.17 ± 0.20	38.41 ± 1.92
qwen3.5	pp2048 (c4)	4193.79 ± 114.18	1827.23 ± 787.25	1322.46 ± 486.34	1322.46 ± 486.34
qwen3.5	tg128 (c4)	95.32 ± 1.80	27.73 ± 2.61
qwen3.5	pp2048 (c8)	4021.53 ± 3.90	1205.39 ± 833.24	2386.87 ± 1135.77	2386.87 ± 1135.77
qwen3.5	tg128 (c8)	115.43 ± 0.64	18.92 ± 2.76

Qwen3.6-35B-A3B-FP8 (MTP=2)

model	test	t/s (total)	t/s (req)	ttfr (ms)	e2e_ttft (ms)
qwen3.6	pp2048 (c1)	4565.80 ± 193.56	4565.80 ± 193.56	450.55 ± 19.72	450.55 ± 19.72
qwen3.6	tg128 (c1)	59.00 ± 1.75	59.00 ± 1.75
qwen3.6	pp2048 (c2)	5262.23 ± 33.47	2731.41 ± 98.80	751.88 ± 27.24	751.88 ± 27.24
qwen3.6	tg128 (c2)	92.08 ± 0.04	46.63 ± 0.66
qwen3.6	pp2048 (c4)	5507.71 ± 8.14	1442.14 ± 48.08	1423.17 ± 47.16	1423.17 ± 47.16
qwen3.6	tg128 (c4)	127.24 ± 3.65	33.60 ± 1.46
qwen3.6	pp2048 (c8)	5561.15 ± 14.13	881.16 ± 252.13	2473.25 ± 526.49	2473.25 ± 526.49
qwen3.6	tg128 (c8)	172.29 ± 2.10	28.31 ± 2.56

Qwen3.5-122B-A10B-NVFP4 (no-MTP)

model	test	t/s (total)	t/s (req)	ttfr (ms)	e2e_ttft (ms)
qwen3.5	pp2048 (c1)	1813.93 ± 116.01	1813.93 ± 116.01	1135.78 ± 75.55	1135.78 ± 75.55
qwen3.5	tg128 (c1)	16.29 ± 0.02	16.29 ± 0.02
qwen3.5	pp2048 (c2)	2142.90 ± 99.84	1072.39 ± 50.00	1916.49 ± 91.87	1916.49 ± 91.87
qwen3.5	tg128 (c2)	31.74 ± 0.81	15.87 ± 0.40
qwen3.5	pp2048 (c4)	2203.46 ± 54.94	613.46 ± 63.68	3376.13 ± 336.09	3376.13 ± 336.09
qwen3.5	tg128 (c4)	48.64 ± 0.68	12.55 ± 0.41
qwen3.5	pp2048 (c8)	956.27 ± 3.37	373.08 ± 256.90	10070.72 ± 6751.63	10070.72 ± 6751.63
qwen3.5	tg128 (c8)	42.06 ± 0.50	12.18 ± 0.47

Qwen3.5-122B-A10B-NVFP4 (MTP=2)

model	test	t/s (total)	t/s (req)	ttfr (ms)	e2e_ttft (ms)
qwen3.5	pp2048 (c1)	1541.53 ± 90.61	1541.53 ± 90.61	1335.71 ± 79.84	1335.71 ± 79.84
qwen3.5	tg128 (c1)	10.44 ± 0.03	10.44 ± 0.03
qwen3.5	pp2048 (c2)	1554.01 ± 285.12	781.84 ± 140.72	2701.32 ± 438.35	2701.32 ± 438.35
qwen3.5	tg128 (c2)	18.57 ± 0.30	9.30 ± 0.13
qwen3.5	pp2048 (c4)	1706.71 ± 435.96	468.93 ± 88.40	4613.70 ± 1290.81	4613.70 ± 1290.81
qwen3.5	tg128 (c4)	29.06 ± 2.17	7.53 ± 0.47
qwen3.5	pp2048 (c8)	676.59 ± 2.91	302.44 ± 217.76	14060.28 ± 10118.51	14060.28 ± 10118.51
qwen3.5	tg128 (c8)	27.64 ± 0.11	7.62 ± 0.18

Nemotron-3-Super-120B (MTP=3)

model	test	t/s (total)	t/s (req)	ttfr (ms)	e2e_ttft (ms)
nemotron-3-super	pp2048 (c1)	1459.95 ± 31.19	1459.95 ± 31.19	1406.37 ± 30.08	1406.37 ± 30.08
nemotron-3-super	tg128 (c1)	21.12 ± 2.64	21.12 ± 2.64
nemotron-3-super	pp2048 (c2)	1507.16 ± 12.52	797.52 ± 43.66	2578.41 ± 140.78	2578.41 ± 140.78
nemotron-3-super	tg128 (c2)	36.18 ± 1.97	19.14 ± 1.41
nemotron-3-super	pp2048 (c4)	1443.43 ± 166.63	500.68 ± 201.80	4647.43 ± 1421.92	4647.43 ± 1421.92
nemotron-3-super	tg128 (c4)	41.60 ± 1.48	13.52 ± 1.67
nemotron-3-super	pp2048 (c8)	868.99 ± 5.34	316.90 ± 238.16	10839.04 ± 6589.63	10839.04 ± 6589.63
nemotron-3-super	tg128 (c8)	39.72 ± 0.69	12.49 ± 1.48

篇幅關係，輸出表格僅保留重點指標，刪除 peak t/s, peak t/s (req), est_ppt (ms) 欄位

結論
#

我個人主觀的模型速度底線是 20 token/s
低於這個速度，我會認為這個模型是慢到不可用的
雖然DGX-Spark有128GB的大容量統一記憶體
但優勢也僅有大容量而已，受限於LPDDR5x頻寬只有273GB/s
跑參數量稍微多一點的模型就慢到不行

模型	參數量	量化	MTP	單併發平均速度	四併發平均速度
qwen3.5	35B-A3B	FP8	no	49.77	27.73
qwen3.6	35B-A3B	FP8	2	59.00	33.60
qwen3.5	122B-A10B	NVFP4	no	16.29	12.55
qwen3.5	122B-A10B	NVFP4	2	10.44	7.53
nemotron-3-super	120B-A12B	NVFP4	3	21.12	13.52

這邊沒有測Qwen3.6關MTP的情況，無法斷言MTP對速度的影響，不過在Qwen3.5 122B-A10B的情況下，開啟MTP=2反而會讓速度下降不少

綜合考量下來
我會推薦以Qwen3.6 35B-A3B作為公司內的主力模型使用
然後用yarn把Context開到524288盡量用滿RAM，滿足大上下文使用需求
如果是個人用我才會推薦跑120B左右NVFP4量化的模型
在滿足複雜解題能力的情況下才不會太慢

當然，如果公司內比較需要複雜解題能力，那還是只能跑120B的模型

為什麼不測Qwen3.5/3.6 27B呢?
因為27B是Dense模型，全部的27B都是活躍參數
35B-A3B是MoE模型，每個token僅3B活躍參數
速度快的關鍵就是MoE讓每個token只有3B的參數在參與推論
如果27B全部都下去跑，速度可能只有10 token/s甚至不到
所以我就沒測試這部份了

不過，如果你家裡有一張VRAM>24GB 的GPU 如 3090/4090等
跑一個GGUF Q4量化的 Qwen3.6 27B應該是會非常強

以下是我能查到的一些記憶體頻寬數值
可以看到DGX-Spark的記憶體頻寬低的可憐 QQ

型號	記憶體類型	容量	記憶體頻寬
DGX Spark	LPDDR5x	128 GB	273 GB/s¹
RTX PRO 6000 Blackwell	GDDR7 ECC	96 GB	1,792 GB/s²
RTX 5090	GDDR7	32 GB	1,792 GB/s³
RTX 5080	GDDR7	16 GB	960 GB/s⁴
RTX 5070 Ti	GDDR7	16 GB	896 GB/s⁵
RTX 4090	GDDR6X	24 GB	1,008 GB/s⁶
RTX 3090	GDDR6X	24 GB	936 GB/s⁷

模型設定#

Qwen3.5 35B-A3B#

Qwen3.6 35B-A3B#

Qwen3.5 122B-A10B#

Nemotron-3-Super 120B#

Benchmark測試#

測試參數#

測試結果#

結論#

模型設定
#

Qwen3.5 35B-A3B
#

Qwen3.6 35B-A3B
#

Qwen3.5 122B-A10B
#

Nemotron-3-Super 120B
#

Benchmark測試
#

測試參數
#

測試結果
#

結論
#