Gemma4 31B (GGUF) Deployment Guide (containerd + llama.cpp, CPU only)

🧩 Environment

OS: CentOS 7
Runtime: containerd (no Docker)
Model format: GGUF
Inference: llama.cpp (container image)

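📦 1. Prepare the model

Download the GGUF model files to a local directory, for example:

/home/mnt/ai/data/models/gemma-4-31B-it/

Example directory contents:

gemma-q4.gguf   # recommended (usable on CPU)
gemma-q8.gguf   # larger and slower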
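Before starting the server, it can be worth checking that a downloaded file really is a GGUF file, since something like an HTML error page saved under a .gguf name only fails later at load time. GGUF files begin with the four ASCII bytes `GGUF`. The sketch below checks for that magic; `is_gguf` is a hypothetical helper name, not part of llama.cpp.

```shell
# is_gguf FILE: succeed if FILE starts with the 4-byte ASCII magic "GGUF".
# This only inspects the header; it does not validate the rest of the file.
is_gguf() {
  [ "$(head -c 4 -- "$1" 2>/dev/null)" = "GGUF" ]
}

# Example (commented out so the snippet has no side effects when sourced):
# is_gguf /home/mnt/ai/data/models/gemma-4-31B-it/gemma-q4.gguf \
#   && echo "looks like GGUF" || echo "not a GGUF file"
```

A passing check does not prove the download is complete, so comparing the file size or checksum against the source is still advisable.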

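🐳 2. Pull the image

ctr images pull ghcr.io/ggml-org/llama.cpp:full

⚠️ Note:

Use the :full tag, not :server.
The rest of this guide assumes :full; step 3 confirms that it ships llama-server at /app/llama-server.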

🔍 3. Confirm the executable path (important)

ctr run --rm -t ghcr.io/ggml-org/llama.cpp:full debug \
/bin/sh -c "find / -type f -executable 2>/dev/null | grep llama-server"

Output:

/app/llama-server

👉 This is the path to use when starting the server.

🚀 4. Start the service

ctr run --rm -t \
--mount type=bind,src=/home/mnt/ai/data/models,dst=/models,options=rbind:rw \
ghcr.io/ggml-org/llama.cpp:full llama \
/app/llama-server \
-m /models/gemma-4-31B-it/gemma-q4.gguf \
-t 48 \
-c 4096 \
--host 0.0.0.0 \
--port 8080 \
--numa distribute \
--no-warmup

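⏳ 5. What to expect during startup

Normal log output

load_tensors: loading model tensors...

👉 This stage lasts from tens of seconds to a few minutes, which is normal.

Sign of success

main: server is listening on http://0.0.0.0:8080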
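Rather than watching the log, a script can poll until the server answers. The sketch below is a generic retry helper (`wait_until` is a hypothetical name). The commented usage assumes llama-server's `/health` endpoint, which returns an error status while the model is loading and 200 once it is ready.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-run COMMAND once per second until it succeeds or the timeout expires.
# Returns 0 on success, 1 on timeout.
wait_until() {
  timeout_s=$1
  shift
  deadline=$(( $(date +%s) + timeout_s ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Example (commented out so the snippet has no side effects when sourced):
# wait_until 600 curl -fsS http://127.0.0.1:8080/health \
#   && echo "server ready" || echo "timed out waiting for server"
```

If the request is refused even though the log shows the server listening, container networking is one thing to check: `ctr run` accepts a `--net-host` flag that shares the host's network stack with the container.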

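🌐 6. Calling the API (important)

❌ Wrong usage (produces garbled output)

/completion

The output looks like this:

own own own ...

Cause: /completion sends the prompt as raw text, so the model's chat template is never applied.

✅ Correct usage (required)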

curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "system", "content": "Output the result directly, without explanation"},
    {"role": "user", "content": "Write a seven-character regulated verse (qilu) about the Qingming Festival"}
  ],
  "max_tokens": 200,
  "temperature": 0.7,
  "top_p": 0.9,
  "reasoning_format": "none"
}'

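A request to the server's LAN address, with reasoning_in_content enabled:

curl http://192.168.1.121:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Write a seven-character regulated verse (qilu) about the Sanyuesan Festival in Guangxi"}
  ],
  "max_tokens": 300,
  "reasoning_in_content": true
}'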

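The same system-plus-user request as the first example, sent to the LAN address:

curl http://192.168.1.121:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "system", "content": "Output the result directly, without explanation"},
    {"role": "user", "content": "Write a seven-character regulated verse (qilu) about the Qingming Festival"}
  ],
  "max_tokens": 200,
  "temperature": 0.7,
  "top_p": 0.9,
  "reasoning_format": "none"
}'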
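Hand-writing the JSON body for every request is error-prone, as the quoting in the examples shows. The sketch below builds a single-message body from shell arguments. `json_escape` and `chat_payload` are hypothetical helper names, and the escaping covers only backslashes and double quotes, so it assumes prompts without newlines or other control characters.

```shell
# json_escape TEXT: escape backslashes and double quotes for use inside a
# JSON string. Newlines and other control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload PROMPT [MAX_TOKENS]: print a chat-completions request body
# with one user message. MAX_TOKENS defaults to 200.
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d,"reasoning_format":"none"}\n' \
    "$(json_escape "$1")" "${2:-200}"
}

# Example (commented out so the snippet has no side effects when sourced):
# curl -sS http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload "Write a short poem about spring" 200)"
```

For prompts with arbitrary content, a real JSON encoder is a safer choice than sed-based escaping.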

⚠️ 7. Common problems

1️⃣ Output is `own own own`

Cause:

The /completion endpoint was used.
The prompt was not formatted with the chat template.

Fix:

Use /v1/chat/completions.

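2️⃣ content is empty

Symptom:

"content": "",
"reasoning_content": "..."

Cause:

The model is "thinking".
The reasoning uses up the token budget (max_tokens) before any answer is written.

Fix: add this field to the request body:

"reasoning_format": "none"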

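3️⃣ CPU is fully loaded

👉 This is normal; inference is compute-intensive.

4️⃣ Generation is slow

Observed throughput:

~2.5 - 3 tokens/s (31B + CPU)

👉 This is normal for this setup.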
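The throughput figure translates directly into expected response times, which helps when choosing max_tokens or a client timeout. A small sketch of the arithmetic, where `eta_seconds` is a hypothetical helper:

```shell
# eta_seconds TOKENS TOKENS_PER_SECOND: print the estimated generation time
# in whole seconds (rounded). Prompt processing time is not included.
eta_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

# Example (commented out so the snippet has no side effects when sourced):
# eta_seconds 200 2.5   # prints 80
# eta_seconds 200 3     # prints 67
```

At the rates above, a 200-token reply takes roughly 67 to 80 seconds to generate, so client timeouts shorter than that will cut responses off.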

⚙️ 8. Suggested parameters (CPU tuning)

-t 32 ~ 64       # thread count
-c 2048 ~ 4096   # context size
--numa distribute
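One way to choose -t within the suggested range is to start from the machine's CPU count and cap it at the upper bound. The sketch below assumes that policy, which is an illustration and not a rule from llama.cpp. `pick_threads` is a hypothetical helper, and `nproc` reports logical CPUs, which can exceed the physical core count on hyper-threaded machines.

```shell
# pick_threads CORES MAX: print the smaller of the two values.
pick_threads() {
  cores=$1
  max=$2
  if [ "$cores" -lt "$max" ]; then
    echo "$cores"
  else
    echo "$max"
  fi
}

# Example (commented out so the snippet has no side effects when sourced):
# THREADS=$(pick_threads "$(nproc)" 64)
# echo "would pass: -t $THREADS -c 4096 --numa distribute"
```

Measuring tokens per second at a few thread counts is still the most reliable way to settle on a value.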

🎯 9. Summary

✔ The model runs correctly
✔ The API service works
✔ Using the chat endpoint is the key point
✔ CPU-only performance is limited but usable
