Running the Yi-34B-Chat-int4 model on an A40 GPU on an AutoDL server, accelerated with vLLM: 42 GB of GPU memory used, throughput about 18 words/s


https://www.bilibili.com/video/BV1gu4y1c7KL/

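NVIDIA released the A40 professional GPU in 2020, with 48 GB of memory.

It is built on the GA102 GPU with 10,752 CUDA cores and is intended for servers. The A40 has three DP 1.4 outputs and 48 GB of GDDR6 memory with a maximum bandwidth of 696 GB/s. Board power is 300 W, supplied through a new 8-pin power connector.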

Environment:
CPU: 15 cores
Memory: 80 GB
GPU: 1 x NVIDIA A40

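On November 24, 01.AI officially released and open-sourced the fine-tuned model Yi-34B-Chat, which can be used commercially for free upon application. 01.AI also provides 4-bit and 8-bit quantized versions for developers; the 4-bit quantized Yi-34B-Chat can run directly on consumer GPUs such as the RTX 3090.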
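A quick back-of-the-envelope calculation shows why the 4-bit version fits on a 24 GB card such as the RTX 3090, and why a vLLM run on a 48 GB A40 can still report roughly 42 GB in use. This is only a sketch under stated assumptions: the 34 billion parameter count is a round number, quantization overhead (scales, zero points) is ignored, and the 0.9 factor assumes vLLM's default `--gpu-memory-utilization`, under which vLLM reserves most of the card up front for weights plus KV cache. The function names are made up for this illustration.

```shell
#!/usr/bin/env bash
# Rough GPU memory estimates (assumed round numbers, not measurements).

# estimate_weights_gb PARAMS_IN_BILLIONS BITS_PER_WEIGHT
# Prints the approximate size of the weights in GB (1 GB = 10^9 bytes).
estimate_weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# vllm_budget_gb TOTAL_GPU_GB UTILIZATION
# Prints how much GPU memory vLLM would reserve at the given utilization fraction.
vllm_budget_gb() {
  awk -v t="$1" -v u="$2" 'BEGIN { printf "%.1f\n", t * u }'
}

estimate_weights_gb 34 16   # fp16 weights: prints 68.0
estimate_weights_gb 34 4    # 4-bit weights: prints 17.0
vllm_budget_gb 48 0.9       # A40 at 0.9 utilization: prints 43.2
```

Under these assumptions the 4-bit weights take about 17 GB, and the rest of the roughly 43 GB budget goes to the KV cache, which is in line with the 42 GB observed here.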

apt update && apt install -y git-lfs net-tools
# Enable Git LFS, then download the model (the weight files are stored in Git LFS)
git lfs install
git clone https://www.modelscope.cn/01ai/Yi-34B-Chat-4bits.git


# 1. Install torch first so that later packages do not download it again
pip3 install torch==2.1.0

# 2. Install vllm:
pip3 install vllm

# 3. Finally, install FastChat, auto-gptq, and optimum
pip3 install "fschat[model_worker,webui]" auto-gptq optimum

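Note: the vLLM version installed here does not support GPTQ quantization. Passing gptq as the quantization method fails with:
ValueError: Unknown quantization method: gptq. Must be one of ['awq', 'squeezellm'].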
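Before choosing a value for `--quantization`, it can help to check which method the downloaded checkpoint was actually quantized with. The sketch below assumes the checkpoint follows the Hugging Face convention of recording a `quant_method` field under `quantization_config` in `config.json`; the helper name `quant_method` is hypothetical, and the grep-based parsing is deliberately simple rather than a full JSON parser.

```shell
#!/usr/bin/env bash
# quant_method MODEL_DIR
# Prints the quant_method string found in MODEL_DIR/config.json, or "none" if absent.
quant_method() {
  local cfg="$1/config.json" m
  # Pull out the first "quant_method": "<value>" pair and keep only <value>.
  m=$(grep -o '"quant_method"[[:space:]]*:[[:space:]]*"[^"]*"' "$cfg" \
        | head -n 1 | sed 's/.*"\([^"]*\)"$/\1/') || true
  echo "${m:-none}"
}

# Usage (run from the directory that contains the cloned model):
#   quant_method ./Yi-34B-Chat-4bits
```

If the printed method is not in the list from the error message, that checkpoint cannot be served by this vLLM version.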

# run_all_vllm_yi.sh

# Stop all running fastchat services (grep -v grep keeps the grep process itself out of the list)
ps -ef | grep fastchat.serve | grep -v grep | awk '{print $2}' | xargs -r kill -9
sleep 3

rm -f *.log

# Start the controller first:
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 > controller.log 2>&1 &

# Start the OpenAI-compatible API server on port 8000
nohup python3 -m fastchat.serve.openai_api_server --controller-address http://127.0.0.1:21001 \
  --host 0.0.0.0 --port 8000 > api_server.log 2>&1 &

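# Start the web UI. Port 8000 is already taken by the API server above, so a different
# port is needed; 6006 is an assumption here (the port AutoDL exposes for custom services).
nohup python3 -m fastchat.serve.gradio_web_server --controller-address http://127.0.0.1:21001 \
  --host 0.0.0.0 --port 6006 > web_server.log 2>&1 &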


# Then start the model worker. Note: the address must be the local IP. --load-8bit is not needed, since the model itself is already int4.
# nohup python3 -m fastchat.serve.model_worker --model-names yi-34b \
#   --model-path ./Yi-34B-Chat-8bits --controller-address http://${IP_ADDR}:21001 \
#   --worker-address http://${IP_ADDR}:8080 --host 0.0.0.0 --port 8080 > model_worker.log 2>&1 &

# Start the vLLM worker with AWQ quantization
nohup python3 -m fastchat.serve.vllm_worker --quantization awq --model-names yi-34b \
  --model-path ./Yi-34B-Chat-4bits --controller-address http://127.0.0.1:21001 \
  --worker-address http://127.0.0.1:8080 --host 0.0.0.0 --port 8080 > model_worker.log 2>&1 &
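Because the worker is launched in the background with nohup, the script returns immediately while the 34B model is still loading, and a request sent too early will fail. One possible way to handle this is to poll the worker port before continuing. The sketch below is an assumption-laden helper, not part of FastChat: `wait_for_port` is a hypothetical name, it relies on bash's `/dev/tcp` feature, and an open port is only a proxy for the model being ready.

```shell
#!/usr/bin/env bash
# wait_for_port HOST PORT [TIMEOUT_SECONDS]
# Polls once per second until something accepts TCP connections on HOST:PORT.
# Returns 0 when the port is open, 1 when the timeout (default 300 s) is reached.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-300}" waited=0
  # The redirection succeeds only if a listener accepts the connection (bash feature).
  until (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for ${host}:${port}" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "${host}:${port} is up"
}

# Usage: wait up to 10 minutes for the worker on port 8080, then follow the log if it never came up.
#   wait_for_port 127.0.0.1 8080 600 || tail -n 50 model_worker.log
```

If the wait times out, model_worker.log is the first place to look, since startup errors such as the quantization error above end up there.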

python3 -m fastchat.serve.test_throughput --controller-address http://127.0.0.1:21001 --model-name yi-34b --n-thread 1

throughput: 18.678158839922936 words/s.

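# Query the OpenAI-compatible API started above on port 8000, using the model name registered by the worker.
# The prompt asks for Beijing attractions, answered in Chinese.
curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
     "model": "yi-34b",
     "messages": [{"role": "user", "content": "北京景点，使用中文回答"}],
     "temperature": 0.7
   }'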
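When trying several prompts, hand-editing the JSON body gets error-prone. A small helper can build the request body from a model name and a prompt. This is a minimal sketch: `chat_payload` is a hypothetical name, it only escapes backslashes and double quotes (prompts containing newlines or control characters would need a real JSON encoder), and the temperature is fixed at the 0.7 used above.

```shell
#!/usr/bin/env bash
# chat_payload MODEL PROMPT
# Prints a chat-completions request body for a single user message.
chat_payload() {
  local model="$1" prompt
  # Escape backslashes and double quotes so the prompt stays valid inside a JSON string.
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "temperature": 0.7}\n' \
    "$model" "$prompt"
}

# Usage, assuming the API server listens on port 8000 and the worker registered yi-34b:
#   curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" \
#     -d "$(chat_payload yi-34b '北京景点，使用中文回答')"
```

The reply text is found under choices[0].message.content in the JSON response, following the OpenAI chat-completions format.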