Model Basics
China-based model hub:
https://www.modelscope.cn/
Reference:
HuggingFace Quick Start
https://blog.csdn.net/chengxuyuanyy/article/details/140059298
Understanding Models
ChatGLM-6B
6 billion parameters
1 parameter = 4 bytes = 32 bits at full precision
32-bit: full precision
16-bit: half precision
8-bit: 8-bit quantization
4-bit: 4-bit quantization
Instruction-Tuned Variants
Names carry the "Instruct" suffix
The main difference between Qwen2-72B-Instruct and Qwen2-72B is that Qwen2-72B-Instruct has been instruction-tuned, while Qwen2-72B is the base language model. Qwen2-72B-Instruct is optimized for instruction following and alignment with human values, so it handles instructions and human intent better.
During instruction tuning, Qwen2-72B-Instruct was trained on large amounts of instruction data, teaching it to understand and carry out a wide variety of instructions. This makes it strong at complex tasks and instruction following.
Qwen2-72B is the base language model, without instruction tuning. It performs well at natural language understanding, knowledge reasoning, and multilingual tasks, but may lag Qwen2-72B-Instruct at instruction following.
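As a minimal sketch of what "instruction-tuned" means in practice (assuming the transformers library; the 7B sibling is used here purely for illustration), an Instruct checkpoint expects its chat template, while a base model takes raw text:

```python
# Sketch: Instruct models expect a chat-templated prompt; base models take raw text.
# Assumes `transformers` is installed and the tokenizer can be downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the special-token-wrapped prompt the tuned model was trained on
```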
Model VRAM
VRAM (GB) ≈ 1.2 * (parameters in billions * 4) / (32 / quantization bits), where 1.2 is a rule-of-thumb overhead factor
7B (4-bit quantized):  1.2 * (7 * 4) / (32/4)  = 4.2 GB
14B (4-bit quantized): 1.2 * (14 * 4) / (32/4) = 8.4 GB
32B (4-bit quantized): 1.2 * (32 * 4) / (32/4) = 19.2 GB
72B (4-bit quantized): 1.2 * (72 * 4) / (32/4) = 43.2 GB
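The formula drops straight into a small helper for quick estimates (a sketch; the 1.2 overhead factor is the rule of thumb above, not a measured constant):

```python
# Estimate VRAM (GB) needed to load a model at a given quantization width.
def estimate_vram_gb(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    # 4 bytes (32 bits) per parameter at full precision, shrunk by 32/quant_bits.
    return overhead * (params_billions * 4) / (32 / quant_bits)

for size in (7, 14, 32, 72):
    print(f"{size}B @ 4-bit: {estimate_vram_gb(size, 4):.1f} GB")
```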
Quantization
Quantization methods generally fall into three categories:
1. Weight quantization: convert model weights from floating point to low-bit-width integers.
2. Activation quantization: during inference, convert intermediate activations to a low-bit-width representation.
3. Full-model quantization: quantize both weights and activations.
The commonly cited "4-bit quantization", "16-bit quantization", and so on refer to representing a model's weights and activations with low-precision 4-bit or 16-bit values instead of high-precision floats. This markedly cuts storage and compute while preserving model quality as much as possible.
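A numpy sketch of the simplest form, symmetric per-tensor weight quantization, illustrates the idea (this is not the GPTQ or AWQ algorithm, both of which choose scales far more carefully):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map floats into [-127, 127] with one shared scale, then round to int8.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate floats at use time; the rounding error is the cost.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```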
GPTQ quantization
AWQ quantization
Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Qwen/Qwen2.5-7B-Instruct-AWQ
Model Download
Full clone via git (requires the git-lfs extension)
git lfs install
git clone https://hf-mirror.com/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Clone with LFS files skipped, then download the LFS files manually and drop them in
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
huggingface-cli
Install huggingface_hub
pip install -U huggingface_hub
Set an environment variable so downloads default to the China mirror https://hf-mirror.com
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --local-dir Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --local-dir-use-symlinks False
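The same download can also be done programmatically (a sketch using huggingface_hub's snapshot_download; the local_dir is a hypothetical destination path):

```python
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # set before the import so the mirror takes effect

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    local_dir="models/Qwen2.5-7B-Instruct-GPTQ-Int4",  # hypothetical destination
)
```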
Model Usage
Inference
ollama (runs on both GPU and CPU)
ollama run qwen2.5:7b
https://ollama.com/blog/openai-compatibility
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true,
    "temperature": 0.01
  }'
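Because ollama speaks the OpenAI wire format, the official openai Python client works against it too (a sketch; the api_key is a required placeholder that ollama ignores):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is unused

response = client.chat.completions.create(
    model="qwen2.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)
```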
https://ollama.com/blog/embedding-models
curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "Llamas are members of the camelid family"
}'
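The same embeddings call from Python, for completeness (a sketch with requests against ollama's native /api/embeddings route):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "mxbai-embed-large", "prompt": "Llamas are members of the camelid family"},
)
embedding = resp.json()["embedding"]  # one vector of floats
print(len(embedding))
```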
ollama extensions: converting models and running your own
ollama show --modelfile llama2:13b
Converting large models to GGUF and running them with ollama
https://zhuanlan.zhihu.com/p/715841536
Ollama Modelfile official documentation
https://blog.csdn.net/Chaos_Happy/article/details/138276172
Model file formats: .safetensors, .ckpt, .gguf, .pth, .bin
https://www.cnblogs.com/qcy-blog/p/18195616
vllm (requires a GPU; pre-allocates VRAM on startup)
python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2-7B-Instruct-AWQ --served-model-name=Qwen2-7B-Instruct-AWQ --dtype=half --tensor-parallel-size=1 --quantization=awq --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --served-model-name=Qwen2-7B-Instruct-GPTQ-Int4 --dtype=half --tensor-parallel-size=1 --quantization=gptq --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2-7B-Instruct --served-model-name=Qwen2-7B-Instruct --dtype=half --tensor-parallel-size=1 --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
--host
Host address
--port
Port
--model
Path of the model to load
--trust-remote-code
Allow the model to load remote code from Hugging Face
--tensor-parallel-size
Number of GPUs to use; as written here it covers single-node multi-GPU. Multi-node multi-GPU requires configuring ray first.
--pipeline-parallel-size
Multi-node option; usable for multi-node multi-GPU, although you can also just give --tensor-parallel-size enough GPUs.
Multi-node multi-GPU documentation to follow.
--served-model-name
Model name used when calling the API remotely
--device
Device to run on; usually cuda
--dtype
Data type of model weights and activations; bfloat16 is common, and it defaults to auto when unspecified
--max-model-len
Context length the model handles. If unspecified, it is derived from the model config.
--gpu-memory-utilization
Fraction of GPU memory to reserve; defaults to 0.9 when unspecified. vLLM holds this fraction steadily and it does not vary with model size.
--enable-prefix-caching
Enable automatic prefix caching, which avoids recomputing identical prefixes across requests
--enforce-eager
Force eager mode so every operation executes immediately. Defaults to False, in which case vLLM mixes eager mode and CUDA graphs for maximum performance and flexibility.
https://docs.vllm.ai/en/latest/getting_started/quickstart.html
https://blog.csdn.net/spicy_chicken123/article/details/135813924
https://blog.csdn.net/baiyipiao/article/details/141930442
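Besides the OpenAI-compatible server, vLLM also supports offline batch inference from Python (a minimal sketch reusing the AWQ model path from the commands above):

```python
from vllm import LLM, SamplingParams

# Load once, then batch-generate; vLLM pre-allocates GPU memory at this point.
llm = LLM(model="/home/jovyan/models/Qwen/Qwen2-7B-Instruct-AWQ",
          quantization="awq", dtype="half", max_model_len=4000)
params = SamplingParams(temperature=0.01, max_tokens=128)

for out in llm.generate(["Hello!"], params):
    print(out.outputs[0].text)
```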
Xinference
XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 3014
https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html
Training (Fine-Tuning)
llama factory
LlamaFactory visual fine-tuning of large models: parameter reference
https://baijiahao.baidu.com/s?id=1804161804962042559&wfr=spider&for=pc
Unsloth
Local APIs
OpenAI-style interface
Chat Completions: /v1/chat/completions
Completions (text and code): /v1/completions
Embeddings: /v1/embeddings
Reference: a long-form guide covering everything you should know about OpenAI API development
https://blog.csdn.net/2401_85325557/article/details/140202624
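Since every local server above (ollama, vLLM, Xinference) exposes these same routes, one client covers them all (a sketch; the base_url and model names are placeholders for whatever the running server actually hosts):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3015/v1", api_key="not-needed")  # placeholder server

chat = client.chat.completions.create(
    model="Qwen2-7B-Instruct",  # must match the server's --served-model-name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# /v1/embeddings follows the same pattern, against a server hosting an embedding model:
# client.embeddings.create(model="m3e-small", input=["Hello!"])
```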
Agent Platforms
fastgpt (knowledge bases)
https://doc.tryfastgpt.ai/docs/intro/
https://github.com/labring/FastGPT
dify (workflows)
https://github.com/langgenius/dify
https://docs.dify.ai/
bisheng (workflows)
https://dataelem.feishu.cn/wiki/ZxW6wZyAJicX4WkG0NqcWsbynde
https://github.com/dataelement/bisheng
oneapi (OpenAI API aggregation gateway)
https://github.com/songquanpeng/one-api
AI Development Basics
Prompt engineering
LangchainJS
OpenAI API
Python environment
conda environment setup
conda env list
conda activate xinference
Miscellaneous
curl http://10.19.93.53:3015/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-72B-Instruct-GPTQ-Int4",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
(vllm) jovyan@48d5f2c75bf5:~$ python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 --served-model-name=Qwen2-72B-Instruct-GPTQ-Int4 --dtype=half --tensor-parallel-size=1 --quantization=gptq --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
curl -X 'POST' \
  'http://10.19.93.53:3014/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2-instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the largest animal?"
      }
    ]
  }'
curl http://10.19.93.53:3014/v1/embeddings \
  -H "Content-Type: application/json" -d '{
    "model": "m3e-small",
    "input": [
      "LlamaEdge is the easiest & fastest way to run customized and fine-tuned LLMs locally or on the edge."
    ]
  }'
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" -d '{
    "model": "m3e-small",
    "input": [
      "LlamaEdge is the easiest & fastest way to run customized and fine-tuned LLMs locally or on the edge."
    ]
  }'
oneapi
curl http://10.19.195.148:30001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-6Ub0NErqby31pYV9Ac65CcAc678248Cd981eCaB90b87BaEe" \
  -d '{
    "model": "qwen2-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "你是谁"}
    ]
  }'
curl http://10.19.93.53:3014/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "你是谁"}
    ]
  }'
- Title: Model Basics
- Author: 菇太帷i
- Created: 2024-10-08 08:49:00
- Updated: 2025-09-18 06:39:53
- Link: https://blog.gutawei.com/2024/10/08/Technology Stack/模型基础知识/
- Copyright: This article is licensed under CC BY-NC-SA 4.0.