Model Basics
China-based model hub:
https://www.modelscope.cn/
Reference:
HuggingFace Quick Start
https://blog.csdn.net/chengxuyuanyy/article/details/140059298
Understanding Models
ChatGLM-6B
6 billion parameters
1 parameter = 4 bytes = 32 bits at full precision
32-bit: full precision
16-bit: half precision
8-bit: 8-bit quantization
4-bit: 4-bit quantization
Instruction-Tuned Variants
Names carry the "Instruct" suffix
The main difference between Qwen2-72B-Instruct and Qwen2-72B is that Qwen2-72B-Instruct has been instruction-tuned, while Qwen2-72B is the base language model. Qwen2-72B-Instruct is optimized for instruction following and alignment with human values, so it handles instructions and human intent better.
During instruction tuning, Qwen2-72B-Instruct was trained on large amounts of instruction data, teaching it to understand and carry out a wide variety of instructions. This makes it strong at complex tasks and instruction following.
Qwen2-72B is the base language model, without instruction tuning. It performs well at natural language understanding, knowledge reasoning, and multilingual tasks, but may lag Qwen2-72B-Instruct at instruction following.
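As a minimal sketch of what "instruction-tuned" means in practice (assuming the transformers library; the 7B sibling is used here purely for illustration), an Instruct checkpoint expects its chat template, while a base model takes raw text:

```python
# Sketch: Instruct models expect a chat-templated prompt; base models take raw text.
# Assumes `transformers` is installed and the tokenizer can be downloaded.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the special-token-wrapped prompt the tuned model was trained on
```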
Model VRAM
VRAM (GB) ≈ 1.2 * (parameters in billions * 4) / (32 / quantization bits), where 1.2 is a rule-of-thumb overhead factor
7B (4-bit quantized):  1.2 * (7 * 4) / (32/4)  = 4.2 GB
14B (4-bit quantized): 1.2 * (14 * 4) / (32/4) = 8.4 GB
32B (4-bit quantized): 1.2 * (32 * 4) / (32/4) = 19.2 GB
72B (4-bit quantized): 1.2 * (72 * 4) / (32/4) = 43.2 GB
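The formula drops straight into a small helper for quick estimates (a sketch; the 1.2 overhead factor is the rule of thumb above, not a measured constant):

```python
# Estimate VRAM (GB) needed to load a model at a given quantization width.
def estimate_vram_gb(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    # 4 bytes (32 bits) per parameter at full precision, shrunk by 32/quant_bits.
    return overhead * (params_billions * 4) / (32 / quant_bits)

for size in (7, 14, 32, 72):
    print(f"{size}B @ 4-bit: {estimate_vram_gb(size, 4):.1f} GB")
```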
Quantization
Quantization methods generally fall into three categories:
1. Weight quantization: convert model weights from floating point to low-bit-width integers.
2. Activation quantization: during inference, convert intermediate activations to a low-bit-width representation.
3. Full-model quantization: quantize both weights and activations.
The commonly cited "4-bit quantization", "16-bit quantization", and so on refer to representing a model's weights and activations with low-precision 4-bit or 16-bit values instead of high-precision floats. This markedly cuts storage and compute while preserving model quality as much as possible.
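A numpy sketch of the simplest form, symmetric per-tensor weight quantization, illustrates the idea (this is not the GPTQ or AWQ algorithm, both of which choose scales far more carefully):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    # Map floats into [-127, 127] with one shared scale, then round to int8.
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate floats at use time; the rounding error is the cost.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```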
GPTQ quantization
AWQ quantization
Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Qwen/Qwen2.5-7B-Instruct-AWQ
Model Download
Full clone via git (requires the git-lfs extension)
git lfs install
git clone https://hf-mirror.com/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
Clone with LFS files skipped, then download the LFS files manually and drop them in
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
huggingface-cli
Install huggingface_hub
pip install -U huggingface_hub
Set an environment variable so downloads default to the China mirror https://hf-mirror.com
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --local-dir Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --local-dir-use-symlinks False
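The same download can also be done programmatically (a sketch using huggingface_hub's snapshot_download; the local_dir is a hypothetical destination path):

```python
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # set before the import so the mirror takes effect

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    local_dir="models/Qwen2.5-7B-Instruct-GPTQ-Int4",  # hypothetical destination
)
```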
Model Usage
Inference
ollama (runs on both GPU and CPU)
ollama run qwen2.5:7b
https://ollama.com/blog/openai-compatibility
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true,
    "temperature": 0.01
  }'
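Because ollama speaks the OpenAI wire format, the official openai Python client works against it too (a sketch; the api_key is a required placeholder that ollama ignores):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is unused

response = client.chat.completions.create(
    model="qwen2.5",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)
```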
https://ollama.com/blog/embedding-models
curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "Llamas are members of the camelid family"
}'
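The same embeddings call from Python, for completeness (a sketch with requests against ollama's native /api/embeddings route):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "mxbai-embed-large", "prompt": "Llamas are members of the camelid family"},
)
embedding = resp.json()["embedding"]  # one vector of floats
print(len(embedding))
```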
ollama extensions: converting models and running your own
ollama show --modelfile llama2:13b
Converting large models to GGUF and running them with ollama
https://zhuanlan.zhihu.com/p/715841536
Ollama Modelfile official documentation
https://blog.csdn.net/Chaos_Happy/article/details/138276172
Model file formats: .safetensors, .ckpt, .gguf, .pth, .bin
https://www.cnblogs.com/qcy-blog/p/18195616
vllm (requires a GPU; pre-allocates VRAM on startup)
python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2-7B-Instruct-AWQ --served-model-name=Qwen2-7B-Instruct-AWQ --dtype=half --tensor-parallel-size=1 --quantization=awq --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2-7B-Instruct-GPTQ-Int4 --served-model-name=Qwen2-7B-Instruct-GPTQ-Int4 --dtype=half --tensor-parallel-size=1 --quantization=gptq --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2-7B-Instruct --served-model-name=Qwen2-7B-Instruct --dtype=half --tensor-parallel-size=1 --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
--host
Host address
--port
Port
--model
Path of the model to load
--trust-remote-code
Allow the model to load remote code from Hugging Face
--tensor-parallel-size
Number of GPUs to use; as written here it covers single-node multi-GPU. Multi-node multi-GPU requires configuring ray first.
--pipeline-parallel-size
Multi-node option; usable for multi-node multi-GPU, although you can also just give --tensor-parallel-size enough GPUs.
Multi-node multi-GPU documentation to follow.
--served-model-name
Model name used when calling the API remotely
--device
Device to run on; usually cuda
--dtype
Data type of model weights and activations; bfloat16 is common, and it defaults to auto when unspecified
--max-model-len
Context length the model handles. If unspecified, it is derived from the model config.
--gpu-memory-utilization
Fraction of GPU memory to reserve; defaults to 0.9 when unspecified. vLLM holds this fraction steadily and it does not vary with model size.
--enable-prefix-caching
Enable automatic prefix caching, which avoids recomputing identical prefixes across requests
--enforce-eager
Force eager mode so every operation executes immediately. Defaults to False, in which case vLLM mixes eager mode and CUDA graphs for maximum performance and flexibility.
https://docs.vllm.ai/en/latest/getting_started/quickstart.html
https://blog.csdn.net/spicy_chicken123/article/details/135813924
https://blog.csdn.net/baiyipiao/article/details/141930442
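Besides the OpenAI-compatible server, vLLM also supports offline batch inference from Python (a minimal sketch reusing the AWQ model path from the commands above):

```python
from vllm import LLM, SamplingParams

# Load once, then batch-generate; vLLM pre-allocates GPU memory at this point.
llm = LLM(model="/home/jovyan/models/Qwen/Qwen2-7B-Instruct-AWQ",
          quantization="awq", dtype="half", max_model_len=4000)
params = SamplingParams(temperature=0.01, max_tokens=128)

for out in llm.generate(["Hello!"], params):
    print(out.outputs[0].text)
```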
Xinference
XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 3014
https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html
Training (Fine-Tuning)
llama factory
LlamaFactory visual fine-tuning of large models: parameter reference
https://baijiahao.baidu.com/s?id=1804161804962042559&wfr=spider&for=pc
Unsloth
Local APIs
OpenAI-style interface
Chat Completions: /v1/chat/completions
Completions (text and code): /v1/completions
Embeddings: /v1/embeddings
Reference: a long-form guide covering everything you should know about OpenAI API development
https://blog.csdn.net/2401_85325557/article/details/140202624
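Since every local server above (ollama, vLLM, Xinference) exposes these same routes, one client covers them all (a sketch; the base_url and model names are placeholders for whatever the running server actually hosts):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3015/v1", api_key="not-needed")  # placeholder server

chat = client.chat.completions.create(
    model="Qwen2-7B-Instruct",  # must match the server's --served-model-name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)

# /v1/embeddings follows the same pattern, against a server hosting an embedding model:
# client.embeddings.create(model="m3e-small", input=["Hello!"])
```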
Agent Platforms
fastgpt (knowledge bases)
https://doc.tryfastgpt.ai/docs/intro/
https://github.com/labring/FastGPT
dify (workflows)
https://github.com/langgenius/dify
https://docs.dify.ai/
bisheng (workflows)
https://dataelem.feishu.cn/wiki/ZxW6wZyAJicX4WkG0NqcWsbynde
https://github.com/dataelement/bisheng
oneapi (OpenAI API aggregation gateway)
https://github.com/songquanpeng/one-api
AI Development Basics
Prompt engineering
LangchainJS
OpenAI API
Python environment
conda environment setup
conda env list
conda activate xinference
Miscellaneous
curl http://10.19.93.53:3015/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-72B-Instruct-GPTQ-Int4",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
(vllm) jovyan@48d5f2c75bf5:~$ python3 -m vllm.entrypoints.openai.api_server --model=/home/jovyan/models/Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 --served-model-name=Qwen2-72B-Instruct-GPTQ-Int4 --dtype=half --tensor-parallel-size=1 --quantization=gptq --trust-remote-code --gpu-memory-utilization=0.9 --host=0.0.0.0 --port=3015 --max-model-len=4000 --max-num-seqs 1
curl -X 'POST' \
  'http://10.19.93.53:3014/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2-instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the largest animal?"
      }
    ]
  }'
curl http://10.19.93.53:3014/v1/embeddings \
  -H "Content-Type: application/json" -d '{
    "model": "m3e-small",
    "input": [
      "LlamaEdge is the easiest & fastest way to run customized and fine-tuned LLMs locally or on the edge."
    ]
  }'
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" -d '{
    "model": "m3e-small",
    "input": [
      "LlamaEdge is the easiest & fastest way to run customized and fine-tuned LLMs locally or on the edge."
    ]
  }'
oneapi
curl http://10.19.195.148:30001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-6Ub0NErqby31pYV9Ac65CcAc678248Cd981eCaB90b87BaEe" \
  -d '{
    "model": "qwen2-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "你是谁"}
    ]
  }'
curl http://10.19.93.53:3014/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "你是谁"}
    ]
  }'
- Title: Model Basics
- Author: 菇太帷i
- Created: 2024-10-08 08:49:00
- Updated: 2025-09-18 06:39:53
- Link: https://blog.gutawei.com/2024/10/08/Technology Stack/模型基础知识/
- Copyright: This article is licensed under CC BY-NC-SA 4.0.