DeepSeek LLM: An Introduction to an Advanced Language Model and Its Applications

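I. Technical Background

DeepSeek LLM is an advanced language model with 67 billion parameters, trained from scratch on a dataset of 2 trillion tokens in English and Chinese. To support research, the DeepSeek LLM 7B/67B Base and DeepSeek LLM 7B/67B Chat models have been open-sourced for the research community. The model performs strongly in reasoning, coding, mathematics, and Chinese comprehension, and shows strong general capabilities.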

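II. Implementation Steps

(1) Model Download

The base and chat models of DeepSeek LLM 7B/67B can be downloaded from Hugging Face. Intermediate checkpoints from base-model training are also hosted on AWS S3 and can be downloaded with the AWS CLI.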

# DeepSeek-LLM-7B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-7B-Base <local_path> --recursive --request-payer

# DeepSeek-LLM-67B-Base
aws s3 cp s3://deepseek-ai/DeepSeek-LLM/DeepSeek-LLM-67B-Base <local_path> --recursive --request-payer
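
The Hugging Face download mentioned above can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the 7B repositories follow the same naming pattern as the 67B model names used later in this article. The helper names `repo_id_for` and `download_model` are hypothetical and only serve the illustration.

```python
def repo_id_for(size, variant):
    """Build a Hugging Face repo ID such as 'deepseek-ai/deepseek-llm-67b-base'.

    Assumption: all four models follow the naming pattern of the 67B
    model names that appear in this article.
    """
    size = size.lower()
    variant = variant.lower()
    if size not in ("7b", "67b"):
        raise ValueError(f"unknown model size: {size}")
    if variant not in ("base", "chat"):
        raise ValueError(f"unknown model variant: {variant}")
    return f"deepseek-ai/deepseek-llm-{size}-{variant}"


def download_model(size, variant, local_dir):
    """Download every file of the chosen model into local_dir."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(size, variant), local_dir=local_dir)


if __name__ == "__main__":
    print(repo_id_for("7B", "Chat"))
    # download_model("7b", "chat", "./deepseek-llm-7b-chat")  # needs network access and disk space
```

Calling `from_pretrained` with a model name, as in the inference examples, downloads the same files into the local cache automatically, so an explicit download is mainly useful when the files need to live in a specific directory.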

(2) Environment Setup

With Python >= 3.8, run the following command to install the required dependencies:

`pip install -r requirements.txt`
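
Before installing, it can help to confirm that the interpreter meets the version requirement. This is a small sketch using only the standard library. The function name `meets_minimum` is hypothetical.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this article


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not meets_minimum():
        raise SystemExit("Python >= 3.8 is required")
    print("Python version OK:", sys.version.split()[0])
```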

(3) Model Inference

1. Inference with Hugging Face Transformers

Text completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-llm-67b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Chat completion

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Who are you?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
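
In the chat example above, `generate` returns the prompt tokens followed by the newly generated tokens, which is why the decode step slices the output at `input_tensor.shape[1]`. The sketch below shows the same slicing on plain Python lists. The token IDs are made up for illustration and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens that were generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up token IDs standing in for a tokenized chat prompt.
prompt_ids = [11, 22, 33, 44]
# A causal LM's output sequence starts with the prompt it was given.
generated_ids = prompt_ids + [55, 66, 77]

new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)  # [55, 66, 77]
```

Without the slice, the decoded string would begin with the rendered prompt and the reply would follow it.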

2. High-throughput inference with vLLM

Text completion

from vllm import LLM, SamplingParams

tp_size = 4 # Tensor Parallelism
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
model_name = "deepseek-ai/deepseek-llm-67b-base"
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)

prompts = [
    "If everyone in a country loves one another,",
    "The research should also focus on the technologies",
    "To determine if the label is correct, we need to"
]
outputs = llm.generate(prompts, sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

Chat completion

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tp_size = 4 # Tensor Parallelism
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)
model_name = "deepseek-ai/deepseek-llm-67b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, trust_remote_code=True, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "What can you do?"}],
    [{"role": "user", "content": "Explain Transformer briefly."}],
]
# Avoid adding bos_token repeatedly
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

sampling_params.stop = [tokenizer.eos_token]
outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

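III. Core Code Explained

(1) Model Inference Code

With Hugging Face Transformers, the tokenizer and model are loaded through AutoTokenizer and AutoModelForCausalLM, the generation config is set, and the model then performs text or chat completion.
With vLLM, the model is loaded through the LLM class, sampling parameters are set, and the engine then performs high-throughput text or chat completion.

(2) Data Processing Code

In chat completion, tokenizer.apply_chat_template processes the input messages and converts them into an input format the model accepts.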
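
To make the role of the chat template more concrete, here is a simplified sketch of what such a template does: it flattens a list of role/content messages into a single prompt string. The format used here is an illustrative assumption, not the model's actual template. The real template is stored with the tokenizer, also handles special tokens, and returns token IDs unless asked otherwise. The function name `render_chat` is hypothetical.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten chat messages into one prompt string (illustrative format only)."""
    role_prefix = {"user": "User: ", "assistant": "Assistant: "}
    parts = [role_prefix[m["role"]] + m["content"] for m in messages]
    prompt = "\n\n".join(parts)
    if add_generation_prompt:
        # Cue the model that the next turn belongs to the assistant.
        prompt += "\n\nAssistant:"
    return prompt


messages = [{"role": "user", "content": "Who are you?"}]
print(render_chat(messages))
```

The `add_generation_prompt` flag mirrors the argument of the same name in `apply_chat_template`: with it, the rendered prompt ends where the assistant's reply should begin.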

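IV. Best Practices

Adjust the generation settings to fit the task: max_new_tokens limits the length of the generated text, while temperature and top_p control how random the output is.
For large-scale inference workloads, vLLM is recommended for its higher throughput.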
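
As a rough illustration of how temperature and top_p shape randomness, the sketch below applies both to a toy set of logits in pure Python. It is a simplified model of the idea and not the implementation used by Transformers or vLLM. The function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1 again.
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.5))  # more mass on the top token
print(top_p_filter([0.5, 0.3, 0.2], 0.7))     # the least likely token is dropped
```

A lower temperature or a smaller top_p makes the output more deterministic, and higher values make it more varied.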

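V. Frequently Asked Questions

(1) Can a tokenizer.model file be provided for model quantization?

DeepSeek LLM uses the HuggingFace Tokenizer to implement the byte-level BPE algorithm, and the tokenizer currently cannot be converted directly into a SentencePiece tokenizer. The DeepSeek team is contributing to open-source quantization methods to make the HuggingFace Tokenizer easier to use with them.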
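
For readers unfamiliar with byte-level BPE, the sketch below shows the core idea on a toy scale: text is first turned into UTF-8 bytes, and the most frequent adjacent pair is then merged into a new token. This illustrates the general algorithm only. It is not DeepSeek's tokenizer, and the function names and the new token ID 256 are assumptions for the example.

```python
from collections import Counter


def to_byte_tokens(text):
    """Represent text as a list of UTF-8 byte values (0-255)."""
    return list(text.encode("utf-8"))


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = to_byte_tokens("abab")
pair = most_frequent_pair(tokens)
print(tokens, pair, merge_pair(tokens, pair, 256))
print(len(to_byte_tokens("你好")))  # each of these characters takes 3 bytes in UTF-8
```

Because the base alphabet is the 256 possible byte values, any input string can be tokenized without an unknown-token fallback.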

(2) How can a GGUF model be generated?

A GGUF model can be generated with the following steps:

git clone https://github.com/DOGEwbx/llama.cpp.git
cd llama.cpp
git checkout regex_gpt2_preprocess
# set up the environment according to README
make
python3 -m pip install -r requirements.txt
# generate GGUF model
python convert-hf-to-gguf.py <MODEL_PATH> --outfile <GGUF_PATH> --model-name deepseekllm
# use q4_0 quantization as an example
./quantize <GGUF_PATH> <OUTPUT_PATH> q4_0
./main -m <OUTPUT_PATH> -n 128 -p <PROMPT>

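(3) GPTQ (exllamav2) Support

exllamav2 now supports the HuggingFace Tokenizer. Pull the latest version to try it.

(4) GPU Memory Usage

Peak inference memory for the 7B and 67B models varies with batch size and sequence length. Inference for the 7B model uses one NVIDIA A100-PCIE-40GB GPU, and inference for the 67B model uses eight NVIDIA A100-PCIE-40GB GPUs. See the table in the documentation for specific figures.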
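
A back-of-the-envelope estimate helps explain why the 67B model needs several 40 GB GPUs. The sketch below counts weight memory only, assuming nominal parameter counts of 7 and 67 billion and 2 bytes per parameter (bfloat16, as in the loading examples). It ignores the KV cache, activations, and framework overhead, so real peak usage is higher. The function names are hypothetical.

```python
import math


def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params, gpu_memory_gb=40, bytes_per_param=2):
    """Lower bound on the GPU count needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


for name, params in [("7B", 7_000_000_000), ("67B", 67_000_000_000)]:
    print(name, weight_memory_gb(params), "GB,", min_gpus_for_weights(params), "GPU(s) minimum")
# 7B 14.0 GB, 1 GPU(s) minimum
# 67B 134.0 GB, 4 GPU(s) minimum
```

The result is a floor, not a recommendation: the eight-GPU setup described above leaves room for the KV cache at larger batch sizes and longer sequences.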

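VI. Model Limitations

Over-reliance on training data: the model is trained on large amounts of text and may absorb biases present in that data, which can lead to biased or discriminatory responses.
Hallucination: the model sometimes produces responses that sound plausible but are factually incorrect or unsupported.
Repetition: generated responses may repeat themselves, which reduces the diversity and appeal of the output.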
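
The repetition limitation can be monitored with a simple metric. The sketch below computes the share of n-grams in a token sequence that duplicate an earlier n-gram. It is one possible heuristic chosen for illustration and does not come from the DeepSeek LLM project. The function name `repeated_ngram_ratio` is hypothetical.

```python
from collections import Counter


def repeated_ngram_ratio(tokens, n=3):
    """Return 1 - (distinct n-grams / total n-grams); 0.0 means no repetition."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(Counter(ngrams)) / len(ngrams)


print(repeated_ngram_ratio("a b c a b c".split()))  # 0.25
print(repeated_ngram_ratio("a b c d e".split()))    # 0.0
```

A ratio that climbs toward 1 on long outputs suggests the model has fallen into a loop, at which point adjusting the sampling settings is worth trying.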

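VII. License

The code repository is licensed under the MIT License. Use of the DeepSeek LLM Base/Chat models is subject to the Model License. The model series supports commercial use.

VIII. Citation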

@article{deepseek-llm,
  author = {DeepSeek-AI},
  title = {DeepSeek LLM: Scaling Open-Source Language Models with Longtermism},
  journal = {arXiv preprint arXiv:2401.02954},
  year = {2024},
  url = {https://github.com/deepseek-ai/DeepSeek-LLM}
}

IX. Contact

If you have any questions, please open an issue or contact [email protected].


https://119291.xyz/posts/deepseek-llm-introduction-and-application/

Published on May 20, 2025
