00242 vLLM Learning Notes


Preface

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput

  • Efficient management of attention key and value memory with PagedAttention

  • Continuous batching of incoming requests

  • Fast model execution with CUDA/HIP graphs

  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)

  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer

  • Speculative decoding

  • Chunked prefill
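
As a quick illustration of the quantization support, here is a minimal sketch of loading a pre-quantized checkpoint with the offline API. The AWQ model name below is an assumption; substitute any AWQ or GPTQ checkpoint you actually have:

# A minimal sketch, assuming an AWQ-quantized checkpoint is available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct-AWQ",  # hypothetical AWQ checkpoint
    quantization="awq",                            # "gptq" / "fp8" work the same way
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)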

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models

  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

  • Tensor parallelism and pipeline parallelism support for distributed inference (see the sketch after this list)

  • Streaming outputs

  • OpenAI-compatible API server

  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators

  • Prefix caching support

  • Multi-LoRA support
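
The tensor-parallelism bullet above maps to a single constructor argument in the offline API. A minimal sketch, assuming a machine with 2 GPUs and reusing the model from these notes:

# A minimal sketch of tensor-parallel offline inference, assuming 2 GPUs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    tensor_parallel_size=2,  # shard weights and attention heads across 2 GPUs
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)

For the server, the same setting is exposed as the --tensor-parallel-size flag shown in the OpenAI-Compatible Server section below.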

Operating System: Ubuntu 22.04.4 LTS

References

  1. Welcome to vLLM!
  2. vllm-project/vllm
  3. Quickstart
  4. Distributed Inference and Serving
  5. OpenAI Chat Completion Client

Offline Batched Inference

# pip install vllm

from vllm import LLM, SamplingParams

prompts = [
    "write a quick sort algorithm.",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Shared decoding settings applied to every prompt in the batch.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=2048)

# Load the model (downloaded from HuggingFace on first use).
llm = LLM(model="Qwen/Qwen2.5-Coder-0.5B-Instruct")

# Batched inference: all prompts are scheduled and generated together.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

OpenAI-Compatible Server

"""
$ pip install vllm
$ vllm serve Qwen/Qwen2.5-Coder-0.5B-Instruct --tensor-parallel-size 2
$ curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-Coder-0.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
"""

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

completion = client.completions.create(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    prompt="write a quick sort algorithm in Python.",
    max_tokens=256
)

print(f"Completion result:\n{completion}")
print(f"{'-'*42}")
print(f"response:\n{completion.choices[0].text}")

print(f"{'-'*42}{'qwq'}{'-'*42}")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-0.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "write a quick sort algorithm in Python."},
    ]
)

print(f"Chat response:\n{chat_response}")
print(f"{'-'*42}")
print(chat_response.choices[0].message.content)

print(f"{'-'*42}{'qwq'}{'-'*42}")

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "Who won the world series in 2020?"
    }, {
        "role":
        "assistant",
        "content":
        "The Los Angeles Dodgers won the World Series in 2020."
    }, {
        "role": "user",
        "content": "Where was it played?"
    }],
    model=model,
)

print(f"Chat completion results:\n{chat_completion}")
print(f"{'-'*42}")
print(chat_completion.choices[0].message.content)
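
The server also supports the standard OpenAI streaming protocol, which backs the "streaming outputs" bullet from the introduction. A minimal sketch with stream=True, reusing the client and model from above:

stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True,  # server sends incremental chunks instead of one response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()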

Conclusion

The 242nd blog post is finished, happy!!!!

Today, too, is a day full of hope.


Author: LuYF-Lemon-love
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit LuYF-Lemon-love when reposting!