Series

vLLM (1) - Qwen2 Inference Deployment
vLLM (2) - Architecture Overview
vLLM (3) - Sequence & SequenceGroup
vLLM (4) - LLMEngine (Part 1)
vLLM (5) - LLMEngine (Part 2)

Contents: Series / Preface / 1. Class Diagram / 2. LLM / 3. LLMEngine / 4. GPUExecutor / 5. Worker / 6. ModelRunner / 7. CacheEngine / Summary

Preface
After the groundwork laid in the previous two posts, we finally arrive at the analysis of LLMEngine. As the diagram below shows, LLMEngine consists of two main parts. The right side includes important classes such as Worker, CacheEngine and ModelRunner; they are used during LLMEngine's initialization, whose work includes model loading and KV cache initialization, and this is the focus of this post. The left side includes Scheduler and BlockSpaceManager, which schedule user requests and manage GPU and CPU memory along the way; that part belongs to LLMEngine's generate phase and will be covered in later posts.

1. Class Diagram
This post focuses on the initialization of LLMEngine. Because the call chain is fairly involved, I use a class diagram to show how the classes relate to each other. The diagram lists only the attributes and methods relevant to this post, so that other members do not distract from the reading. It is best read side by side with the code that follows.
# Class diagram (only the members relevant to this post are shown)

-------------------------
| LLM
-------------------------
| llm_engine: LLMEngine
-------------------------
            |
            v
-------------------------
| LLMEngine
-------------------------
| model_executor: GPUExecutor   # the name is a bit ambiguous: the project also
|                               # has a subpackage called model_executor
| _initialize_kv_caches()       # initialize the kv_caches
| scheduler: Scheduler          # the scheduler
| output_processor              # the output processor
-------------------------
            |
            v
-------------------------
| GPUExecutor
-------------------------
| _init_executor()                                     # initialize the executor
| driver_worker: Worker                                # the worker
| determine_num_available_blocks() -> Tuple[int, int]  # count the available gpu and cpu blocks
| initialize_cache()                                   # initialize the cache: reserve the kv_cache
|                                                      # memory with all-zero tensors
-------------------------
            |
            v
-------------------------
| Worker
-------------------------
| model_runner: ModelRunner    # loads and runs the model
| cache_engine: CacheEngine    # initializes and updates the kv_cache
| init_device()                # initialize the device (gpu)
| load_model()                 # load the model
-------------------------
            |                  |
            v                  v
-------------------------
| ModelRunner
-------------------------
| load_model()
| profile_run()
| capture_model()
-------------------------

-------------------------
| CacheEngine
-------------------------
| gpu_cache
| _allocate_kv_cache() -> List[torch.Tensor]
| get_cache_block_size(...) -> int
-------------------------

2. LLM
LLM is a class that generates text from given prompts and sampling parameters using a specified large language model. Its core component is self.llm_engine, an LLMEngine instance, which does the vast majority of LLM's work. Example usage is shown below: (1) construct the LLM instance; its initialization creates llm_engine: LLMEngine, the focus of this post; (2) serve requests: self.generate() handles resource scheduling to serve user requests efficiently and emits the output text (covered in later posts).
# For the complete example, see the Qwen2 inference post in this series
from vllm import LLM

llm = LLM(model=DEFAULT_CKPT_PATH)  # DEFAULT_CKPT_PATH: model name, or a local download directory
outputs = llm.generate(text, sampling_params)  # text: the input text; sampling_params: the sampling parameters
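The snippet above defers text and sampling_params to the full example. For completeness, a minimal sampling_params could be built as follows; the parameter values here are illustrative only, not taken from the original example.

from vllm import SamplingParams

# Illustrative values; tune temperature/top_p/max_tokens for your use case.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)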
3. LLMEngine

LLMEngine consists of two main parts: (1) model_executor and (2) scheduler. model_executor takes care of everything model-related, such as device selection and model loading, while scheduler handles resource scheduling, which is used heavily during inference. Following the code, here is what LLMEngine does during initialization:

1. Create the model_executor: build the model executor from model_config and a series of other configs. For practitioners on a budget, vLLM will likely run on a single GPU, in which case model_executor is a GPUExecutor; on Neuron or TPU hardware the model_executor is a NeuronExecutor or TPUExecutor instead. (model_config and friends are just the input and default arguments split into per-purpose config objects; I won't expand on them here.)
2. Initialize the kv_caches, via self.model_executor (expanded in the next section): determine how much memory is available for kv_caches and create tensors to occupy it. We already observed and analyzed exactly this step in the "actual GPU memory usage" section of the Qwen2 deployment post; have a look there if this is unclear.
3. Build the scheduler; resource scheduling mostly happens during inference.
4. Everything else, e.g. creating the output_processor; not the focus here.
# vllm/engine/llm_engine.py
class LLMEngine:
    def __init__(self, ...):
        # ...
        self.model_executor = executor_class(
            model_config=model_config,
            cache_config=cache_config,
            parallel_config=parallel_config,
            scheduler_config=scheduler_config,
            device_config=device_config,
            lora_config=lora_config,
            vision_language_config=vision_language_config,
            speculative_config=speculative_config,
            load_config=load_config,
        )  # 1) build the model_executor from the input configs

        if not self.model_config.embedding_mode:
            self._initialize_kv_caches()  # 2) initialize the kv caches

        # 3) build the scheduler
        self.scheduler = Scheduler(scheduler_config, cache_config, lora_config)

        # 4) create the output processor, used when producing the final outputs
        # Create sequence output processor, e.g. for beam search or speculative decoding.
        self.output_processor = (
            SequenceGroupOutputProcessor.create_output_processor(
                self.scheduler_config,
                self.detokenizer,
                self.scheduler,
                self.seq_counter,
                self.get_tokenizer_for_seq,
                stop_checker=StopChecker(
                    self.scheduler_config.max_model_len,
                    self.get_tokenizer_for_seq,
                ),
            ))

    def _initialize_kv_caches(self) -> None:
        """Initialize the KV cache in the worker(s).

        The workers will determine the number of blocks in both the GPU cache
        and the swap CPU cache.
        """
        num_gpu_blocks, num_cpu_blocks = (
            self.model_executor.determine_num_available_blocks())

        if self.cache_config.num_gpu_blocks_override is not None:
            num_gpu_blocks_override = self.cache_config.num_gpu_blocks_override
            logger.info(
                "Overriding num_gpu_blocks=%d with "
                "num_gpu_blocks_override=%d", num_gpu_blocks,
                num_gpu_blocks_override)
            num_gpu_blocks = num_gpu_blocks_override

        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)

4. GPUExecutor
What does the model_executor (here, GPUExecutor) do during initialization? GPUExecutor inherits from the base class ExecutorBase; its self.__init__() calls self._init_executor(), which does the following:

- Create the worker with self._create_worker(), which actually goes through WorkerWrapperBase. Different configurations yield different worker types: the default is Worker; when you use speculative decoding it is SpecDecodeWorker (used well, speculative decoding speeds up decoding).
- Have the worker initialize the device: self.driver_worker.init_device().
- Have the worker load the model: self.driver_worker.load_model().

As mentioned earlier, after GPUExecutor is created it is also used to initialize the kv_caches, as shown in LLMEngine._initialize_kv_caches() above. Two GPUExecutor methods are involved:

- self.determine_num_available_blocks(): returns the numbers of currently available gpu_blocks and cpu_blocks. A "block" here means a fixed-size chunk: GPU and CPU memory are divided into chunks of block_size tokens, each corresponding to a fixed amount of memory.
- self.initialize_cache(): once num_gpu_blocks and num_cpu_blocks are determined, i.e. how much GPU and CPU memory may be used for kv_caches, claim those resources and initialize the cache.

This outlines GPUExecutor's early work, but these operations basically delegate to the worker it creates, which we turn to in the next section.
# vllm/executor/gpu_executor.py
class GPUExecutor(ExecutorBase):

    def _init_executor(self) -> None:
        """Initialize the worker and load the model."""
        assert self.parallel_config.world_size == 1, (
            "GPUExecutor only supports single GPU.")

        self.driver_worker = self._create_worker()  # create the worker
        self.driver_worker.init_device()            # initialize the device
        self.driver_worker.load_model()             # load the model

    def _create_worker(self,
                       local_rank: int = 0,
                       rank: int = 0,
                       distributed_init_method: Optional[str] = None):
        if self.speculative_config is None:
            worker_module_name = "vllm.worker.worker"
            worker_class_name = "Worker"
        else:
            worker_module_name = "vllm.spec_decode.spec_decode_worker"
            worker_class_name = "create_spec_worker"

        wrapper = WorkerWrapperBase(
            worker_module_name=worker_module_name,
            worker_class_name=worker_class_name,
        )
        wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
                                                      distributed_init_method))
        return wrapper.worker

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Determine the number of available KV blocks by invoking the
        underlying worker.
        """
        return self.driver_worker.determine_num_available_blocks()

    def initialize_cache(self, num_gpu_blocks: int, num_cpu_blocks) -> None:
        """Initialize the KV cache by invoking the underlying worker.
        """
        # NOTE: This is logged in the executor because there can be >1 worker
        # with other executors. We could log in the engine level, but work
        # remains to abstract away the device for non-GPU configurations.
        logger.info("# GPU blocks: %d, # CPU blocks: %d", num_gpu_blocks,
                    num_cpu_blocks)
        self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)

5. Worker
Worker carries two main responsibilities, the model and the cache, corresponding to its members model_runner and cache_engine.
- self.model_runner: for text generation with a large model, as in this series, it is a ModelRunner instance; in embedding_mode it would be an EmbeddingModelRunner instance.
- self.cache_engine: a CacheEngine instance. self.initialize_cache() mainly sets up self.cache_engine; the details are in the section after next.

As for self.determine_num_available_blocks(), it returns num_gpu_blocks and num_cpu_blocks, obtained as follows (a small worked example follows this list):

- num_gpu_blocks: after emptying the CUDA cache, run one forward pass to profile the model's memory usage, then read the current CUDA device's free and total memory; from these, the peak memory usage peak_memory can be computed. The memory available for kv_caches is then total_gpu_memory * self.cache_config.gpu_memory_utilization - peak_memory, where gpu_memory_utilization is the target GPU utilization fraction (default 0.9). Since the cache lives in blocks, dividing by cache_block_size, the number of bytes one block occupies (covered under CacheEngine), gives num_gpu_blocks.
- num_cpu_blocks: the model never computes on the CPU, but the CPU can hold cache to be swapped to the GPU when needed; the size of this budget is self.cache_config.swap_space_bytes, 4 GiB by default.
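To make the arithmetic concrete, here is a toy recomputation. Every number in it (the 80 GiB card, the 20 GiB profiled peak, the 2 MiB block size) is an assumption for illustration; the real values come from profiling and from CacheEngine.get_cache_block_size(). The actual vLLM implementation follows right after.

# A toy recomputation of num_gpu_blocks / num_cpu_blocks.
# All numbers below are assumptions for illustration only.
GiB = 1024 ** 3

total_gpu_memory = 80 * GiB        # e.g. an 80 GiB card
peak_memory = 20 * GiB             # profiled peak usage (weights + activations)
gpu_memory_utilization = 0.9       # vLLM default
swap_space_bytes = 4 * GiB         # vLLM default CPU swap space
cache_block_size = 2 * 1024 ** 2   # assumed bytes per block (model-dependent)

num_gpu_blocks = int(
    (total_gpu_memory * gpu_memory_utilization - peak_memory)
    // cache_block_size)  # (72 GiB - 20 GiB) / 2 MiB = 26624 blocks
num_cpu_blocks = int(swap_space_bytes // cache_block_size)  # 2048 blocks

print(num_gpu_blocks, num_cpu_blocks)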
# vllm/worker/worker.py
class Worker(WorkerBase):

    def __init__(self, ...):  # the arguments are config objects, omitted here
        # unrelated code
        pass
        ModelRunnerClass = (EmbeddingModelRunner if
                            self.model_config.embedding_mode else ModelRunner)
        self.model_runner = ModelRunnerClass(
            model_config,
            parallel_config,
            scheduler_config,
            device_config,
            cache_config,
            load_config=load_config,
            lora_config=self.lora_config,
            kv_cache_dtype=self.cache_config.cache_dtype,
            is_driver_worker=is_driver_worker,
            vision_language_config=vision_language_config,
        )
        # Uninitialized cache engine. Will be initialized by
        # initialize_cache.
        self.cache_engine: CacheEngine
        # Initialize gpu_cache as embedding models don't initialize kv_caches
        self.gpu_cache: Optional[List[torch.tensor]] = None

    # ------------- called from GPUExecutor during initialization ------------- #
    def init_device(self) -> None:
        if self.device_config.device.type == "cuda":
            os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"

            # This env var set by Ray causes exceptions with graph building.
            os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
            self.device = torch.device(f"cuda:{self.local_rank}")
            torch.cuda.set_device(self.device)

            _check_if_gpu_supports_dtype(self.model_config.dtype)
            torch.cuda.empty_cache()
            self.init_gpu_memory = torch.cuda.mem_get_info()[0]
        else:
            raise RuntimeError(
                f"Not support device type: {self.device_config.device}")
        # Initialize the distributed environment.
        init_worker_distributed_environment(self.parallel_config, self.rank,
                                            self.distributed_init_method,
                                            self.local_rank)
        # Set random seed.
        set_random_seed(self.model_config.seed)

    def load_model(self):
        self.model_runner.load_model()

    # ------------------- model-runner related ------------------- #
    @torch.inference_mode()
    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Profiles the peak memory usage of the model to determine how many
        KV blocks may be allocated without OOMs.
        """
        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        self.model_runner.profile_run()

        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # NOTE(woosuk): Here we assume that the other processes using the same
        # GPU did not change their memory usage during the profiling.
        peak_memory = self.init_gpu_memory - free_gpu_memory
        assert peak_memory > 0, (
            "Error in memory profiling. "
            "This happens when the GPU memory was not properly cleaned up "
            "before initializing the vLLM instance.")

        cache_block_size = self.get_cache_block_size_bytes()
        num_gpu_blocks = int(
            (total_gpu_memory * self.cache_config.gpu_memory_utilization -
             peak_memory) // cache_block_size)
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        return num_gpu_blocks, num_cpu_blocks

    # --------------------- cache related ------------------------ #
    def initialize_cache(self, num_gpu_blocks: int,
                         num_cpu_blocks: int) -> None:
        """Allocate GPU and CPU KV cache with the specified number of blocks.

        This also warms up the model, which may record CUDA graphs.
        """
        raise_if_cache_size_invalid(num_gpu_blocks,
                                    self.cache_config.block_size,
                                    self.model_config.max_model_len)

        self.cache_config.num_gpu_blocks = num_gpu_blocks
        self.cache_config.num_cpu_blocks = num_cpu_blocks

        self._init_cache_engine()
        self._warm_up_model()

    def _init_cache_engine(self):
        assert self.cache_config.num_gpu_blocks is not None
        self.cache_engine = CacheEngine(self.cache_config, self.model_config,
                                        self.parallel_config)
        self.gpu_cache = self.cache_engine.gpu_cache

    def _warm_up_model(self) -> None:
        if not self.model_config.enforce_eager:
            self.model_runner.capture_model(self.gpu_cache)
        # Reset the seed to ensure that the random state is not affected by
        # the model initialization and profiling.
        set_random_seed(self.model_config.seed)

6. ModelRunner
This section focuses on two ModelRunner methods: self.profile_run() and self.capture_model().
self.profile_run() runs the model on dummy inputs to observe the actual memory usage. The key call is self.execute_model(seqs, kv_caches): we must prepare the inputs seqs and the caches kv_caches, where kv_caches is used by the Attention computation inside the model and starts out as all None. Detailed comments below.
class ModelRunner:
    # ...

    @torch.inference_mode()
    def profile_run(self) -> None:
        # top-k sampling, to profile memory usage
        sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1)
        # max number of tokens processed in one batch (typically 32k)
        max_num_batched_tokens = self.scheduler_config.max_num_batched_tokens
        # max number of sequences (typically 256)
        max_num_seqs = self.scheduler_config.max_num_seqs

        # For profiling, use max_num_seqs sequences whose total token count
        # equals max_num_batched_tokens.
        seqs: List[SequenceGroupMetadata] = []
        model_config = self.model_config
        # lora: omitted
        # vlm: omitted

        for group_id in range(max_num_seqs):
            # split the tokens evenly to get each sequence's length
            seq_len = (max_num_batched_tokens // max_num_seqs +
                       (group_id < max_num_batched_tokens % max_num_seqs))
            # SequenceData and SequenceGroupMetadata were covered in earlier
            # posts; construct the dummy input
            seq_data = SequenceData([0] * seq_len)
            dummy_multi_modal_data = None
            seq = SequenceGroupMetadata(
                request_id=str(group_id),
                is_prompt=True,
                seq_data={group_id: seq_data},
                sampling_params=sampling_params,
                block_tables=None,
                lora_request=dummy_lora_requests_per_seq[group_id]
                if dummy_lora_requests_per_seq else None,
                multi_modal_data=dummy_multi_modal_data,
            )
            seqs.append(seq)

        # Construct the kv caches; inference has not started yet, so
        # initialize them to None
        num_layers = self.model_config.get_num_layers(self.parallel_config)
        kv_caches = [None] * num_layers
        # Run the model
        self.execute_model(seqs, kv_caches)
        # CUDA synchronize
        torch.cuda.synchronize()
        return

    @torch.inference_mode()
    def execute_model(
        self,
        seq_group_metadata_list: Optional[List[SequenceGroupMetadata]],
        kv_caches: List[torch.Tensor],
    ) -> Optional[SamplerOutput]:
        # Prepare the input tensors
        (input_tokens, input_positions, attn_metadata, sampling_metadata,
         lora_requests, lora_mapping, multi_modal_kwargs
         ) = self.prepare_input_tensors(seq_group_metadata_list)

        # lora: omitted

        # CUDA graphs are only used in the decode phase, where they improve
        # efficiency
        prefill_meta = attn_metadata.prefill_metadata  # details not needed yet
        decode_meta = attn_metadata.decode_metadata
        if prefill_meta is None and decode_meta.use_cuda_graph:
            graph_batch_size = input_tokens.shape[0]
            model_executable = self.graph_runners[graph_batch_size]
        else:
            model_executable = self.model

        # Run the model; model definitions live in vllm/model_executor/models/,
        # see qwen2.py for this series
        hidden_states = model_executable(
            input_ids=input_tokens,
            positions=input_positions,
            kv_caches=kv_caches,
            attn_metadata=attn_metadata,
            **multi_modal_kwargs,
        )

        # Compute the logits.
        logits = self.model.compute_logits(hidden_states, sampling_metadata)

        # Only perform sampling in the driver worker.
        if not self.is_driver_worker:
            return None

        # Sample the next token.
        output = self.model.sample(
            logits=logits,
            sampling_metadata=sampling_metadata,
        )
        return output

self.capture_model() uses CUDA Graphs, applied only to decoding, to capture the model's execution once so that the captured graph can be replayed during subsequent inference steps, improving performance. Brief comments are given in the code.
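Before reading capture_model(), it may help to see what CUDA graph capture and replay look like in plain PyTorch. This is a minimal sketch, not vLLM code: the tiny linear model, shapes and warm-up count are assumptions; torch.cuda.CUDAGraph, torch.cuda.graph and graph.replay() are the actual PyTorch API (available since PyTorch 1.10).

# Minimal CUDA graph capture/replay in plain PyTorch (CUDA required).
import torch

model = torch.nn.Linear(64, 64).cuda().eval()          # toy model, assumed
static_input = torch.randn(8, 64, device="cuda")       # fixed-size input buffer

# Warm up on a side stream so capture starts from a clean state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        with torch.no_grad():
            model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: kernels launched inside this context are recorded, not executed.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# Replay: copy new data into the SAME input buffer, then replay the graph;
# static_output is updated in place.
static_input.copy_(torch.randn(8, 64, device="cuda"))
graph.replay()
print(static_output)

This fixed-buffer requirement is exactly why capture_model() below captures a separate graph for each batch size in _BATCH_SIZES_TO_CAPTURE, and why execute_model() above picks self.graph_runners[graph_batch_size] based on the input's batch dimension.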
class ModelRunner:
    # ...

    @torch.inference_mode()
    def capture_model(self, kv_caches: List[torch.Tensor]) -> None:
        """Cuda graph capture a model.
        ...
        CUDA graphs are mainly useful for the decode phase: for larger batch
        sizes the speedup is marginal, and because CUDA graphs require
        fixed-size tensors, supporting large or variable batch sizes comes
        with a high GPU memory cost.
        """
        # informative log messages, worth a read
        assert not self.model_config.enforce_eager
        logger.info("Capturing the model for CUDA graphs. This may lead to "
                    "unexpected consequences if the model is not static. To "
                    "run the model in eager mode, set 'enforce_eager=True' or "
                    "use '--enforce-eager' in the CLI.")
        logger.info("CUDA graphs can take additional 1~3 GiB memory per GPU. "
                    "If you are running out of memory, consider decreasing "
                    "`gpu_memory_utilization` or enforcing eager mode. "
                    "You can also reduce the `max_num_seqs` as needed "
                    "to decrease memory usage.")
        start_time = time.perf_counter()

        # dummy inputs
        max_batch_size = max(_BATCH_SIZES_TO_CAPTURE)
        input_tokens = torch.zeros(max_batch_size, dtype=torch.long).cuda()
        input_positions = torch.zeros(max_batch_size, dtype=torch.long).cuda()
        slot_mapping = torch.empty(max_batch_size, dtype=torch.long).cuda()
        slot_mapping.fill_(_PAD_SLOT_ID)
        seq_lens = torch.ones(max_batch_size, dtype=torch.int32).cuda()
        block_tables = torch.from_numpy(self.graph_block_tables).cuda()

        # Buffer to store the output hidden states; it will be filled after
        # the first graph capture.
        hidden_states: Optional[torch.Tensor] = None

        # the batch sizes to capture
        graph_batch_size = _get_graph_batch_size(
            self.scheduler_config.max_num_seqs)
        batch_size_capture_list = [
            bs for bs in _BATCH_SIZES_TO_CAPTURE if bs <= graph_batch_size
        ]

        # Capture the CUDA graphs; graph_capture() is a context manager
        # (handles some parallel strategies)
        with graph_capture() as graph_capture_context:
            # NOTE: Capturing the largest batch size first may help reduce the
            # memory usage of CUDA graph.
            for batch_size in reversed(batch_size_capture_list):
                # Create dummy attn_metadata.
                attn_metadata = self.attn_backend.make_metadata(
                    num_prefills=0,
                    num_prefill_tokens=0,
                    num_decode_tokens=batch_size,
                    slot_mapping=slot_mapping[:batch_size],
                    seq_lens=None,
                    seq_lens_tensor=seq_lens[:batch_size],
                    max_query_len=None,
                    max_prefill_seq_len=0,
                    max_decode_seq_len=self.max_seq_len_to_capture,
                    query_start_loc=None,
                    seq_start_loc=None,
                    context_lens_tensor=None,
                    block_tables=block_tables[:batch_size],
                    use_cuda_graph=True,
                )

                if self.lora_config:
                    lora_mapping = LoRAMapping(
                        [0] * batch_size,
                        [0] * batch_size,
                    )
                    self.set_active_loras(set(), lora_mapping)

                # Create a CUDAGraphRunner and capture the model's execution
                # with its capture method
                graph_runner = CUDAGraphRunner(self.model)
                hidden_states = graph_runner.capture(
                    input_tokens[:batch_size],
                    input_positions[:batch_size],
                    hidden_states[:batch_size]
                    if hidden_states is not None else None,
                    kv_caches,
                    attn_metadata,
                    memory_pool=self.graph_memory_pool,
                    stream=graph_capture_context.stream,
                )
                self.graph_memory_pool = graph_runner.graph.pool()
                # store the graph_runner
                self.graph_runners[batch_size] = graph_runner

        end_time = time.perf_counter()
        elapsed_time = end_time - start_time
        # This usually takes < 10 seconds.
        logger.info("Graph capturing finished in %.0f secs.", elapsed_time)

7. CacheEngine
- self._allocate_kv_cache(): initializes the kv_cache; the point is simply to claim the GPU and CPU memory up front.
- self.get_cache_block_size(): computes the number of bytes per block, i.e. the cache_block_size we saw in Worker. (1) Each block stores the kv_caches of block_size tokens; (2) a single token's k takes num_heads * head_size * num_layers elements, and likewise for v; (3) from (1) and (2) we get the number of elements per block, and the data type then gives the number of bytes per block (see the comments in the code, and the worked example after it).
- self.swap_in(), self.swap_out() and self.copy(): these are not used during initialization, but briefly: when serving many user requests, resources must be reallocated. For instance, if some data was previously cached on the CPU and GPU memory has since freed up, self.swap_in() moves that data to the GPU for computation; conversely, when GPU memory is full, part of the GPU cache may be moved out to the CPU with self.swap_out(), waiting for a chance to move back.
# vllm/worker/cache_engine.py
class CacheEngine:
    """Manages the KV cache.

    This class is responsible for initializing and managing the GPU and CPU KV
    caches. It also provides methods for performing KV cache operations, such
    as swapping and copying.
    """

    def __init__(
        self,
        cache_config: CacheConfig,
        model_config: ModelConfig,
        parallel_config: ParallelConfig,
    ) -> None:
        self.cache_config = cache_config  # the configs passed in
        self.model_config = model_config
        self.parallel_config = parallel_config

        self.head_size = model_config.get_head_size()  # dim of each attention head
        self.num_layers = model_config.get_num_layers(parallel_config)  # layers per pp rank
        self.num_kv_heads = model_config.get_num_kv_heads(parallel_config)  # kv heads per tp rank

        self.block_size = cache_config.block_size
        self.num_gpu_blocks = cache_config.num_gpu_blocks
        self.num_cpu_blocks = cache_config.num_cpu_blocks

        if cache_config.cache_dtype == "auto":
            self.dtype = model_config.dtype
        else:
            self.dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

        # Get attention backend.
        self.attn_backend = get_attn_backend(
            model_config.get_num_attention_heads(parallel_config),
            self.head_size,
            self.num_kv_heads,
            model_config.get_sliding_window(),
            model_config.dtype,
            cache_config.cache_dtype,
            self.block_size,
        )

        # Initialize the cache.
        self.gpu_cache = self._allocate_kv_cache(self.num_gpu_blocks, "cuda")
        self.cpu_cache = self._allocate_kv_cache(self.num_cpu_blocks, "cpu")

    def _allocate_kv_cache(
        self,
        num_blocks: int,
        device: str,
    ) -> List[torch.Tensor]:
        """Allocates KV cache on the specified device."""
        kv_cache_shape = self.attn_backend.get_kv_cache_shape(
            num_blocks, self.block_size, self.num_kv_heads,
            self.head_size)  # compute the shape of the kv_cache
        pin_memory = is_pin_memory_available() if device == "cpu" else False
        kv_cache: List[torch.Tensor] = []
        for _ in range(self.num_layers):  # append each layer's kv cache
            # null block in CpuGpuBlockAllocator requires at least that
            # block to be zeroed-out.
            # We zero-out everything for simplicity.
            kv_cache.append(
                torch.zeros(kv_cache_shape,
                            dtype=self.dtype,
                            pin_memory=pin_memory,
                            device=device))
        return kv_cache

    def swap_in(self, src_to_dst: torch.Tensor) -> None:
        for i in range(self.num_layers):
            self.attn_backend.swap_blocks(self.cpu_cache[i], self.gpu_cache[i],
                                          src_to_dst)

    def swap_out(self, src_to_dst: torch.Tensor) -> None:
        for i in range(self.num_layers):
            self.attn_backend.swap_blocks(self.gpu_cache[i], self.cpu_cache[i],
                                          src_to_dst)

    def copy(self, src_to_dsts: torch.Tensor) -> None:
        self.attn_backend.copy_blocks(self.gpu_cache, src_to_dsts)

    @staticmethod
    def get_cache_block_size(
        cache_config: CacheConfig,
        model_config: ModelConfig,
        parallel_config: ParallelConfig,
    ) -> int:
        head_size = model_config.get_head_size()
        num_heads = model_config.get_num_kv_heads(parallel_config)
        num_layers = model_config.get_num_layers(parallel_config)

        # block_size defaults to 16, i.e. one block holds the kv_caches of
        # 16 tokens. Computing the size in bytes must account for both k and
        # v, each taking num_heads * head_size * num_layers elements per token.
        key_cache_block = cache_config.block_size * num_heads * head_size
        value_cache_block = key_cache_block
        total = num_layers * (key_cache_block + value_cache_block)
        if cache_config.cache_dtype == "auto":
            dtype = model_config.dtype
        else:
            dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]
        dtype_size = get_dtype_size(dtype)  # 2 for bf16
        return dtype_size * total
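As a sanity check on get_cache_block_size(), here is the same computation done by hand with concrete numbers. The model shape below (4 KV heads per tensor-parallel rank, head_size 128, 28 layers, bf16) is assumed for illustration and not read from any real config file.

# Recomputing get_cache_block_size with assumed, illustrative numbers.
block_size = 16    # tokens per block (vLLM default)
num_kv_heads = 4   # kv heads per tp rank (assumed)
head_size = 128    # dim per head (assumed)
num_layers = 28    # layers per pp rank (assumed)
dtype_size = 2     # bytes per element for bf16

key_cache_block = block_size * num_kv_heads * head_size  # 8192 elements
value_cache_block = key_cache_block                      # v mirrors k
total_elements = num_layers * (key_cache_block + value_cache_block)
cache_block_size = dtype_size * total_elements
print(cache_block_size)  # 917504 bytes, i.e. 0.875 MiB per block

Dividing the kv-cache byte budget computed in the Worker section by this cache_block_size yields num_gpu_blocks.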
Summary

This post walked through the initialization of LLMEngine, covering methods of GPUExecutor, Worker, ModelRunner and CacheEngine; it should clarify what happens before vLLM starts generating text. The other major component of LLMEngine, the Scheduler, will be covered in a later post on the request-processing phase.