
news 2026/1/2 17:07:45

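Over the past week or two, many internet companies have started sending out offers for the autumn campus recruiting season. Unlike in previous years, the job market is no longer one where both sides eagerly seek each other out: there are more candidates, fewer openings (HC), and higher requirements for each role. We have recently compiled interview questions from a number of large companies to help members of our community and to share the twists and turns of technical interviews. The 《大模型面试宝典》 (LLM Interview Handbook, 2024 edition) has been officially released. If you like this article, remember to bookmark, follow, and like it. For more hands-on practice and interview discussion, join our technical discussion group at the end of the article.

This article uses the llama model source code to study how relative position encoding is implemented. It does not go into the mathematical principles behind absolute and relative position encoding. Newcomers to large models are often puzzled by a few questions:

- Why must a transformer use position encoding at all?
- How is relative position encoding implemented in llama?
- What does position encoding have to do with prediction on very long texts?

01 Why position encoding is needed

Many beginners read a sentence like this: a transformer uses position encoding because it has no position information of its own. Most people treat the sentence as an axiom and seldom ask what it actually means.

What it means is that, without position encoding, the inputs "床前明月", "前床明月", and "前明床月" (the same four characters, with the first three in different orders) would all produce exactly the same predicted text. In other words, whatever the order of the prompt, as long as the prompt contains the same tokens, the decoded text depends only on which token comes last.

    import math

    import torch
    from torch import nn

    batch = 1
    dim = 10
    num_head = 2
    embedding = nn.Embedding(5, dim)
    q_matrix = nn.Linear(dim, dim, bias=False)
    k_matrix = nn.Linear(dim, dim, bias=False)
    v_matrix = nn.Linear(dim, dim, bias=False)

    # x and y contain the same tokens; only the first two are swapped.
    x = embedding(torch.tensor([1, 2, 3])).unsqueeze(0)
    y = embedding(torch.tensor([2, 1, 3])).unsqueeze(0)

    def attention(input):
        # Project and split into heads: [batch, num_head, seq_len, head_dim]
        q = q_matrix(input).view(batch, -1, num_head, dim // num_head).transpose(1, 2)
        k = k_matrix(input).view(batch, -1, num_head, dim // num_head).transpose(1, 2)
        v = v_matrix(input).view(batch, -1, num_head, dim // num_head).transpose(1, 2)
        # Scaled dot-product attention with no mask and no position encoding
        attn_weights = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim // num_head)
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
        outputs = torch.matmul(attn_weights, v).transpose(1, 2).reshape(batch, -1, dim)
        print(outputs)

    attention(x)
    attention(y)

Running the code above shows that although x and y swap the first and second tokens, the result computed for the third token does not change at all, so the model would produce the same result when predicting the fourth token.

If the matrix operations are confusing, a simple derivation helps: when the first and second tokens swap places, the first and second rows of the model's output matrix also swap places, but the output values themselves do not change. The output for the third token is not affected either. This is what was meant above: without position encoding, the decoded text depends only on the last token of the prompt.

Note, however, that the example above applies no mask. In a real decoder the attention_mask prevents earlier tokens from seeing later ones, so even without position encoding the transformer's output is still affected by token positions.

02 Implementing relative position encoding

We use the source code of modeling_llama.py as an example to study how relative position encoding is implemented.

    import torch

    class LlamaRotaryEmbedding(torch.nn.Module):
        def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
            super().__init__()
            inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
            self.register_buffer("inv_freq", inv_freq)

            # Build here to make `torch.jit.trace` work.
            self.max_seq_len_cached = max_position_embeddings
            t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
            freqs = torch.einsum("i,j->ij", t, self.inv_freq)
            # Different from paper, but it uses a different permutation in order to obtain the same calculation
            emb = torch.cat((freqs, freqs), dim=-1)
            self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
            self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)

        def forward(self, x, seq_len=None):
            # x: [bs, num_attention_heads, seq_len, head_size]
            # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
            if seq_len > self.max_seq_len_cached:
                self.max_seq_len_cached = seq_len
                t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
                freqs = torch.einsum("i,j->ij", t, self.inv_freq)
                # Different from paper, but it uses a different permutation in order to obtain the same calculation
                emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
                self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
                self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)
            return (
                self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
                self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            )


    def rotate_half(x):
        """Rotates half the hidden dims of the input."""
        x1 = x[..., : x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2 :]
        return torch.cat((-x2, x1), dim=-1)


    def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
        # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
        cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
        sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
        cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
        sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
        q_embed = (q * cos) + (rotate_half(q) * sin)
        k_embed = (k * cos) + (rotate_half(k) * sin)
        return q_embed, k_embed

Relative position encoding is applied inside attention as follows:

    self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    if past_key_value is not None:
        # reuse k, v, self_attention
        # States are [bs, num_heads, seq_len, head_dim], so the cache is joined along the sequence axis.
        key_states = torch.cat([past_key_value[0], key_states], dim=2)
        value_states = torch.cat([past_key_value[1], value_states], dim=2)

- The cos and sin tensors are fetched for kv_seq_len positions; value_states only supplies the dtype (and the device if the cache has to be rebuilt). Both tensors have shape 1 × 1 × seq_len × head_dim and are broadcast over batch_size and head_num.
- apply_rotary_pos_emb modifies the query_states and key_states tensors, producing the new q and k matrices.

Note that during decoding, the length of position_ids matches the number of input tokens at that step. Suppose the prompt has 4 tokens:

- First decoding step: position_ids: tensor([[0, 1, 2, 3]], device='cuda:0'). The relative position information for the q and k matrices is obtained through apply_rotary_pos_emb().
- Second decoding step: position_ids: tensor([[4]], device='cuda:0'). The relative position information for the current token is obtained through apply_rotary_pos_emb(). The relative position information for the first 4 tokens is carried into the k matrix through key_states = torch.cat([past_key_value[0], key_states], dim=2).
- ……

The formulas behind the code above can all be found in the original article by Su Jianlin (苏神). The code can be lifted out of the llama model and run directly. If anything is unclear, print out the whole apply_rotary_pos_emb() process as shown below and inspect it:

    head_num, head_dim, kv_seq_len = 8, 20, 5
    position_ids = torch.tensor([[0, 1, 2, 3, 4]])
    # Shape: [bs, head_num, seq_len, head_dim]
    query_states = torch.randn(1, head_num, kv_seq_len, head_dim)
    key_states = torch.randn(1, head_num, kv_seq_len, head_dim)
    value_states = torch.randn(1, head_num, kv_seq_len, head_dim)
    rotary_emb = LlamaRotaryEmbedding(head_dim)
    cos, sin = rotary_emb(value_states, seq_len=kv_seq_len)
    print(cos, sin)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

03 Position encoding and length extrapolation

Length extrapolation refers to the situation where a large model has only seen texts of length X during training but is asked to handle longer inputs in actual use. Suppose X is 4096. The model has then never seen a position encoding with pos_id ≥ 4096 at any point, so its predictions for those positions become completely uncontrollable. The key to solving the length extrapolation problem is therefore how to let the model see position encodings longer than the training texts.

The introduction to extrapolation above is a plain-language explanation, meant only to emphasize that position encoding matters.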
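Returning to the rotary embedding from section 02, the following is a minimal sketch of why it counts as a relative position encoding: after rotation, the attention score between a query at position m and a key at position n depends only on the offset m - n. The helper `rope` is hypothetical and is not part of modeling_llama.py; it applies the same cos/sin and `rotate_half` formula to a single head vector, and float64 is used here only to keep rounding error small.

```python
import torch


def rotate_half(x):
    # Same half-split permutation as in the LLaMA code: (x1, x2) -> (-x2, x1)
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def rope(x, pos, base=10000.0):
    # Hypothetical helper: rotate one head vector x of shape [head_dim] to position pos.
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float64) / head_dim))
    angles = pos * inv_freq                      # [head_dim // 2]
    emb = torch.cat((angles, angles), dim=-1)    # [head_dim]
    return x * emb.cos() + rotate_half(x) * emb.sin()


torch.manual_seed(0)
q = torch.randn(20, dtype=torch.float64)
k = torch.randn(20, dtype=torch.float64)

# Same offset (m - n = 3) at two very different absolute positions.
score_a = torch.dot(rope(q, 5), rope(k, 2))
score_b = torch.dot(rope(q, 105), rope(k, 102))
# A different offset (m - n = 4).
score_c = torch.dot(rope(q, 5), rope(k, 1))

print(score_a.item(), score_b.item(), score_c.item())
```

The first two printed scores should agree up to floating-point error, while the third generally differs. This is also why the cached keys in `past_key_value` can be reused unchanged during decoding: each key was rotated once by its own absolute position, and the offset to any later query falls out of the dot product.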


http://www.dnsts.com.cn/news/168752.html