Hands-on RAG series:

- Hands-on RAG: embedding models
- Hands-on RAG: fine-tuning moka-ai/m3e with DeepSpeed and contrastive learning
- Hands-on RAG: fine-tuning the late-interaction model ColBERT in practice, bge-m3

1. Environment setup

```shell
pip install transformers
pip install open-retrievals
```

Note that the package is installed as `pip install open-retrievals`, but it is imported as `import retrievals`. For the latest updates, see https://github.com/LongxingTan/open-retrievals

2. Using the M3E model

```python
from retrievals import AutoModelForEmbedding

embedder = AutoModelForEmbedding.from_pretrained('moka-ai/m3e-base', pooling_method='mean')
embedder

sentences = [
    '* Moka 此文本嵌入模型由 MokaAI 训练并开源，训练脚本使用 uniem',
    '* Massive 此文本嵌入模型通过**千万级**的中文句对数据集进行训练',
    '* Mixed 此文本嵌入模型支持中英双语的同质文本相似度计算，异质文本检索等功能，未来还会支持代码检索，ALL in one',
]

embeddings = embedder.encode(sentences)

for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print()
```
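For retrieval, these embeddings are typically compared with cosine similarity. The snippet below is a rough sketch, not part of the original example: it ranks the three sentences above against a query, assuming `encode` returns array-like vectors as in the example, and the query string is only an illustration.

```python
# Rough sketch: rank the sentences above against a query by cosine similarity.
# Assumes `encode` returns array-like vectors; np.asarray keeps the math on plain arrays.
import numpy as np

query_embedding = np.asarray(embedder.encode(['支持中英双语检索的模型是哪一个？']))[0]
doc_embeddings = np.asarray(embeddings)

# cosine similarity = dot product of L2-normalized vectors
query_embedding = query_embedding / np.linalg.norm(query_embedding)
doc_embeddings = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
scores = doc_embeddings @ query_embedding

for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.4f}  {sentence}")
```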
3. Fine-tuning m3e with DeepSpeed

The data is still the t2-ranking dataset introduced in the earlier posts.

Save the DeepSpeed config as ds_zero2_no_offload.json. Although ZeRO stage 2 is configured, I only use a single GPU here; DeepSpeed also scales easily to multiple GPUs or multiple nodes. For more on DeepSpeed's distributed setup, refer to the Transformer distributed-training series.

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 100,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1e-10
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

The script below is slightly modified from open-retrievals: the relative imports are replaced with package imports. Save it as embed.py.

```python
"""Embedding fine tune pipeline"""

import logging
import os
import pickle
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, HfArgumentParser, TrainingArguments, set_seed

from retrievals import (
    EncodeCollator,
    EncodeDataset,
    PairCollator,
    RetrievalTrainDataset,
    TripletCollator,
)
from retrievals.losses import AutoLoss, InfoNCE, SimCSE, TripletLoss
from retrievals.models.embedding_auto import AutoModelForEmbedding
from retrievals.trainer import RetrievalTrainer

# os.environ["WANDB_LOG_MODEL"] = "false"

logger = logging.getLogger(__name__)


@dataclass
class ModelArguments:
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
    )
    causal_lm: bool = field(default=False, metadata={"help": "Whether the model is a causal lm or not"})
    lora_path: Optional[str] = field(default=None, metadata={"help": "Lora adapter save path"})


@dataclass
class DataArguments:
    data_name_or_path: str = field(default=None, metadata={"help": "Path to train data"})
    train_group_size: int = field(default=2)
    unfold_each_positive: bool = field(default=False)
    query_max_length: int = field(
        default=32,
        metadata={
            "help": "The maximum total input sequence length after tokenization for passage. "
            "Sequences longer than this will be truncated, sequences shorter will be padded."
        },
    )
    document_max_length: int = field(
        default=128,
        metadata={
            "help": "The maximum total input sequence length after tokenization for passage. "
            "Sequences longer than this will be truncated, sequences shorter will be padded."
        },
    )
    query_instruction: str = field(default=None, metadata={"help": "instruction for query"})
    document_instruction: str = field(default=None, metadata={"help": "instruction for document"})
    query_key: str = field(default=None)
    positive_key: str = field(default="positive")
    negative_key: str = field(default="negative")
    is_query: bool = field(default=False)
    encoding_save_file: str = field(default="embed.pkl")

    def __post_init__(self):
        # self.data_name_or_path = "json"
        self.dataset_split = "train"
        self.dataset_language = "default"

        if self.data_name_or_path is not None:
            if not os.path.isfile(self.data_name_or_path) and not os.path.isdir(self.data_name_or_path):
                info = self.data_name_or_path.split("/")
                self.dataset_split = info[-1] if len(info) == 3 else "train"
                self.data_name_or_path = "/".join(info[:-1]) if len(info) == 3 else "/".join(info)
                self.dataset_language = "default"
                if ":" in self.data_name_or_path:
                    self.data_name_or_path, self.dataset_language = self.data_name_or_path.split(":")


@dataclass
class RetrieverTrainingArguments(TrainingArguments):
    train_type: str = field(default="pairwise", metadata={"help": "train type of point, pair, or list"})
    negatives_cross_device: bool = field(default=False, metadata={"help": "share negatives across devices"})
    temperature: Optional[float] = field(default=0.02)
    fix_position_embedding: bool = field(
        default=False, metadata={"help": "Freeze the parameters of position embeddings"}
    )
    pooling_method: str = field(default="cls", metadata={"help": "the pooling method, should be cls or mean"})
    normalized: bool = field(default=True)
    loss_fn: str = field(default="infonce")
    use_inbatch_negative: bool = field(
        default=True, metadata={"help": "use documents in the same batch as negatives"}
    )
    remove_unused_columns: bool = field(default=False)
    use_lora: bool = field(default=False)
    use_bnb_config: bool = field(default=False)
    do_encode: bool = field(default=False, metadata={"help": "run the encoding loop"})
    report_to: Optional[List[str]] = field(
        default="none", metadata={"help": "The list of integrations to report the results and logs to."}
    )


def main():
    parser = HfArgumentParser((ModelArguments, DataArguments, RetrieverTrainingArguments))
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
    model_args: ModelArguments
    data_args: DataArguments
    training_args: TrainingArguments

    if (
        os.path.exists(training_args.output_dir)
        and os.listdir(training_args.output_dir)
        and training_args.do_train
        and not training_args.overwrite_output_dir
    ):
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to overcome."
        )

    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if training_args.local_rank in [-1, 0] else logging.WARN,
    )
    logger.warning(
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        training_args.local_rank,
        training_args.device,
        training_args.n_gpu,
        bool(training_args.local_rank != -1),
        training_args.fp16,
    )
    logger.info("Training/evaluation parameters %s", training_args)
    logger.info("Model parameters %s", model_args)
    logger.info("Data parameters %s", data_args)

    set_seed(training_args.seed)

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=False,
    )

    if training_args.use_bnb_config:
        from transformers import BitsAndBytesConfig

        logger.info("Use quantization bnb config")
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    else:
        quantization_config = None

    if training_args.do_train:
        model = AutoModelForEmbedding.from_pretrained(
            model_name_or_path=model_args.model_name_or_path,
            pooling_method=training_args.pooling_method,
            use_lora=training_args.use_lora,
            quantization_config=quantization_config,
        )

        loss_fn = AutoLoss(
            loss_name=training_args.loss_fn,
            loss_kwargs={
                "use_inbatch_negative": training_args.use_inbatch_negative,
                "temperature": training_args.temperature,
            },
        )
        model = model.set_train_type(
            "pairwise",
            loss_fn=loss_fn,
        )

        train_dataset = RetrievalTrainDataset(
            args=data_args,
            tokenizer=tokenizer,
            positive_key=data_args.positive_key,
            negative_key=data_args.negative_key,
        )
        logger.info(f"Total training examples: {len(train_dataset)}")

        trainer = RetrievalTrainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            data_collator=TripletCollator(
                tokenizer,
                query_max_length=data_args.query_max_length,
                document_max_length=data_args.document_max_length,
                positive_key=data_args.positive_key,
                negative_key=data_args.negative_key,
            ),
        )

        Path(training_args.output_dir).mkdir(parents=True, exist_ok=True)

        trainer.train()
        # trainer.save_model(training_args.output_dir)
        model.save_pretrained(training_args.output_dir)

        if trainer.is_world_process_zero():
            tokenizer.save_pretrained(training_args.output_dir)

    if training_args.do_encode:
        model = AutoModelForEmbedding.from_pretrained(
            model_name_or_path=model_args.model_name_or_path,
            pooling_method=training_args.pooling_method,
            use_lora=training_args.use_lora,
            quantization_config=quantization_config,
            lora_path=model_args.lora_path,
        )
        max_length = data_args.query_max_length if data_args.is_query else data_args.document_max_length
        logger.info(f"Encoding will be saved in {training_args.output_dir}")

        encode_dataset = EncodeDataset(args=data_args, tokenizer=tokenizer, max_length=max_length, text_key="text")
        logger.info(f"Number of train samples: {len(encode_dataset)}, max_length: {max_length}")

        encode_loader = DataLoader(
            encode_dataset,
            batch_size=training_args.per_device_eval_batch_size,
            collate_fn=EncodeCollator(tokenizer, max_length=max_length, padding="max_length"),
            shuffle=False,
            drop_last=False,
            num_workers=training_args.dataloader_num_workers,
        )

        embeddings = model.encode(encode_loader, show_progress_bar=True, convert_to_numpy=True)
        lookup_indices = list(range(len(encode_dataset)))

        with open(os.path.join(training_args.output_dir, data_args.encoding_save_file), "wb") as f:
            pickle.dump((embeddings, lookup_indices), f)


if __name__ == "__main__":
    main()
```
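The training file is read through `RetrievalTrainDataset` with the `positive_key` / `negative_key` names configured in run.sh below. The original post does not show the t2-ranking conversion, so the record below is a hypothetical illustration of what one line of `t2_ranking.jsonl` might look like, with the query field name assumed to be "query":

```python
# Hypothetical illustration of one line in t2_ranking.jsonl (not real data).
# "positive"/"negative" match --positive_key/--negative_key in run.sh;
# the query field name is an assumption.
import json

record = {
    "query": "深度学习中的对比学习是什么？",
    "positive": ["对比学习通过拉近正样本、推远负样本来学习句子表示。"],
    "negative": ["今天天气不错，适合去公园散步。"],
}
print(json.dumps(record, ensure_ascii=False))
```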
The final launch script, saved as run.sh:

```shell
MODEL_NAME="moka-ai/m3e-base"

TRAIN_DATA="/root/kag101/src/open-retrievals/t2/t2_ranking.jsonl"
OUTPUT_DIR="/root/kag101/src/open-retrievals/t2/ft_out"

# loss_fn: infonce, simcse

deepspeed -m --include localhost:0 embed.py \
  --deepspeed ds_zero2_no_offload.json \
  --output_dir $OUTPUT_DIR \
  --overwrite_output_dir \
  --model_name_or_path $MODEL_NAME \
  --do_train \
  --data_name_or_path $TRAIN_DATA \
  --positive_key positive \
  --negative_key negative \
  --pooling_method mean \
  --loss_fn infonce \
  --use_lora False \
  --query_instruction "" \
  --document_instruction "" \
  --learning_rate 3e-5 \
  --fp16 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 32 \
  --dataloader_drop_last True \
  --query_max_length 64 \
  --document_max_length 256 \
  --train_group_size 4 \
  --logging_steps 100 \
  --temperature 0.02 \
  --save_total_limit 1 \
  --use_inbatch_negative false
```
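After training finishes, the checkpoint written to OUTPUT_DIR can be loaded back with the same `AutoModelForEmbedding` API used in section 2. A minimal sketch, assuming the output path from run.sh and the same mean pooling; the query string is just an illustration:

```python
# Minimal sketch: load the checkpoint saved by run.sh and encode a query with it.
# The path is the OUTPUT_DIR assumed in the launch script above.
from retrievals import AutoModelForEmbedding

finetuned = AutoModelForEmbedding.from_pretrained(
    '/root/kag101/src/open-retrievals/t2/ft_out',
    pooling_method='mean',  # keep the pooling method used during fine-tuning
)
query_embeddings = finetuned.encode(['文本嵌入模型如何进行微调？'])
print(query_embeddings)
```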
4. Testing

Performance before and after fine-tuning is measured by the C-MTEB T2Ranking score.

Using the InfoNCE loss without in-batch negatives, so that training focuses on the hard negatives, fine-tuning improves MAP from 0.654 to 0.692 and MRR from 0.754 to 0.805.

For comparison, the same fine-tune launched directly with torchrun instead of DeepSpeed gives a slightly lower MAP and a slightly higher MRR. My guess is that some of the "auto" values in the DeepSpeed config do not resolve to exactly the same settings as a plain run.
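As a reminder of what these numbers measure, the sketch below computes MRR and MAP on toy ranked lists; it is a generic illustration of the metrics, not the C-MTEB evaluation code.

```python
# Generic sketch of the reported metrics (not the C-MTEB evaluation code):
# MRR uses the rank of the first relevant document; AP averages precision
# at each relevant position, and MAP is the mean of AP over queries.
from typing import List

def reciprocal_rank(relevance: List[int]) -> float:
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance: List[int]) -> float:
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# relevance of the top-ranked documents for two toy queries (1 = relevant)
ranked = [[0, 1, 0, 1], [1, 0, 0, 0]]
print("MRR:", sum(reciprocal_rank(r) for r in ranked) / len(ranked))
print("MAP:", sum(average_precision(r) for r in ranked) / len(ranked))
```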