1. Save Best
Training a large model today may get interrupted, but the model can pick up where it left off: if a GPU dies, training can resume from the existing weights. Training is essentially just updating the weights w, so if w has been checkpointed along the way, you can resume from a previously saved, reasonably good checkpoint.
The code below saves the weights every 500 steps by default; the `save_best_only` flag chooses whether to keep only the best weights.
```python
import os
import torch

class SaveCheckpointsCallback:
    def __init__(self, save_dir, save_step=500, save_best_only=True):
        """Save checkpoints every `save_step` steps.

        We save checkpoints by step in this implementation. Usually, training
        scripts with pytorch evaluate the model and save a checkpoint by step.

        Args:
            save_dir (str): dir to save checkpoints
            save_step (int, optional): the frequency to save checkpoints. Defaults to 500.
            save_best_only (bool, optional): If True, only save the best model;
                otherwise save a model at every save step.
        """
        self.save_dir = save_dir                # save path
        self.save_step = save_step              # save frequency in steps
        self.save_best_only = save_best_only    # whether to keep only the best model
        self.best_metrics = -1                  # the metric cannot be negative, so initialize to -1
        # mkdir
        if not os.path.exists(self.save_dir):   # create the save path if it does not exist
            os.mkdir(self.save_dir)

    def __call__(self, step, state_dict, metric=None):
        if step % self.save_step != 0:          # only save every save_step steps
            return
        if self.save_best_only:
            assert metric is not None           # a metric must be passed in
            if metric > self.best_metrics:
                # save checkpoint: overwrite the previous best model; save only the
                # state_dict (model weights), not the step or the optimizer state
                torch.save(state_dict, os.path.join(self.save_dir, "best.ckpt"))
                # update best metrics
                self.best_metrics = metric
        else:
            # save a model per step without overwriting: the step is in the filename,
            # and again only the state_dict is saved, not the optimizer state
            torch.save(state_dict, os.path.join(self.save_dir, f"{step}.ckpt"))
```
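To see the save-best decision in isolation, here is a minimal runnable sketch. It is not the callback above: the class name `SaveBestSketch` is made up for illustration, and `pickle.dump` stands in for `torch.save` so the snippet runs without PyTorch; the keep-only-the-best logic is the same.

```python
import os
import pickle
import tempfile

class SaveBestSketch:
    """Simplified save-best: keep only the checkpoint with the highest metric."""
    def __init__(self, save_dir):
        self.save_dir = save_dir
        self.best_metric = -1                  # metrics are non-negative, so -1 always loses
        os.makedirs(save_dir, exist_ok=True)
    def __call__(self, state_dict, metric):
        if metric > self.best_metric:          # improvement: overwrite best.ckpt
            with open(os.path.join(self.save_dir, "best.ckpt"), "wb") as f:
                pickle.dump(state_dict, f)     # stand-in for torch.save
            self.best_metric = metric

with tempfile.TemporaryDirectory() as d:
    cb = SaveBestSketch(d)
    cb({"w": 1}, metric=0.7)   # saved: first improvement over -1
    cb({"w": 2}, metric=0.6)   # skipped: worse than 0.7
    cb({"w": 3}, metric=0.9)   # saved: new best, overwrites best.ckpt
    with open(os.path.join(d, "best.ckpt"), "rb") as f:
        print(pickle.load(f))  # the weights from the 0.9 run
```

Because the file name is always `best.ckpt`, each improvement overwrites the last, so disk usage stays constant no matter how long training runs.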
2. Early Stop
If, as training goes on, the validation accuracy starts to drop (or the validation loss starts to rise), early stopping is needed.
```python
class EarlyStopCallback:
    def __init__(self, patience=5, min_delta=0.01):
        """
        Args:
            patience (int, optional): Number of steps with no improvement after
                which training will be stopped. Defaults to 5.
            min_delta (float, optional): Minimum change in the monitored quantity
                to qualify as an improvement, i.e. an absolute change of less than
                min_delta will count as no improvement. Defaults to 0.01.
        """
        self.patience = patience    # stop training after this many steps without improvement
        self.min_delta = min_delta  # minimum change that counts as an improvement
        self.best_metric = -1
        self.counter = 0            # counts steps without improvement

    def __call__(self, metric):
        if metric > self.best_metric + self.min_delta:  # monitoring accuracy here
            # update best metric
            self.best_metric = metric
            # reset counter
            self.counter = 0
        else:
            self.counter += 1       # increment; used by the patience check below

    @property  # @property lets callers write obj.early_stop instead of obj.early_stop()
    def early_stop(self):
        return self.counter >= self.patience
```
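The callback can be driven like this. The accuracy curve below is made up to show a plateau; the class is restated compactly here (same logic as above) so the snippet runs on its own.

```python
class EarlyStop:
    """Compact restatement of the early-stop logic above."""
    def __init__(self, patience=5, min_delta=0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best_metric = -1
        self.counter = 0
    def __call__(self, metric):
        if metric > self.best_metric + self.min_delta:  # real improvement
            self.best_metric = metric
            self.counter = 0
        else:
            self.counter += 1
    @property
    def early_stop(self):
        return self.counter >= self.patience

stopper = EarlyStop(patience=3, min_delta=0.01)
# made-up validation accuracies that stall after reaching 0.75
accuracies = [0.60, 0.70, 0.75, 0.752, 0.751, 0.753, 0.749]
stopped_at = None
for step, acc in enumerate(accuracies):
    stopper(acc)            # feed the metric after each evaluation
    if stopper.early_stop:  # three consecutive steps under best + min_delta
        stopped_at = step
        break
print(stopped_at)  # → 5
```

Note that 0.752 and 0.753 do not reset the counter: they beat the best value 0.75, but not by the required `min_delta` of 0.01, so they count as no improvement.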
3. TensorBoard
For TensorBoard visualization, first install the package:

```shell
pip install tensorboard
```
During training, you can start the TensorBoard service with the following command. Note: use an absolute path for `--logdir`, otherwise it may fail.

```shell
tensorboard --logdir=D:\PycharmProjects\pythondl\chapter_2_torch\runs --host 0.0.0.0 --port 8848
```
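TensorBoard only displays what the training loop writes to the log directory, typically via `torch.utils.tensorboard.SummaryWriter`. Here is a minimal sketch of that logging side; the loop, the loss/accuracy values, and the tag names are made up for illustration, and the import is guarded so the sketch still runs where torch/tensorboard is not installed.

```python
import tempfile

try:
    from torch.utils.tensorboard import SummaryWriter
    HAVE_TB = True
except ImportError:
    HAVE_TB = False  # sketch still runs without torch/tensorboard

def train_with_logging(num_steps, log_dir=None):
    """Toy training loop that logs one loss and one accuracy value per step."""
    writer = SummaryWriter(log_dir) if (HAVE_TB and log_dir) else None
    history = []
    for step in range(num_steps):
        loss = 1.0 / (step + 1)   # stand-in for a real training loss
        acc = 1.0 - loss / 2.0    # stand-in for a real validation metric
        if writer is not None:
            # each tag becomes one curve in the TensorBoard UI
            writer.add_scalar("train/loss", loss, step)
            writer.add_scalar("valid/acc", acc, step)
        history.append((step, loss, acc))
    if writer is not None:
        writer.close()  # flush event files so TensorBoard can read them
    return history

log_dir = tempfile.mkdtemp() if HAVE_TB else None
history = train_with_logging(num_steps=3, log_dir=log_dir)
print(history[-1])  # last (step, loss, acc)
```

Point the `--logdir` of the command above at the directory passed to `SummaryWriter`, and the curves appear in the browser while training is still running.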