Monitoring

DeepSpeed's monitor supports several backends: TensorBoard, WandB, Comet, and CSV files.

[Figure: TensorBoard example]

Automatic monitoring

DeepSpeed records the important metrics automatically. You only need to enable the corresponding dashboard backend in the config file:

{
  "tensorboard": {
    "enabled": true,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
  },
  "wandb": {
    "enabled": true,
    "team": "my_team",
    "group": "my_group",
    "project": "my_project"
  },
  "comet": {
    "enabled": true,
    "project": "my_project",
    "experiment_name": "my_experiment"
  },
  "csv_monitor": {
    "enabled": true,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
  }
}

Custom monitoring

# trainloader, model_engine, criterion and fp16 are assumed to be defined by the surrounding training script.
import time

# Step 1: Import monitor (and DeepSpeed config, if needed)
from deepspeed.monitor.monitor import MonitorMaster
from deepspeed.runtime.config import DeepSpeedConfig

# Step 2: Initialize monitor with DeepSpeed config (get DeepSpeed config object, if needed)
ds_config = DeepSpeedConfig("ds_config.json")
monitor = MonitorMaster(ds_config.monitor_config)

for epoch in range(2):
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        pre = time.time()
        inputs, labels = data[0].to(model_engine.local_rank), data[1].to(model_engine.local_rank)
        if fp16:
            inputs = inputs.half()
        outputs = model_engine(inputs)
        loss = criterion(outputs, labels)
        model_engine.backward(loss)
        model_engine.step()
        post = time.time()
        # Step 3: Create list of 3-tuple records (single entry in this case)
        events = [("Time per step", post - pre, model_engine.global_samples)]
        # Step 4: Call monitor.write_events on the list from step 3
        monitor.write_events(events)

Each event is a 3-tuple of (chart name, y-axis value, x-axis value). In ("Time per step", post - pre, model_engine.global_samples), the chart is named "Time per step", the y value is the elapsed time of the step, and the x value is the number of samples processed so far.

Communication logging

Note: once logging is enabled, all communication becomes synchronous, which hurts performance.

All communication issued through deepspeed.comm is recorded.

Enable it in the config file:

"comms_logger": {
  "enabled": true,
  "verbose": false,
  "prof_all": true,
  "debug": false
}

verbose: writes out each communication op as it happens during the run. Example:

[2022-06-26 01:39:55,722] [INFO] [logging.py:69:log_dist] [Rank 0] rank0 | comm op: reduce_scatter_tensor | time (ms): 9.46 | msg size: 678.86 MB | algbw (Gbps): 1204.52 | busbw (Gbps): 1129.23
[2022-06-26 01:39:56,470] [INFO] [logging.py:69:log_dist] [Rank 0] rank0 | comm op: all_gather_into_tensor | time (ms): 0.11 | msg size: 6.0 MB | algbw (Gbps): 954.41 | busbw (Gbps): 894.76
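[2022-06-26 01:39:56,471] [INFO] [logging.py:69:log_dist] [Rank 0] rank0 | comm op: all_gather_into_tensor | time (ms): 0.08 | msg size: 6.0 MB | algbw (Gbps): 1293.47 | busbw (Gbps): 1212.63

algbw: algorithm bandwidth, i.e. the size of the data communicated divided by the actual communication time.

busbw: bus bandwidth, i.e. algbw scaled by a factor that depends on the collective and the number of ranks, so that the result can be compared with the hardware's theoretical bandwidth. It is not itself a fixed value: in the log lines above it is 15/16 of algbw on every line, which matches an (n-1)/n factor for 16 ranks.

If the measured bandwidth is far below the hardware's theoretical bandwidth, communication is performing poorly and needs further optimization.

Summary mode: call deepspeed.comm.log_summary() to print an aggregate table:

Comm. Op               Message Size  Count  Total Latency(ms)  Avg Latency(ms)  tput_avg (Gbps)  busbw_avg (Gbps)
broadcast
                       2.0 KB        146    11.12              0.08             0.43             0.41
                       98.25 MB      1      8317.12            8317.12          0.20             0.19
reduce_scatter_tensor
                       678.86 MB     40     602.29             9.69             1468.06          1376.31

Showing communication wait time

dist.log_summary(show_straggler=True)

Here dist is deepspeed.comm. The wait time is computed as follows: for each collective, take every rank's completion time and subtract the completion time of the fastest rank; these differences are then summed.

straggler = sum(t_collectives - allreduce(t_collectives, MIN))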
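
The straggler formula can be tried out without a cluster. The following is a minimal pure-Python sketch, not DeepSpeed's implementation: the function name `straggler_effect` and the sample timings are hypothetical, `min()` over a list stands in for `allreduce(t_collectives, MIN)`, and keeping one running sum per rank (rather than a single grand total) is an assumption made for illustration.

```python
def straggler_effect(latencies_ms):
    """Sum, per rank, the time spent beyond the fastest rank in each collective.

    latencies_ms[c][r] is the completion time (ms) of rank r for collective c.
    Hypothetical helper for illustration only; it makes no DeepSpeed calls.
    """
    num_ranks = len(latencies_ms[0])
    waits = [0.0] * num_ranks
    for per_rank in latencies_ms:
        # In a real job this minimum would come from an all-reduce with MIN.
        fastest = min(per_rank)
        for rank, t in enumerate(per_rank):
            waits[rank] += t - fastest
    return waits


# Two collectives measured on four ranks (made-up numbers).
timings = [
    [9.0, 9.5, 12.0, 9.0],
    [0.5, 0.5, 0.5, 2.5],
]
per_rank_wait = straggler_effect(timings)
print(per_rank_wait)  # [0.0, 0.5, 3.0, 2.0]
```

A rank whose sum stays near zero was always among the fastest to finish; large sums point at ranks that repeatedly spent extra time inside collectives.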