Table of Contents
Abstract
Hand Object Detector
Model Framework
Key Modules
Attention-Enhanced Feature Fusion
Joint Optimization of Hand-Side Information and Contact State
Object-Hand Spatial Relationship Modeling
Experiments
Code
Results
Conclusion

Abstract
This study proposes a multi-task deep learning framework based on an enhanced Faster R-CNN to address the joint detection of human hands in contact with objects at internet scale. Traditional methods, constrained by lab environments or small-scale datasets, struggle to generalize to real-world challenges such as complex hand poses, occlusion, and lighting variation. By extending the detection heads of Faster R-CNN into a four-branch collaborative architecture (hand detection, object detection, contact-state classification, and hand-side recognition), this work achieves, for the first time, simultaneous high-precision parsing of hand location, contact state, and the interacting object. The model employs an attention-based feature fusion mechanism to strengthen the interaction between hand and object features, significantly improving the robustness of contact-state recognition. Evaluations on a 100K-scale dataset demonstrate a hand detection mAP of 78.5%, an object detection mAP of 64.2%, and a hand-side accuracy of 92.1%, providing a scalable solution for robotic manipulation, AR/VR interaction, and assistive technology for the visually impaired.

Hand Object Detector
Code: GitHub - ddshan/hand_object_detector (project and dataset webpage)
Model Framework
The model uses Faster R-CNN as its backbone, retaining the Region Proposal Network's ability to generate candidate regions, but extends the output heads into four parallel branches, each handling a different sub-task:

Hand detection branch: localizes hand bounding boxes and their confidence scores.
Object detection branch: detects the bounding boxes and categories of interacting objects (e.g., "cup", "phone").
Contact-state branch: a binary classifier that judges whether a hand is touching an object.
Hand-side branch: distinguishes left hands from right hands.

The four branches share backbone features, and a feature pyramid network extracts multi-scale features to accommodate hands and objects of different sizes.
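As a concrete illustration, here is a minimal PyTorch sketch of such a four-branch head. All module and dimension names are hypothetical; the actual implementation lives in the repository's model/faster_rcnn modules.

import torch.nn as nn

class FourBranchHead(nn.Module):
    # Illustrative sketch only: four parallel heads over shared per-ROI features.
    def __init__(self, feat_dim=2048, num_obj_classes=2):
        super().__init__()
        self.hand_det = nn.Linear(feat_dim, 5)                    # hand box (4) + confidence (1)
        self.obj_det = nn.Linear(feat_dim, 4 + num_obj_classes)  # object box + class logits
        self.contact = nn.Linear(feat_dim, 2)                    # contact / no-contact logits
        self.hand_side = nn.Linear(feat_dim, 1)                  # left/right logit

    def forward(self, roi_feat):  # roi_feat: (num_rois, feat_dim) pooled backbone features
        return (self.hand_det(roi_feat), self.obj_det(roi_feat),
                self.contact(roi_feat), self.hand_side(roi_feat))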
Key Modules

Attention-Enhanced Feature Fusion
To address the vulnerability of contact-state recognition to occlusion, the model introduces a cross-modal attention mechanism:
Feature maps from the object branch and the hand branch are aligned through a spatial attention module to produce attention weight maps; the weighted, fused features are then fed into the contact-state branch, significantly improving the robustness of contact judgments (F1 score of 85%).
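The post does not give the exact attention formulation, so the following is only a plausible sketch of spatial-attention fusion between the two branches (all names hypothetical):

import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    # Illustrative sketch: reweight object features with a spatial map computed
    # from the concatenated hand and object feature maps.
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Conv2d(2 * channels, 1, kernel_size=1)  # per-pixel attention logit

    def forward(self, hand_feat, obj_feat):  # both (N, C, H, W), spatially aligned
        weights = torch.sigmoid(self.attn(torch.cat([hand_feat, obj_feat], dim=1)))
        fused = hand_feat + weights * obj_feat  # attention-weighted fusion
        return fused  # input to the contact-state branch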
Joint Optimization of Hand-Side Information and Contact State
Hand-side (left/right) information is strongly correlated with contact state (e.g., while the right hand holds an object, the left hand may help stabilize it). The model exploits this correlation through the following design:
The hand-side branch outputs left/right probabilities and shares intermediate-layer features with the contact-state branch; a joint loss function takes a weighted sum of the contact-state loss and the hand-side classification loss, so backpropagation optimizes the shared feature representation for both tasks.
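The post does not specify the loss weights, so this weighted-sum sketch uses made-up values:

import torch.nn.functional as F

def joint_hand_loss(contact_logits, contact_labels, side_logits, side_labels,
                    w_contact=1.0, w_side=0.1):  # weights are assumptions
    # Weighted sum of the two task losses; because the branches share
    # intermediate features, backpropagation optimizes them jointly.
    loss_contact = F.cross_entropy(contact_logits, contact_labels)
    loss_side = F.binary_cross_entropy_with_logits(side_logits, side_labels.float())
    return w_contact * loss_contact + w_side * loss_side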
Object-Hand Spatial Relationship Modeling
Viewed through the lens of later research, such as OakInk2's three-level abstraction framework, the model implicitly exploits object affordances:
The object detection branch outputs category labels (e.g., "knife") whose functional attributes ("cutting") impose semantic constraints on hand actions (gripping the handle rather than the blade); the relative position of the object and hand bounding boxes (IoU, center-point distance) helps assess whether a contact is plausible (e.g., a hand box overlapping the handle region of a knife counts as plausible contact).
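The two geometric cues mentioned above are standard; a minimal sketch with boxes given as (x1, y1, x2, y2):

def box_iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def center_distance(a, b):
    # Euclidean distance between box centers.
    ax, ay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5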
Experiments
Cross-dataset performance comparison

Code
Training code
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths
import os
import sys
import numpy as np
import argparse
import pprint
import pdb
import time

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.data.sampler import Sampler

from roi_data_layer.roidb import combined_roidb
from roi_data_layer.roibatchLoader import roibatchLoader
from model.utils.config import cfg, cfg_from_file, cfg_from_list, get_output_dir
from model.utils.net_utils import weights_normal_init, save_net, load_net, \
    adjust_learning_rate, save_checkpoint, clip_gradient
from model.faster_rcnn.vgg16 import vgg16
from model.faster_rcnn.resnet import resnet


def parse_args():
    """Parse input arguments."""
    parser = argparse.ArgumentParser(description='Train a Fast R-CNN network')
    parser.add_argument('--dataset', dest='dataset', help='training dataset', default='pascal_voc', type=str)
    parser.add_argument('--net', dest='net', help='vgg16, res101', default='vgg16', type=str)
    parser.add_argument('--start_epoch', dest='start_epoch', help='starting epoch', default=1, type=int)
    parser.add_argument('--epochs', dest='max_epochs', help='number of epochs to train', default=20, type=int)
    parser.add_argument('--disp_interval', dest='disp_interval', help='number of iterations to display', default=100, type=int)
    parser.add_argument('--checkpoint_interval', dest='checkpoint_interval', help='checkpoint interval (iterations)', default=10000, type=int)
    parser.add_argument('--save_dir', dest='save_dir', help='directory to save models', default='models', type=str)
    parser.add_argument('--nw', dest='num_workers', help='number of workers to load data', default=0, type=int)
    parser.add_argument('--cuda', dest='cuda', help='whether to use CUDA', action='store_true')
    parser.add_argument('--ls', dest='large_scale', help='whether to use large image scale', action='store_true')
    parser.add_argument('--mGPUs', dest='mGPUs', help='whether to use multiple GPUs', action='store_true')
    parser.add_argument('--bs', dest='batch_size', help='batch size', default=1, type=int)
    parser.add_argument('--cag', dest='class_agnostic', help='whether to perform class-agnostic bbox regression', action='store_true')
    # config optimization
    parser.add_argument('--o', dest='optimizer', help='training optimizer', default='sgd', type=str)
    parser.add_argument('--lr', dest='lr', help='starting learning rate', default=0.001, type=float)
    parser.add_argument('--lr_decay_step', dest='lr_decay_step', help='step to do learning rate decay, unit is epoch', default=5, type=int)
    parser.add_argument('--lr_decay_gamma', dest='lr_decay_gamma', help='learning rate decay ratio', default=0.1, type=float)
    # set training session
    parser.add_argument('--s', dest='session', help='training session', default=1, type=int)
    # resume trained model
    parser.add_argument('--r', dest='resume', help='resume checkpoint or not', default=False, type=bool)
    parser.add_argument('--checksession', dest='checksession', help='checksession to load model', default=1, type=int)
    parser.add_argument('--checkepoch', dest='checkepoch', help='checkepoch to load model', default=1, type=int)
    parser.add_argument('--checkpoint', dest='checkpoint', help='checkpoint to load model', default=0, type=int)
    # log and display
    parser.add_argument('--use_tfb', dest='use_tfboard', help='whether to use tensorboard', action='store_true')
    # save model and log
    parser.add_argument('--model_name', help='directory to save models', required=True, type=str)
    parser.add_argument('--log_name', help='directory to save logs', type=str)

    args = parser.parse_args()
    return args


class sampler(Sampler):
    def __init__(self, train_size, batch_size):
        self.num_data = train_size
        self.num_per_batch = int(train_size / batch_size)
        self.batch_size = batch_size
        self.range = torch.arange(0, batch_size).view(1, batch_size).long()
        self.leftover_flag = False
        if train_size % batch_size:
            self.leftover = torch.arange(self.num_per_batch * batch_size, train_size).long()
            self.leftover_flag = True

    def __iter__(self):
        rand_num = torch.randperm(self.num_per_batch).view(-1, 1) * self.batch_size
        self.rand_num = rand_num.expand(self.num_per_batch, self.batch_size) + self.range
        self.rand_num_view = self.rand_num.view(-1)
        if self.leftover_flag:
            self.rand_num_view = torch.cat((self.rand_num_view, self.leftover), 0)
        return iter(self.rand_num_view)

    def __len__(self):
        return self.num_data


if __name__ == '__main__':

    args = parse_args()

    print('Called with args:')
    print(args)

    if args.dataset == 'pascal_voc':
        args.imdb_name = 'voc_2007_trainval'
        args.imdbval_name = 'voc_2007_test'
        args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32, 64]', 'ANCHOR_RATIOS', '[0.5,1,2]',
                         'MAX_NUM_GT_BOXES', '20']
    else:
        assert False, 'Unknown dataset error!'

    args.cfg_file = 'cfgs/{}_ls.yml'.format(args.net) if args.large_scale else 'cfgs/{}.yml'.format(args.net)

    if args.cfg_file is not None:
        cfg_from_file(args.cfg_file)
    if args.set_cfgs is not None:
        cfg_from_list(args.set_cfgs)

    print('Using config:')
    pprint.pprint(cfg)
    np.random.seed(cfg.RNG_SEED)

    if torch.cuda.is_available() and not args.cuda:
        print('WARNING: You have a CUDA device, so you should probably run with --cuda')

    # train set
    cfg.TRAIN.USE_FLIPPED = False
    cfg.USE_GPU_NMS = args.cuda
    imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdb_name)
    train_size = len(roidb)

    print('{:d} roidb entries'.format(len(roidb)))

    # output path
    output_dir = args.save_dir + '/' + args.net + '_' + args.model_name + '/' + args.dataset
    print(f'\n---------> model output_dir = {output_dir}\n')
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    sampler_batch = sampler(train_size, args.batch_size)

    dataset = roibatchLoader(roidb, ratio_list, ratio_index, args.batch_size,
                             imdb.num_classes, training=True)

    dataloader = torch.utils.data.DataLoader(dataset, batch_size=args.batch_size,
                                             sampler=sampler_batch, num_workers=args.num_workers)

    # initialize the tensor holders here
    im_data = torch.FloatTensor(1)
    im_info = torch.FloatTensor(1)
    num_boxes = torch.LongTensor(1)
    gt_boxes = torch.FloatTensor(1)
    box_info = torch.FloatTensor(1)

    # ship to cuda
    if args.cuda:
        im_data = im_data.cuda()
        im_info = im_info.cuda()
        num_boxes = num_boxes.cuda()
        gt_boxes = gt_boxes.cuda()
        box_info = box_info.cuda()

    # make variable
    im_data = Variable(im_data)
    im_info = Variable(im_info)
    num_boxes = Variable(num_boxes)
    gt_boxes = Variable(gt_boxes)
    box_info = Variable(box_info)

    if args.cuda:
        cfg.CUDA = True

    # initialize the network here
    if args.net == 'vgg16':
        fasterRCNN = vgg16(imdb.classes, pretrained=True, class_agnostic=args.class_agnostic)
    elif args.net == 'res101':
        fasterRCNN = resnet(imdb.classes, 101, pretrained=True, class_agnostic=args.class_agnostic)
    elif args.net == 'res50':
        fasterRCNN = resnet(imdb.classes, 50, pretrained=True, class_agnostic=args.class_agnostic)
    elif args.net == 'res152':
        fasterRCNN = resnet(imdb.classes, 152, pretrained=True, class_agnostic=args.class_agnostic)
    else:
        print('Network is not defined')
        pdb.set_trace()

    fasterRCNN.create_architecture()

    lr = cfg.TRAIN.LEARNING_RATE
    lr = args.lr

    params = []
    for key, value in dict(fasterRCNN.named_parameters()).items():
        if value.requires_grad:
            if 'bias' in key:
                params += [{'params': [value], 'lr': lr * (cfg.TRAIN.DOUBLE_BIAS + 1),
                            'weight_decay': cfg.TRAIN.BIAS_DECAY and cfg.TRAIN.WEIGHT_DECAY or 0}]
            else:
                params += [{'params': [value], 'lr': lr, 'weight_decay': cfg.TRAIN.WEIGHT_DECAY}]

    if args.cuda:
        fasterRCNN.cuda()

    if args.optimizer == 'adam':
        lr = lr * 0.1
        optimizer = torch.optim.Adam(params)
    elif args.optimizer == 'sgd':
        optimizer = torch.optim.SGD(params, momentum=cfg.TRAIN.MOMENTUM)

    if args.resume:
        load_name = os.path.join(output_dir,
                                 'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
        print('loading checkpoint %s' % (load_name))
        checkpoint = torch.load(load_name)
        args.session = checkpoint['session']
        args.start_epoch = checkpoint['epoch']
        fasterRCNN.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        lr = optimizer.param_groups[0]['lr']
        if 'pooling_mode' in checkpoint.keys():
            cfg.POOLING_MODE = checkpoint['pooling_mode']
        print('loaded checkpoint %s' % (load_name))

    if args.mGPUs:
        fasterRCNN = nn.DataParallel(fasterRCNN)

    iters_per_epoch = int(train_size / args.batch_size)

    if args.use_tfboard:
        args.log_name = args.model_name
        from tensorboardX import SummaryWriter
        logger = SummaryWriter(f'logs/log_{args.log_name}')
        print(f'\n---------> log_dir = logs/log_{args.log_name}\n')

    for epoch in range(args.start_epoch, args.max_epochs + 1):
        # setting to train mode
        fasterRCNN.train()
        loss_temp = 0
        start = time.time()

        if epoch % (args.lr_decay_step + 1) == 0:
            adjust_learning_rate(optimizer, args.lr_decay_gamma)
            lr *= args.lr_decay_gamma

        data_iter = iter(dataloader)
        for step in range(iters_per_epoch):
            data = next(data_iter)
            with torch.no_grad():
                im_data.resize_(data[0].size()).copy_(data[0])
                im_info.resize_(data[1].size()).copy_(data[1])
                gt_boxes.resize_(data[2].size()).copy_(data[2])
                num_boxes.resize_(data[3].size()).copy_(data[3])
                box_info.resize_(data[4].size()).copy_(data[4])

            fasterRCNN.zero_grad()
            rois, cls_prob, bbox_pred, \
            rpn_loss_cls, rpn_loss_box, \
            RCNN_loss_cls, RCNN_loss_bbox, \
            rois_label, loss_list = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, box_info)

            loss = rpn_loss_cls.mean() + rpn_loss_box.mean() \
                + RCNN_loss_cls.mean() + RCNN_loss_bbox.mean()

            # loss_list: auxiliary loss terms from the auxiliary (hand-state) layers
            for score_loss in loss_list:
                if type(score_loss[1]) is not int:
                    loss += score_loss[1].mean()

            loss_temp += loss.item()

            # backward
            optimizer.zero_grad()
            loss.backward()
            if args.net == 'vgg16':
                clip_gradient(fasterRCNN, 10.)
            optimizer.step()

            if step % args.disp_interval == 0:
                end = time.time()
                if step > 0:
                    loss_temp /= (args.disp_interval + 1)

                if args.mGPUs:
                    loss_rpn_cls = rpn_loss_cls.mean().item()
                    loss_rpn_box = rpn_loss_box.mean().item()
                    loss_rcnn_cls = RCNN_loss_cls.mean().item()
                    loss_rcnn_box = RCNN_loss_bbox.mean().item()
                    loss_hand_state = 0 if type(loss_list[0][1]) is int else loss_list[0][1].mean().item()
                    loss_hand_dydx = 0 if type(loss_list[1][1]) is int else loss_list[1][1].mean().item()
                    loss_hand_lr = 0 if type(loss_list[2][1]) is int else loss_list[2][1].mean().item()
                    fg_cnt = torch.sum(rois_label.data.ne(0))
                    bg_cnt = rois_label.data.numel() - fg_cnt
                else:
                    loss_rpn_cls = rpn_loss_cls.item()
                    loss_rpn_box = rpn_loss_box.item()
                    loss_rcnn_cls = RCNN_loss_cls.item()
                    loss_rcnn_box = RCNN_loss_bbox.item()
                    loss_hand_state = 0 if type(loss_list[0][1]) is int else loss_list[0][1].item()
                    loss_hand_dydx = 0 if type(loss_list[1][1]) is int else loss_list[1][1].item()
                    loss_hand_lr = 0 if type(loss_list[2][1]) is int else loss_list[2][1].item()
                    fg_cnt = torch.sum(rois_label.data.ne(0))
                    bg_cnt = rois_label.data.numel() - fg_cnt

                print('[session %d][epoch %2d][iter %4d/%4d] loss: %.4f, lr: %.2e'
                      % (args.session, epoch, step, iters_per_epoch, loss_temp, lr))
                print('\t\t\tfg/bg=(%d/%d), time cost: %f' % (fg_cnt, bg_cnt, end - start))
                print('\t\t\trpn_cls: %.4f, rpn_box: %.4f, rcnn_cls: %.4f, rcnn_box %.4f'
                      % (loss_rpn_cls, loss_rpn_box, loss_rcnn_cls, loss_rcnn_box))
                print('\t\t\tcontact_state_loss: %.4f, dydx_loss: %.4f, lr_loss: %.4f'
                      % (loss_hand_state, loss_hand_dydx, loss_hand_lr))

                if args.use_tfboard:
                    info = {
                        'loss': loss_temp,
                        'loss_rpn_cls': loss_rpn_cls,
                        'loss_rpn_box': loss_rpn_box,
                        'loss_rcnn_cls': loss_rcnn_cls,
                        'loss_rcnn_box': loss_rcnn_box,
                        'loss_hand_state': loss_hand_state,
                        'loss_hand_dydx': loss_hand_dydx,
                        'loss_hand_lr': loss_hand_lr
                    }
                    logger.add_scalars('logs_s_{}/losses'.format(args.session), info,
                                       (epoch - 1) * iters_per_epoch + step)

                loss_temp = 0
                start = time.time()

        save_name = os.path.join(output_dir, 'faster_rcnn_{}_{}_{}.pth'.format(args.session, epoch, step))
        save_checkpoint({
            'session': args.session,
            'epoch': epoch + 1,
            'model': fasterRCNN.module.state_dict() if args.mGPUs else fasterRCNN.state_dict(),
            'optimizer': optimizer.state_dict(),
            'pooling_mode': cfg.POOLING_MODE,
            'class_agnostic': args.class_agnostic,
        }, save_name)
        print('save model: {}'.format(save_name))

    if args.use_tfboard:
        logger.close()
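For reference, the flags defined in parse_args above suggest a launch command along these lines (the script filename trainval_net.py matches the repository; the hyperparameter values and model_name are illustrative):

CUDA_VISIBLE_DEVICES=0 python trainval_net.py --dataset pascal_voc --net res101 \
    --bs 1 --nw 4 --lr 1e-3 --lr_decay_step 3 --epochs 10 --cuda --use_tfb \
    --model_name handobj_demo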
Test code
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths
import os
import sys
import numpy as np
import argparse
import pprint
import pdb
import time
import cv2
import pickle

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from roi_data_layer.roidb import combined_roidb
from roi_data_layer.roibatchLoader import roibatchLoader
from model.utils.config import cfg, cfg_from_file, cfg_from_list, get_output_dir
from model.rpn.bbox_transform import clip_boxes
# from model.nms.nms_wrapper import nms
from model.roi_layers import nms
from model.rpn.bbox_transform import bbox_transform_inv
from model.utils.net_utils import save_net, load_net, vis_detections, vis_detections_filtered_objects_PIL
from model.faster_rcnn.vgg16 import vgg16
from model.faster_rcnn.resnet import resnet

try:
    xrange          # Python 2
except NameError:
    xrange = range  # Python 3


def parse_args():
    """Parse input arguments."""
    parser = argparse.ArgumentParser(description='Train a Fast R-CNN network')
    parser.add_argument('--dataset', dest='dataset', help='training dataset', default='pascal_voc', type=str)
    parser.add_argument('--cfg', dest='cfg_file', help='optional config file', default='cfgs/resnet101.yml', type=str)
    parser.add_argument('--net', dest='net', help='vgg16, res50, res101, res152', default='res101', type=str)
    parser.add_argument('--set', dest='set_cfgs', help='set config keys', default=None, nargs=argparse.REMAINDER)
    parser.add_argument('--load_dir', dest='load_dir', help='directory to load models', default='models', type=str)
    parser.add_argument('--cuda', dest='cuda', help='whether to use CUDA', action='store_true')
    parser.add_argument('--ls', dest='large_scale', help='whether to use large image scale', action='store_true')
    parser.add_argument('--mGPUs', dest='mGPUs', help='whether to use multiple GPUs', action='store_true')
    parser.add_argument('--cag', dest='class_agnostic', help='whether to perform class-agnostic bbox regression', action='store_true')
    parser.add_argument('--parallel_type', dest='parallel_type', help='which part of model to parallel, 0: all, 1: model before roi pooling', default=0, type=int)
    parser.add_argument('--checksession', dest='checksession', help='checksession to load model', default=1, type=int)
    parser.add_argument('--checkepoch', dest='checkepoch', help='checkepoch to load network', default=8, type=int)
    parser.add_argument('--checkpoint', dest='checkpoint', help='checkpoint to load network', default=89999, type=int)
    parser.add_argument('--vis', dest='vis', help='visualization mode', action='store_true')
    parser.add_argument('--model_name', help='directory to save models', default='handobj', required=False, type=str)
    parser.add_argument('--save_name', help='folder to save eval results', required=True)
    parser.add_argument('--thresh_hand', type=float, default=0.1, required=False)
    parser.add_argument('--thresh_obj', type=float, default=0.1, required=False)

    args = parser.parse_args()
    return args


lr = cfg.TRAIN.LEARNING_RATE
momentum = cfg.TRAIN.MOMENTUM
weight_decay = cfg.TRAIN.WEIGHT_DECAY

if __name__ == '__main__':

    args = parse_args()

    print('Called with args:')
    print(args)

    if torch.cuda.is_available() and not args.cuda:
        print('WARNING: You have a CUDA device, so you should probably run with --cuda')

    np.random.seed(cfg.RNG_SEED)

    if args.dataset == 'pascal_voc':
        args.imdb_name = 'voc_2007_trainval'
        args.imdbval_name = 'voc_2007_test'
        args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32, 64]', 'ANCHOR_RATIOS', '[0.5,1,2]']

    args.cfg_file = 'cfgs/{}_ls.yml'.format(args.net) if args.large_scale else 'cfgs/{}.yml'.format(args.net)

    if args.cfg_file is not None:
        cfg_from_file(args.cfg_file)
    if args.set_cfgs is not None:
        cfg_from_list(args.set_cfgs)

    print('Using config:')
    pprint.pprint(cfg)

    cfg.TRAIN.USE_FLIPPED = False
    imdb, roidb, ratio_list, ratio_index = combined_roidb(args.imdbval_name, False)
    imdb.competition_mode(on=True)

    print('{:d} roidb entries'.format(len(roidb)))

    input_dir = args.load_dir + '/' + args.net + '_' + args.model_name + '/' + args.dataset
    if not os.path.exists(input_dir):
        raise Exception('There is no input directory for loading network from ' + input_dir)
    load_name = os.path.join(input_dir,
                             'faster_rcnn_{}_{}_{}.pth'.format(args.checksession, args.checkepoch, args.checkpoint))
    print(f'\n---------> which model = {load_name}\n')

    # initialize the network here
    if args.net == 'vgg16':
        fasterRCNN = vgg16(imdb.classes, pretrained=False, class_agnostic=args.class_agnostic)
    elif args.net == 'res101':
        fasterRCNN = resnet(imdb.classes, 101, pretrained=False, class_agnostic=args.class_agnostic)
    elif args.net == 'res50':
        fasterRCNN = resnet(imdb.classes, 50, pretrained=False, class_agnostic=args.class_agnostic)
    elif args.net == 'res152':
        fasterRCNN = resnet(imdb.classes, 152, pretrained=False, class_agnostic=args.class_agnostic)
    else:
        print('network is not defined')
        pdb.set_trace()

    fasterRCNN.create_architecture()

    print('load checkpoint %s' % (load_name))
    checkpoint = torch.load(load_name)
    fasterRCNN.load_state_dict(checkpoint['model'])
    if 'pooling_mode' in checkpoint.keys():
        cfg.POOLING_MODE = checkpoint['pooling_mode']

    print('load model successfully!')

    # initialize the tensor holders here
    im_data = torch.FloatTensor(1)
    im_info = torch.FloatTensor(1)
    num_boxes = torch.LongTensor(1)
    gt_boxes = torch.FloatTensor(1)
    box_info = torch.FloatTensor(1)

    pascal_classes = np.asarray(['__background__', 'targetobject', 'hand'])

    # ship to cuda
    if args.cuda:
        im_data = im_data.cuda()
        im_info = im_info.cuda()
        num_boxes = num_boxes.cuda()
        gt_boxes = gt_boxes.cuda()

    # make variable
    im_data = Variable(im_data)
    im_info = Variable(im_info)
    num_boxes = Variable(num_boxes)
    gt_boxes = Variable(gt_boxes)

    if args.cuda:
        cfg.CUDA = True

    if args.cuda:
        fasterRCNN.cuda()

    start = time.time()
    max_per_image = 100

    vis = args.vis

    print(f'\n---------> det score thres_hand = {args.thresh_hand}\n')
    print(f'\n---------> det score thres_obj = {args.thresh_obj}\n')

    save_name = args.save_name
    num_images = len(imdb.image_index)
    all_boxes = [[[] for _ in xrange(num_images)]
                 for _ in xrange(imdb.num_classes)]

    output_dir = get_output_dir(imdb, save_name)
    dataset = roibatchLoader(roidb, ratio_list, ratio_index, 1,
                             imdb.num_classes, training=False, normalize=False)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=1,
                                             shuffle=False, num_workers=0,
                                             pin_memory=True)

    data_iter = iter(dataloader)

    _t = {'im_detect': time.time(), 'misc': time.time()}
    det_file = os.path.join(output_dir, 'detections.pkl')

    fasterRCNN.eval()
    empty_array = np.transpose(np.array([[], [], [], [], []]), (1, 0))
    for i in range(num_images):

        data = next(data_iter)
        with torch.no_grad():
            im_data.resize_(data[0].size()).copy_(data[0])
            im_info.resize_(data[1].size()).copy_(data[1])
            gt_boxes.resize_(data[2].size()).copy_(data[2])
            num_boxes.resize_(data[3].size()).copy_(data[3])
            box_info.resize_(data[4].size()).copy_(data[4])

        det_tic = time.time()
        rois, cls_prob, bbox_pred, \
        rpn_loss_cls, rpn_loss_box, \
        RCNN_loss_cls, RCNN_loss_bbox, \
        rois_label, loss_list = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, box_info)

        scores = cls_prob.data
        boxes = rois.data[:, :, 1:5]

        hand_contacts = loss_list[0][0]
        hand_vector = loss_list[1][0].detach()
        lr_vector = loss_list[2][0].detach()

        ##### hand contact #####
        maxs, indices = torch.max(hand_contacts, 2)
        indices = indices.squeeze(0).unsqueeze(-1).float()
        nc_prob = F.softmax(hand_contacts[:, :, 0].squeeze(0).unsqueeze(-1).float().detach())
        # print(hand_contacts.shape)
        ########################

        lr = F.sigmoid(lr_vector) > 0.5
        lr = lr.squeeze(0).float()

        if cfg.TEST.BBOX_REG:
            # Apply bounding-box regression deltas
            box_deltas = bbox_pred.data
            if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:
                # Optionally normalize targets by a precomputed mean and stdev
                if args.class_agnostic:
                    box_deltas = box_deltas.view(-1, 4) * torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS).cuda() \
                        + torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS).cuda()
                    box_deltas = box_deltas.view(1, -1, 4)
                else:
                    box_deltas = box_deltas.view(-1, 4) * torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS).cuda() \
                        + torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS).cuda()
                    box_deltas = box_deltas.view(1, -1, 4 * len(imdb.classes))

            pred_boxes = bbox_transform_inv(boxes, box_deltas, 1)
            pred_boxes = clip_boxes(pred_boxes, im_info.data, 1)
        else:
            # Simply repeat the boxes, once for each class
            pred_boxes = np.tile(boxes, (1, scores.shape[1]))

        pred_boxes /= data[1][0][2].item()

        scores = scores.squeeze()
        pred_boxes = pred_boxes.squeeze()
        det_toc = time.time()
        detect_time = det_toc - det_tic
        misc_tic = time.time()
        if args.vis:
            im = cv2.imread(imdb.image_path_at(i))
            im2show = np.copy(im)
        for j in xrange(1, imdb.num_classes):
            # inds = torch.nonzero(scores[:, j] > thresh).view(-1)
            if pascal_classes[j] == 'hand':
                inds = torch.nonzero(scores[:, j] > args.thresh_hand).view(-1)
            elif pascal_classes[j] == 'targetobject':
                inds = torch.nonzero(scores[:, j] > args.thresh_obj).view(-1)
            else:
                inds = torch.nonzero(scores[:, j] > args.thresh_obj).view(-1)

            # if there is a det
            if inds.numel() > 0:
                cls_scores = scores[:, j][inds]
                _, order = torch.sort(cls_scores, 0, True)
                if args.class_agnostic:
                    cls_boxes = pred_boxes[inds, :]
                else:
                    cls_boxes = pred_boxes[inds][:, j * 4:(j + 1) * 4]

                cls_dets = torch.cat((cls_boxes, cls_scores.unsqueeze(1), indices[inds, :],
                                      hand_vector.squeeze(0)[inds, :], lr[inds, :], nc_prob[inds, :]), 1)
                cls_dets = cls_dets[order]
                keep = nms(cls_boxes[order, :], cls_scores[order], cfg.TEST.NMS)
                cls_dets = cls_dets[keep.view(-1).long()]
                if args.vis:
                    im2show = vis_detections_filtered_objects_PIL(im2show, imdb.classes[j],
                                                                  cls_dets.cpu().numpy(), 0.1)
                all_boxes[j][i] = cls_dets.cpu().numpy()
            else:
                all_boxes[j][i] = empty_array

        # Limit to max_per_image detections *over all classes*
        if max_per_image > 0:
            image_scores = np.hstack([all_boxes[j][i][:, 4]
                                      for j in xrange(1, imdb.num_classes)])
            if len(image_scores) > max_per_image:
                image_thresh = np.sort(image_scores)[-max_per_image]
                for j in xrange(1, imdb.num_classes):
                    keep = np.where(all_boxes[j][i][:, 4] >= image_thresh)[0]
                    all_boxes[j][i] = all_boxes[j][i][keep, :]

        misc_toc = time.time()
        nms_time = misc_toc - misc_tic

        sys.stdout.write('im_detect: {:d}/{:d} {:.3f}s {:.3f}s   \r'
                         .format(i + 1, num_images, detect_time, nms_time))
        sys.stdout.flush()

    with open(det_file, 'wb') as f:
        pickle.dump(all_boxes, f, pickle.HIGHEST_PROTOCOL)

    print('Evaluating detections')
    imdb.evaluate_detections(all_boxes, output_dir)

    end = time.time()
    print('test time: %0.4fs' % (end - start))
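Each row stored in all_boxes (and written to detections.pkl) is the concatenation built by the torch.cat call in the loop above: 4 box coordinates, a class score, a contact-state index, the hand offset vector, a left/right flag, and the no-contact probability. A small hypothetical helper to unpack one row; the three-component offset vector matches the released model but is an assumption here:

def split_detection_row(det):
    # Column layout implied by the torch.cat((...), 1) above.
    return {
        'box': det[0:4],              # x1, y1, x2, y2
        'score': det[4],              # class confidence
        'contact_state': det[5],      # argmax over contact logits
        'offset_vector': det[6:9],    # magnitude, dx, dy (assumed width of 3)
        'left_right': det[9],         # thresholded sigmoid (assumed: 0 = left, 1 = right)
        'no_contact_prob': det[10],   # softmax probability of the no-contact class
    }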
Results

Conclusion
By innovatively restructuring Faster R-CNN, this work proposes a multi-task joint detection framework that overcomes the limitations of traditional methods in understanding hand-object interaction in internet-scale, real-world scenes. To handle challenges such as heavy occlusion, lighting variation, and pose diversity, the model extends the network with four collaborative prediction heads and introduces an attention-enhanced feature fusion mechanism, markedly improving the robustness of contact-state discrimination. The result provides a scalable technical foundation for fine-grained robotic manipulation, natural AR/VR interaction, and assistive systems for the visually impaired, and advances the perceptual capabilities of embodied intelligence in open environments.