Optimizing Bird's-Eye-View Object Detection from Multi-Camera Images via Channel Attention and Temporal Transformers

doi:10.12422/j.issn.1672-6952.2024.06.012

Abstract

Camera-based perception and detection systems perform object detection at relatively low cost and high resolution. Bird's-eye-view (BEV) features generated from six monocular cameras can be used for object detection. Because BEV features encode the position and scale of objects, they suit a wide range of autonomous driving tasks. BEV detectors are usually paired with depth pre-trained image backbones, but connecting the two directly does not bring out the correspondence between 2D and 3D features. To address this, channel attention is applied to weight the output feature map and adjust the proposal feature channels, and it is combined with a depth estimation module to highlight the relationship between 2D and 3D features. In addition, a temporal stacking fusion scheme addresses the gradual loss of past information that occurs in inheritance-style fusion, so that the model can make full use of historical information. Extensive experiments on the nuScenes dataset show that the model reaches a nuScenes detection score (NDS) of 0.604, which is 0.035 higher than BEVFormer and validates the effectiveness of the approach.

Key words: Autonomous driving, Bird's-eye-view detection, Channel attention, Object detection, Attention mechanism, Spatiotemporal encoder

CLC number:

TP389.1

Weijie LI, Jun QI, Bin PAN. Optimizing Bird's-Eye-View Object Detection from Multi-Camera Images via Channel Attention and Temporal Transformers[J]. Journal of Liaoning Petrochemical University, 2024, 44(6): 89-96.

Figures/Tables (9)

References (36)

1	XU Y, ZHAI C Y, WANG G L. Vehicle detection system based on adversarial learning and depth estimation[J]. Journal of Liaoning Petrochemical University, 2020, 40(3): 83-90.
2	LI Z Q, WANG W H, LI H Y, et al. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers[C]//Computer Vision-ECCV 2022. Cham: Springer, 2022: 1-18.
3	NG H M, RADIA K, CHEN J F, et al. BEV-Seg: Bird's eye view semantic segmentation using geometry and semantic point cloud[EB/OL]. (2020-06-19)[2023-11-20]. https://arxiv.org/abs/2006.11436.
4	PHILION J, FIDLER S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D[C]//Computer Vision-ECCV 2020. Cham: Springer, 2020: 194-210.
5	HU A, MUREZ Z, MOHAN N, et al. FIERY: Future instance prediction in bird's-eye view from surround monocular cameras[C]//2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal: IEEE, 2021: 15273-15282.
6	BRAZIL G, PONS-MOLL G, et al. Kinematic 3D object detection in monocular video[C]//Computer Vision-ECCV 2020. Cham: Springer, 2020: 135-152.
7	MA X Z, OUYANG W L, SIMONELLI A, et al. 3D object detection from images for autonomous driving: A survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(5): 3537-3556.
8	LUO W J, YANG B, URTASUN R. Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 3569-3577.
9	QI C R, ZHOU Y, NAJIBI M, et al. Offboard 3D object detection from point cloud sequences[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville: IEEE, 2021: 6130-6140.
10	KANG K, OUYANG W L, LI H S, et al. Object detection from video tubelets with convolutional neural networks[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 817-825.
11	YANG C Y, CHEN Y T, TIAN H, et al. BEVFormer v2: Adapting modern image backbones to bird's-eye-view recognition via perspective supervision[C]//2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver: IEEE, 2023: 17830-17839.
12	HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 7132-7141.
13	RODDICK T, KENDALL A, CIPOLLA R. Orthographic feature transform for monocular 3D object detection[C]//30th British Machine Vision Conference 2019. Cardiff: BMVA Press, 2019: 285.
14	WANG Y, CHAO W L, GARG D, et al. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach: IEEE, 2019: 8437-8445.
15	PAN B W, SUN J K, LEUNG H Y T. Cross-view semantic segmentation for sensing surroundings[J]. IEEE Robotics and Automation Letters, 2020, 5(3): 4867-4873.
16	ZHOU B, KRÄHENBÜHL P. Cross-view transformers for real-time map-view semantic segmentation[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans: IEEE, 2022: 13760-13769.
17	LIU Y F, WANG T C, ZHANG X Y, et al. PETR: Position embedding transformation for multi-view 3D object detection[C]//Computer Vision-ECCV 2022. Cham: Springer, 2022: 531-548.
18	LI Y H, BAO H, GE Z, et al. BEVStereo: Enhancing depth estimation in multi-view 3D object detection with dynamic temporal stereo[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(2): 1486-1494.
19	WANG Z R, MIN C, GE Z, et al. STS: Surround-view temporal stereo for multi-view 3D detection[EB/OL]. (2022-08-22)[2023-12-22]. https://arxiv.org/abs/2208.10145.
20	HUANG J J, HUANG G, ZHU Z, et al. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view[EB/OL]. (2021-12-22)[2023-12-22]. https://arxiv.org/abs/2112.11790.
21	JIANG Y Q, ZHANG L, MIAO Z W, et al. PolarFormer: Multicamera 3D object detection with polar transformers[EB/OL]. (2022-06-30)[2023-12-24]. https://arxiv.org/abs/2206.15398.
22	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
23	MIECH A, LAPTEV I, SIVIC J. Learnable pooling with context gating for video classification[EB/OL]. (2017-06-21)[2023-12-25]. https://arxiv.org/abs/1706.06905.
24	CAO C S, LIU X M, YANG L, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks[C]//2015 IEEE International Conference on Computer Vision (ICCV). Santiago: IEEE, 2015: 2956-2964.
25	WANG F, JIANG M Q, QIAN C, et al. Residual attention network for image classification[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu: IEEE, 2017: 6450-6458.
26	NEWELL A, YANG K Y, DENG J. Stacked hourglass networks for human pose estimation[C]//Computer Vision-ECCV 2016. Cham: Springer, 2016: 483-499.
27	WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]//Computer Vision-ECCV 2018. Cham: Springer, 2018: 3-19.
28	ZHU X Z, SU W J, LU L W, et al. Deformable DETR: Deformable transformers for end-to-end object detection[C]//International Conference on Learning Representations 2021. Vienna: ICLR, 2021: 1-16.
29	CAESAR H, BANKITI V, LANG A H, et al. nuScenes: A multimodal dataset for autonomous driving[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle: IEEE, 2020: 11618-11628.
30	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE, 2016: 770-778.
31	LEE Y W, HWANG J W, LEE S, et al. An energy and GPU-computation efficient backbone network for real-time object detection[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Long Beach: IEEE, 2019: 752-760.
32	LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: Common objects in context[C]//Computer Vision-ECCV 2014. Cham: Springer, 2014: 740-755.
33	LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[C]//International Conference on Learning Representations 2019. New Orleans: ICLR, 2019: 1-8.
34	WANG T, ZHU X G, PANG J M, et al. FCOS3D: Fully convolutional one-stage monocular 3D object detection[C]//2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Montreal: IEEE, 2021: 913-922.
35	WANG T, ZHU X G, PANG J M, et al. Probabilistic and geometric depth: Detecting objects in perspective[C]//Proceedings of the 5th Conference on Robot Learning. New York: PMLR, 2022: 1475-1485.
36	WANG Y, GUIZILINI V C, ZHANG T Y, et al. DETR3D: 3D object detection from multi-view images via 3D-to-2D queries[C]//Proceedings of the 5th Conference on Robot Learning. New York: PMLR, 2022: 180-191.

Method	Modality	Image backbone	NDS	mAP	Translation error	Scale error	Orientation error	Velocity error	Attribute error
FCOS3D^[34]	Camera	ResNet-101	0.428	0.358	0.690	0.249	0.452	1.434	0.124
PGD^[35]	Camera	ResNet-101	0.448	0.386	0.626	0.245	0.451	1.509	0.127
BEVFormer^[2]	Camera	ResNet-101	0.535	0.445	0.631	0.257	0.405	0.435	0.143
DETR3D^[36]	Camera	VoVNet-99	0.479	0.412	0.641	0.255	0.394	0.845	0.133
BEVFormer^[2]	Camera	VoVNet-99	0.569	0.481	0.582	0.256	0.375	0.378	0.126
Ours	Camera	ResNet-101	0.585	0.517	0.635	0.231	0.356	0.382	0.128
Ours	Camera	VoVNet-99	0.604	0.552	0.653	0.235	0.352	0.336	0.116
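
The score column in the table above can be recomputed from the precision (mAP) column and the five error columns. The sketch below uses the standard nuScenes definition, NDS = (5 · mAP + Σ(1 − min(1, error))) / 10; the function name `nds` and its argument layout are illustrative choices, not part of the paper.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean true-positive errors.

    tp_errors is ordered as in the table: translation, scale, orientation,
    velocity, attribute. Each error is clipped at 1 before being turned into
    a score, so a very large error contributes zero rather than a negative value.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five errors: translation, scale, orientation, velocity, attribute")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# BEVFormer with ResNet-101, values copied from the table row
bevformer_r101 = nds(0.445, [0.631, 0.257, 0.405, 0.435, 0.143])
print(f"{bevformer_r101:.3f}")  # prints 0.535
```

The clipping matters for rows such as FCOS3D, whose velocity error of 1.434 exceeds 1 and therefore adds nothing to the score.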

Image backbone	Method	Epochs	NDS	mAP	Translation error	Scale error	Orientation error	Velocity error	Attribute error
ResNet-101^[30]	BEVFormer	20	0.535	0.445	0.631	0.257	0.405	0.435	0.143
SENet & ResNet-101	BEVFormer	20	0.565	0.517	0.635	0.231	0.356	0.382	0.128
VoVNet-99^[31]	BEVFormer	20	0.569	0.471	0.592	0.249	0.382	0.374	0.128
SENet & VoVNet-99	BEVFormer	20	0.598	0.532	0.642	0.253	0.364	0.335	0.118
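
The "SENet &" rows add squeeze-and-excitation channel attention [12] to the image backbone. The following is a minimal NumPy sketch of the generic SE operation on a single feature map, assuming a (C, H, W) layout and a reduction ratio `r`. The function name, shapes, and random weights are illustrative only, and the sketch does not show where the module is attached in the authors' network.

```python
import numpy as np


def se_channel_attention(feat, w1, b1, w2, b2):
    """Reweight the channels of a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); both would be
    learned in a real network and are random here.
    """
    squeezed = feat.mean(axis=(1, 2))                    # squeeze: global average pool, (C,)
    hidden = np.maximum(0.0, w1 @ squeezed + b1)         # excitation, first layer with ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))    # sigmoid gate per channel, (C,)
    return feat * gates[:, None, None], gates            # scale every channel by its gate


rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                                  # illustrative sizes
feat = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C // r, C)) * 0.1, np.zeros(C // r)
w2, b2 = rng.standard_normal((C, C // r)) * 0.1, np.zeros(C)

out, gates = se_channel_attention(feat, w1, b1, w2, b2)
print(out.shape, gates.round(2))
```

Each gate lies strictly between 0 and 1, so the block can only damp or preserve a channel relative to the others; it never changes the spatial layout of the feature map.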

Method	Image backbone	NDS	mAP	Translation error	Scale error	Orientation error	Velocity error	Attribute error
BEVFormer^[2]	ResNet-101	0.535	0.445	0.631	0.257	0.405	0.435	0.143
Improved-TSA & BEVFormer	ResNet-101	0.585	0.517	0.635	0.231	0.356	0.382	0.128
BEVFormer^[2]	VoVNet-99	0.569	0.481	0.582	0.256	0.375	0.378	0.126
Improved-TSA & BEVFormer	VoVNet-99	0.604	0.552	0.653	0.235	0.352	0.336	0.116
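
The Improved-TSA rows correspond to the temporal change described in the abstract: stacking history frames instead of inheriting a single fused BEV map from the previous step. The toy sketch below only illustrates why inherited fusion loses old information. It replaces attention with fixed blending weights, and `alpha`, the equal stacking weights, and both function names are assumptions rather than the authors' formulation.

```python
import numpy as np


def recurrent_fusion(frames, alpha=0.5):
    """Inherited fusion: blend each new frame into the previous fused result."""
    fused = frames[0]
    for cur in frames[1:]:
        fused = alpha * fused + (1.0 - alpha) * cur
    return fused


def stacked_fusion(frames, weights=None):
    """Stacked fusion: keep all K history frames and fuse them in one step."""
    stack = np.stack(frames)                              # (K, ...) history buffer
    if weights is None:
        weights = np.full(len(frames), 1.0 / len(frames)) # equal weights as a placeholder
    return np.tensordot(weights, stack, axes=1)           # weighted sum over the K frames


# One-hot "frames" make the fused vector read as the weight given to each frame.
frames = list(np.eye(4))                                  # frame 0 is the oldest
print(recurrent_fusion(frames))                           # oldest frames keep the least weight
print(stacked_fusion(frames))                             # every frame keeps the same weight
```

With `alpha = 0.5` the inherited scheme leaves the two oldest frames a weight of 0.125 each while the newest gets 0.5, whereas stacking gives the fusion step direct access to every frame in the buffer.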