A camera-based perception and detection system achieves target detection at lower cost and higher resolution. Target detection is performed using bird's-eye view (BEV) features generated from six monocular cameras. These BEV features encode the position and scale of objects, making them suitable for a variety of autonomous driving tasks. BEV detectors are typically built on top of deep pre-trained image backbones, but directly connecting the two does not effectively capture the correspondence between 2D and 3D features. To address this issue, channel attention is applied to re-weight the channels of the output feature map, and it is combined with a depth estimation module to emphasize the relationship between 2D and 3D features. Furthermore, a temporal aggregation fusion method is employed to mitigate the gradual loss of historical information in traditional fusion methods, ensuring that the model fully leverages past frames. Extensive experiments on the nuScenes dataset show that the model achieves a nuScenes Detection Score (NDS) of 0.604, a 0.035 improvement over the BEVFormer baseline, validating the effectiveness of the proposed approach.
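The channel-attention step described above can be sketched as a squeeze-and-excitation-style gating over a BEV feature map. This is a minimal NumPy illustration, not the authors' implementation: the bottleneck weights `w1`/`w2` are random stand-ins for parameters a trained network would learn, and the function name and `reduction` ratio are assumptions for illustration.

```python
import numpy as np

def channel_attention(feat, reduction=4, seed=0):
    """SE-style channel attention over a BEV feature map.

    feat: (C, H, W) array. Channels are squeezed by global average
    pooling, passed through a two-layer bottleneck with ReLU, and the
    resulting sigmoid gates re-weight each channel of the input.
    """
    c, h, w = feat.shape
    rng = np.random.default_rng(seed)
    # Hypothetical bottleneck weights; in practice these are learned.
    w1 = rng.standard_normal((c // reduction, c)) / np.sqrt(c)
    w2 = rng.standard_normal((c, c // reduction)) / np.sqrt(c // reduction)

    squeeze = feat.mean(axis=(1, 2))             # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)       # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # sigmoid gates in (0, 1)
    return feat * gate[:, None, None]            # channel-wise re-weighting

bev = np.ones((8, 4, 4))        # toy BEV feature map: 8 channels, 4x4 grid
out = channel_attention(bev)    # same shape, channels scaled by their gates
```

Each channel is scaled by a single scalar gate, so informative channels can be amplified and uninformative ones suppressed before the detection head consumes the BEV features.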