CBNet: A Composite Backbone Network Architecture for Object Detection

TIP

Abstract

Figure: Illustration of the proposed Composite Backbone Network (CBNet) architecture for object detection.

Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNet, to construct high-performance detectors from existing open-source pre-trained backbones under the pre-training and fine-tuning paradigm. In particular, the CBNet architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple identical backbone networks and gradually expands the receptive field to perform object detection more effectively. We also propose a better training strategy with auxiliary supervision for CBNet-based detectors. CBNet has strong generalization capabilities across different backbones and detector head designs. Without additional pre-training of the composite backbone, CBNet can be adapted to various backbones (i.e., CNN-based and Transformer-based) and to the head designs of most mainstream detectors (i.e., one-stage and two-stage, anchor-based and anchor-free). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNet offers a more efficient, effective, and resource-friendly way to build high-performance backbone networks. In particular, our CB-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model, single-scale testing protocol, which is significantly better than the state-of-the-art results (i.e., 57.7% box AP and 50.2% mask AP) achieved by Swin-L, while reducing the training time by 6x. With multi-scale testing, we push the current best single-model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data.
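To make the grouping idea concrete, below is a minimal sketch of a K=2 composite backbone in PyTorch. It only illustrates the general pattern described above (an assisting backbone whose stage-wise features are fused into the lead backbone through composite connections); the stage modules, channel widths, and the simple same-level 1x1-conv fusion are illustrative placeholders and do not reproduce the paper's exact composite connections or auxiliary supervision.

```python
# Minimal sketch of a K=2 composite backbone (CBNet-style), assuming PyTorch.
# Stage modules, channel widths, and the same-level 1x1-conv fusion are
# hypothetical placeholders, not the paper's exact layers.
import torch
import torch.nn as nn


def make_stage(in_ch, out_ch):
    """A hypothetical backbone stage: downsample by 2 and change channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class CompositeBackbone(nn.Module):
    """Two identical backbones; the assisting backbone's stage outputs are
    fused into the lead backbone's stages through composite connections."""

    def __init__(self, channels=(3, 64, 128, 256, 512)):
        super().__init__()
        def stages():
            return nn.ModuleList(
                make_stage(channels[i], channels[i + 1])
                for i in range(len(channels) - 1)
            )
        self.assist = stages()  # assisting backbone
        self.lead = stages()    # lead backbone (feeds the detection neck/head)
        # Composite connections: 1x1 convs projecting assisting features
        # before they are added to the lead backbone's stage outputs.
        self.compose = nn.ModuleList(
            nn.Conv2d(channels[i + 1], channels[i + 1], kernel_size=1)
            for i in range(len(channels) - 1)
        )

    def forward(self, x):
        # Run the assisting backbone and keep its per-stage outputs.
        assist_feats, h = [], x
        for stage in self.assist:
            h = stage(h)
            assist_feats.append(h)
        # Run the lead backbone, fusing in the composited assisting features.
        lead_feats, h = [], x
        for stage, feat, proj in zip(self.lead, assist_feats, self.compose):
            h = stage(h) + proj(feat)  # same spatial size and channel count
            lead_feats.append(h)
        return lead_feats  # multi-scale features for an FPN-style neck


if __name__ == "__main__":
    feats = CompositeBackbone()(torch.randn(1, 3, 224, 224))
    print([f.shape for f in feats])
```

Because both backbones are identical, they can share the same open-source pre-trained weights, which is why no additional pre-training of the composite backbone is required before fine-tuning the detector.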

Results

Table: Performance comparison of CBNet with different numbers of composite backbones (K) and pruning strategies.
Table: Comparison with state-of-the-art object detection and instance segmentation results on COCO. Combined with the Swin Transformer, our CBNet achieves state-of-the-art box AP and mask AP while using fewer training epochs.

Further Information

For more details, please check out our paper and code. We are happy to receive your feedback!

@article{DBLP:journals/tip/LiangCLWTCCL22,
  author    = {Ting{-}Ting Liang and
               Xiaojie Chu and
               Yudong Liu and
               Yongtao Wang and
               Zhi Tang and
               Wei Chu and
               Jingdong Chen and
               Haibin Ling},
  title     = {CBNet: {A} Composite Backbone Network Architecture for Object Detection},
  journal   = {IEEE Trans. Image Process.},
  volume    = {31},
  pages     = {6893--6906},
  year      = {2022},
  url       = {https://doi.org/10.1109/TIP.2022.3216771},
  doi       = {10.1109/TIP.2022.3216771},
  timestamp = {Mon, 05 Dec 2022 13:33:25 +0100},
  biburl    = {https://dblp.org/rec/journals/tip/LiangCLWTCCL22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}