Paper: https://arxiv.org/abs/2304.08069
Code: https://github.com/lyuwenyu/RT-DETR

Main drawbacks of DETR

  • Slow training convergence
  • High computational cost
  • Queries that are hard to optimize

Improvements made by existing variants

Accelerating convergence

Deformable-DETR: speeds up training convergence on multi-scale features by making the attention mechanism more efficient.

DAB-DETR and DN-DETR: further improve performance by introducing an iterative refinement scheme and denoising training.

Group-DETR: introduces group-wise one-to-many assignment.

Reducing computational cost

Efficient DETR and Sparse DETR: reduce computational cost by cutting the number of encoder/decoder layers or the number of updated queries.

Lite DETR: improves encoder efficiency by reducing the update frequency of low-level features in an interleaved manner.

Optimizing query initialization

Conditional DETR and Anchor DETR: reduce the optimization difficulty of the queries.

NMS analysis

For NMS, the TensorRT efficientNMSPlugin is adopted; internally it relies on the EfficientNMSFilter, RadixSort, and EfficientNMS kernels.
https://github.com/NVIDIA/TensorRT/tree/release/8.6/plugin/efficientNMSPlugin

efficientNMSPlugin

Purpose: an efficient non-maximum suppression (NMS) plugin, typically used in object detection models to remove redundant bounding boxes.

How it works: the plugin quickly discards boxes with low scores or high overlap and keeps only the best detections, improving both the accuracy and the efficiency of the model's predictions.

EfficientNMSFilter

Purpose: a kernel that further refines and filters the bounding boxes, normally used together with efficientNMSPlugin.

How it works: it applies finer-grained filtering and processing to the candidate boxes before the core suppression step, so that the final output boxes are more accurate and better match practical requirements.

RadixSort

Purpose: an efficient sorting algorithm, commonly used when large amounts of data must be sorted quickly.

How it works: radix sort orders the numbers digit by digit, usually from the least significant digit to the most significant, using a stable auxiliary sort such as counting sort or bucket sort at each pass.
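
For intuition only, here is a minimal pure-Python sketch of least-significant-digit radix sort; it is illustrative and unrelated to the actual CUDA implementation inside the plugin:

def radix_sort(values, base=10):
    """LSD radix sort for non-negative integers: one stable bucket pass per digit,
    from the least significant digit to the most significant one."""
    if not values:
        return values
    exp = 1
    max_val = max(values)
    while max_val // exp > 0:
        buckets = [[] for _ in range(base)]
        for v in values:
            buckets[(v // exp) % base].append(v)   # stable: order inside a bucket is preserved
        values = [v for bucket in buckets for v in bucket]
        exp *= base
    return values

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))   # [2, 24, 45, 66, 75, 90, 170, 802]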

EfficientNMS

Purpose: performs the core non-maximum suppression operation.

Function: among the sorted boxes, removes those that overlap heavily with a higher-scoring box, keeping only the best detections.

How the pipeline works

Sorting (RadixSort): all candidate boxes are first sorted by score with the RadixSort kernel, so that the highest-scoring boxes are processed first.

Filtering (EfficientNMSFilter): the EfficientNMSFilter kernel performs an initial pass over the sorted boxes, removing those with low scores.

Non-maximum suppression (EfficientNMS): the EfficientNMS kernel then suppresses the remaining boxes that overlap heavily, keeping only the best ones.
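
To make this sort → filter → suppress flow concrete, here is a plain NumPy sketch of the same logic; it is only an illustration of the idea, not the actual TensorRT kernels, and the thresholds and xyxy box format are assumptions:

import numpy as np

def nms_pipeline(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    """Sketch of the sort -> filter -> suppress flow. boxes: [N, 4] xyxy, scores: [N]."""
    # 1) sort by score, descending (the RadixSort stage)
    order = np.argsort(-scores)
    boxes, scores = boxes[order], scores[order]
    # 2) drop low-score candidates (the EfficientNMSFilter stage)
    keep_mask = scores >= score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    # 3) greedy IoU suppression (the EfficientNMS stage)
    keep, suppressed = [], np.zeros(len(boxes), dtype=bool)
    for i in range(len(boxes)):
        if suppressed[i]:
            continue
        keep.append(i)
        # IoU of box i against all later (lower-scoring) boxes
        x1 = np.maximum(boxes[i, 0], boxes[i + 1:, 0])
        y1 = np.maximum(boxes[i, 1], boxes[i + 1:, 1])
        x2 = np.minimum(boxes[i, 2], boxes[i + 1:, 2])
        y2 = np.minimum(boxes[i, 3], boxes[i + 1:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[i + 1:, 2] - boxes[i + 1:, 0]) * (boxes[i + 1:, 3] - boxes[i + 1:, 1])
        iou = inter / (area_i + area_rest - inter + 1e-9)
        suppressed[i + 1:] |= iou > iou_thresh
    return boxes[keep], scores[keep]

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7], dtype=np.float32)
print(nms_pipeline(boxes, scores))   # keeps the first and the third box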

Model design

(Figure: RT-DETR overall architecture, image-1720727964612)
RT-DETR consists of a backbone, an efficient hybrid encoder, and a Transformer decoder with auxiliary prediction heads.

AIFI-CCFF design

(Figure: AIFI and CCFF structure, image-1720767080083)

AIFI module

Instead of running the Transformer encoder over all three feature scales as before, AIFI applies the encoder only to the highest-level feature.

CCFF module

The so-called CCFM is essentially still a PAFPN (from PANet, "Path Aggregation Network for Instance Segmentation"), and its Fusion module is a CSPNet-style block (CSPNet: "A New Backbone that can Enhance Learning Capability of CNN").
(Figure: CCFF fusion block, image-1720767471611)
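
To make the "CSP-style fusion block" concrete, here is a rough PyTorch sketch of a CSP-like block. It is an assumption about the general structure only, not the exact CSPRepLayer used in the official code:

import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Conv + BN + SiLU, the basic unit used below."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPFusionBlock(nn.Module):
    """CSP-style fusion: project the input into two branches, run a small conv
    stack on one branch, then merge the two branches with a 1x1 conv."""
    def __init__(self, c_in, c_out, num_blocks=3):
        super().__init__()
        c_hidden = c_out // 2
        self.branch_main = ConvBNAct(c_in, c_hidden, 1)
        self.branch_shortcut = ConvBNAct(c_in, c_hidden, 1)
        self.bottlenecks = nn.Sequential(
            *[ConvBNAct(c_hidden, c_hidden, 3) for _ in range(num_blocks)])
        self.merge = ConvBNAct(2 * c_hidden, c_out, 1)

    def forward(self, x):
        y_main = self.bottlenecks(self.branch_main(x))
        y_short = self.branch_shortcut(x)
        return self.merge(torch.cat([y_main, y_short], dim=1))

# e.g. fusing an upsampled high-level map concatenated with a low-level map (2 x 256 channels)
fused = CSPFusionBlock(c_in=512, c_out=256)(torch.randn(2, 512, 32, 32))
print(fused.shape)  # torch.Size([2, 256, 32, 32])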

DINO-R50

(Figure: image-1720765158467)

Single-scale encoder (SSE)

import torch
import torch.nn as nn

class SingleScaleTransformer(nn.Module):
    def __init__(self, in_channels, out_channels, num_heads, num_layers):
        super(SingleScaleTransformer, self).__init__()
        self.proj_layer = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.transformer_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=out_channels, nhead=num_heads),
            num_layers=num_layers
        )
        # learnable position encoding; its length must cover all H*W tokens (32*32=1024 in the example below)
        self.position_encoding = nn.Parameter(torch.randn(1, 4096, out_channels))

    def forward(self, feature):
        # feature: Single scale feature map
        projected_feature = self.proj_layer(feature).flatten(2).permute(0, 2, 1)
        # Add position encoding
        projected_feature += self.position_encoding[:, :projected_feature.size(1), :]
        # Apply Transformer Encoder
        transformed_feature = self.transformer_encoder(projected_feature.permute(1, 0, 2)).permute(1, 0, 2)
        return transformed_feature

# Example usage:
feature = torch.randn(2, 256, 32, 32)  # Single scale feature map
model = SingleScaleTransformer(in_channels=256, out_channels=256, num_heads=8, num_layers=6)
output = model(feature)
print(output.shape)

Multi-scale encoder (MSE)

import torch
import torch.nn as nn

class MultiScaleTransformer(nn.Module):
    def __init__(self, in_channels_list, out_channels, num_heads, num_layers):
        super(MultiScaleTransformer, self).__init__()
        self.proj_layers = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=1)
            for in_channels in in_channels_list
        ])
        self.transformer_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=out_channels, nhead=num_heads),
            num_layers=num_layers
        )
        # learnable position encoding; its length must cover the total token count across scales (1024+256+64=1344 in the example below)
        self.position_encoding = nn.Parameter(torch.randn(1, 4096, out_channels))

    def forward(self, features):
        # features: List of multi-scale feature maps
        proj_features = [proj_layer(f).flatten(2).permute(0, 2, 1) for proj_layer, f in zip(self.proj_layers, features)]
        concatenated_features = torch.cat(proj_features, dim=1)
        # Add position encoding
        concatenated_features += self.position_encoding[:, :concatenated_features.size(1), :]
        # Apply Transformer Encoder
        transformed_features = self.transformer_encoder(concatenated_features.permute(1, 0, 2)).permute(1, 0, 2)
        return transformed_features

# Example usage:
features = [torch.randn(2, 256, 32, 32), torch.randn(2, 512, 16, 16), torch.randn(2, 1024, 8, 8)]
model = MultiScaleTransformer(in_channels_list=[256, 512, 1024], out_channels=256, num_heads=8, num_layers=6)
output = model(features)
print(output.shape)

Experimental results

These ablations demonstrate two things.

  1. The Transformer encoder only needs to process the high-level feature: this greatly reduces computation and speeds up inference without hurting accuracy, and even improves it slightly.
  2. The interaction and fusion of multi-scale features can still be handled by the PAN-style network commonly used in CNN detectors, with only minor adjustments to the details.

Uncertainty-minimal Query Selection (IoU soft label)

Why it matters

DAB-DETR reinterprets the query vectors as anchors; DN-DETR addresses the slow convergence caused by the ambiguity of Hungarian matching through query denoising; DINO selects the top-k encoder features for learning. This whole line of work shows that the query vectors matter a great deal: choosing good queries gets us twice the result with half the effort.

Noising

Unstable Hungarian matching makes the coordinate offsets hard to learn, which in turn slows down model convergence.

In DN-DETR, noised versions of the GT are fed into the decoder, and the decoder learns to denoise them so that its predictions approach the real GT.

(Figure: image-1720800355896)
The noising step produces many groups, each group using different noise values; in RT-DETR every noised GT has both a positive and a negative sample.
(Figure: 7a6612b79dc95efcd46423d115116c51_720)
The noising operation perturbs the original GT targets (class labels and boxes); for example, with two GT categories the noising could look roughly like the sketch below.
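
A minimal sketch of what noising the GT labels and boxes might look like, loosely in the spirit of DN-DETR. The function name noise_gt and the exact jitter scheme are illustrative assumptions; the official get_contrastive_denoising_training_group additionally builds grouped positive/negative queries and attention masks:

import torch

def noise_gt(gt_labels, gt_boxes, num_classes, label_noise_ratio=0.5, box_noise_scale=1.0):
    """gt_labels: [N] int64 class ids; gt_boxes: [N, 4] normalized cxcywh.
    Returns one noised copy of the labels and boxes."""
    labels = gt_labels.clone()
    boxes = gt_boxes.clone()
    # label noising: randomly replace a fraction of the class labels
    flip = torch.rand(labels.shape) < label_noise_ratio * 0.5
    labels[flip] = torch.randint(0, num_classes, (int(flip.sum()),))
    # box noising: jitter the center and size proportionally to the box size
    wh = boxes[:, 2:].clone()
    jitter = (torch.rand_like(boxes) * 2 - 1) * box_noise_scale
    boxes[:, :2] += jitter[:, :2] * wh / 2     # shift the center by up to half of w/h
    boxes[:, 2:] *= 1 + jitter[:, 2:] * 0.5    # scale width/height by up to +-50%
    return labels, boxes.clamp(0, 1)

gt_labels = torch.tensor([3, 7])
gt_boxes = torch.tensor([[0.50, 0.50, 0.20, 0.30],
                         [0.30, 0.60, 0.10, 0.10]])
print(noise_gt(gt_labels, gt_boxes, num_classes=80))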

Improvement idea

Because the distributions of classification scores and IoU scores are not consistent, the box with the highest classification score is not necessarily the one closest to the GT. As a result, boxes with high classification scores but low IoU get selected, while boxes with low classification scores but high IoU get discarded, which hurts detector performance.

RT-DETR therefore constrains the detector during training to produce high classification scores for features with high IoU and low classification scores for features with low IoU. To this end, the authors propose IoU-aware query selection, so that the top-K features selected by classification score correspond to predictions that have both high classification scores and high IoU.

(Figure: image-1720791269343)

$$L(\hat{y}, y) = L_{box}(\hat{b}, b) + L_{cls}(\hat{c}, \hat{b}, y, b) = L_{box}(\hat{b}, b) + L_{cls}(\hat{c}, c, \mathrm{IoU})$$

$\hat{y}$ denotes the prediction (predicted box and class); $y$ is the GT.

$c$ is the class label.
The "IoU soft label" means using the IoU between the predicted box and the GT box as the target for the class prediction.
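
A minimal sketch of using the IoU as a soft classification target, assuming a plain BCE loss over one-to-one matched pairs; the official implementation uses a varifocal-style formulation, so this is only illustrative:

import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def iou_soft_label_loss(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: [N, num_classes]; pred_boxes / gt_boxes: [N, 4] xyxy, already matched
    one-to-one. The target for the GT class is the IoU of each matched prediction."""
    iou = torch.diag(box_iou(pred_boxes, gt_boxes))          # [N] matched-pair IoUs
    target = torch.zeros_like(pred_logits)
    target[torch.arange(len(gt_labels)), gt_labels] = iou    # IoU soft label at the GT class
    return F.binary_cross_entropy_with_logits(pred_logits, target)

pred_logits = torch.randn(2, 80)
pred_boxes = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.]])
gt_boxes = torch.tensor([[1., 1., 10., 10.], [5., 5., 14., 14.]])
print(iou_soft_label_loss(pred_logits, pred_boxes, torch.tensor([3, 7]), gt_boxes))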

Decoder

The decoder input consists of two parts that are concatenated: the noised GT queries, and the top-K queries picked by IoU-aware selection, which are used for the bipartite matching task. At inference time the noised-GT part is None.

(Figure: 0cfabdf9f67d5f2717f698ee53aa59fc_720)
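
Conceptually, assembling the decoder input could look like the following simplified sketch. Names such as build_decoder_queries are illustrative; the real _get_decoder_input also produces reference points, encoder-head predictions, and attention masks:

import torch

def build_decoder_queries(denoising_queries, encoder_memory, enc_output, enc_score_head, num_queries=300):
    """encoder_memory: [B, L, C] flattened multi-scale encoder features.
    Select the top-K tokens by classification score as the matching queries,
    then prepend the noised-GT (denoising) queries during training."""
    feats = enc_output(encoder_memory)                  # shared projection head, [B, L, C]
    scores = enc_score_head(feats).max(-1).values       # best class score per token, [B, L]
    topk_idx = scores.topk(num_queries, dim=1).indices  # [B, K]
    queries = torch.gather(
        feats, 1, topk_idx.unsqueeze(-1).expand(-1, -1, feats.shape[-1]))
    if denoising_queries is not None:                   # training only; None at inference
        queries = torch.cat([denoising_queries, queries], dim=1)
    return queries

# toy usage with hypothetical linear heads
enc_output = torch.nn.Linear(256, 256)
enc_score_head = torch.nn.Linear(256, 80)
memory = torch.randn(2, 1344, 256)
dn_queries = torch.randn(2, 200, 256)
print(build_decoder_queries(dn_queries, memory, enc_output, enc_score_head).shape)  # [2, 500, 256]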

Code walkthrough

https://blog.csdn.net/weixin_42479327/article/details/136037668

Encoder code

# https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/transformers/hybrid_encoder.py

class HybridEncoder(nn.Layer):
    __shared__ = ['depth_mult', 'act', 'trt', 'eval_size']
    __inject__ = ['encoder_layer']

    def __init__(self,
                 in_channels=[512, 1024, 2048],
                 feat_strides=[8, 16, 32],
                 hidden_dim=256,
                 use_encoder_idx=[2],
                 num_encoder_layers=1,
                 encoder_layer='TransformerLayer',
                 pe_temperature=10000,
                 expansion=1.0,
                 depth_mult=1.0,
                 act='silu',
                 trt=False,
                 eval_size=None):
        super(HybridEncoder, self).__init__()
        self.in_channels = in_channels
        self.feat_strides = feat_strides
        self.hidden_dim = hidden_dim
        self.use_encoder_idx = use_encoder_idx
        self.num_encoder_layers = num_encoder_layers
        self.pe_temperature = pe_temperature
        self.eval_size = eval_size

        # channel projection
        self.input_proj = nn.LayerList()
        for in_channel in in_channels:
            self.input_proj.append(
                nn.Sequential(
                    nn.Conv2D(in_channel, hidden_dim, kernel_size=1, bias_attr=False),
                    nn.BatchNorm2D(
                        hidden_dim,
                        weight_attr=ParamAttr(regularizer=L2Decay(0.0)),
                        bias_attr=ParamAttr(regularizer=L2Decay(0.0)))))
        # encoder transformer
        self.encoder = nn.LayerList([
            TransformerEncoder(encoder_layer, num_encoder_layers)
            for _ in range(len(use_encoder_idx))
        ])

        act = get_act_fn(
            act, trt=trt) if act is None or isinstance(act, (str, dict)) else act
        # top-down fpn
        self.lateral_convs = nn.LayerList()
        self.fpn_blocks = nn.LayerList()
        for idx in range(len(in_channels) - 1, 0, -1):
            self.lateral_convs.append(
                BaseConv(hidden_dim, hidden_dim, 1, 1, act=act))
            self.fpn_blocks.append(
                CSPRepLayer(
                    hidden_dim * 2,
                    hidden_dim,
                    round(3 * depth_mult),
                    act=act,
                    expansion=expansion))

        # bottom-up pan
        self.downsample_convs = nn.LayerList()
        self.pan_blocks = nn.LayerList()
        for idx in range(len(in_channels) - 1):
            self.downsample_convs.append(
                BaseConv(hidden_dim, hidden_dim, 3, stride=2, act=act))
            self.pan_blocks.append(
                CSPRepLayer(
                    hidden_dim * 2,
                    hidden_dim,
                    round(3 * depth_mult),
                    act=act,
                    expansion=expansion))

    def forward(self, feats, for_mot=False):
        assert len(feats) == len(self.in_channels)
        # get projection features
        proj_feats = [self.input_proj[i](feat) for i, feat in enumerate(feats)]
        # encoder
        if self.num_encoder_layers > 0:
            for i, enc_ind in enumerate(self.use_encoder_idx):
                h, w = proj_feats[enc_ind].shape[2:]
                # flatten [B, C, H, W] to [B, HxW, C]
                src_flatten = proj_feats[enc_ind].flatten(2).transpose(
                    [0, 2, 1])
                if self.training or self.eval_size is None:
                    pos_embed = self.build_2d_sincos_position_embedding(
                        w, h, self.hidden_dim, self.pe_temperature)
                else:
                    pos_embed = getattr(self, f'pos_embed{enc_ind}', None)
                memory = self.encoder[i](src_flatten, pos_embed=pos_embed)
                proj_feats[enc_ind] = memory.transpose([0, 2, 1]).reshape(
                    [-1, self.hidden_dim, h, w])

        # top-down fpn
        inner_outs = [proj_feats[-1]]
        for idx in range(len(self.in_channels) - 1, 0, -1):
            feat_heigh = inner_outs[0]
            feat_low = proj_feats[idx - 1]
            feat_heigh = self.lateral_convs[len(self.in_channels) - 1 - idx](
                feat_heigh)
            inner_outs[0] = feat_heigh

            upsample_feat = F.interpolate(
                feat_heigh, scale_factor=2., mode="nearest")
            inner_out = self.fpn_blocks[len(self.in_channels) - 1 - idx](
                paddle.concat(
                    [upsample_feat, feat_low], axis=1))
            inner_outs.insert(0, inner_out)

        # bottom-up pan
        outs = [inner_outs[0]]
        for idx in range(len(self.in_channels) - 1):
            feat_low = outs[-1]
            feat_height = inner_outs[idx + 1]
            downsample_feat = self.downsample_convs[idx](feat_low)
            out = self.pan_blocks[idx](paddle.concat(
                [downsample_feat, feat_height], axis=1))
            outs.append(out)

        return outs

Decoder code

The decoder follows the same overall structure as the DETR decoder.

class RTDETRTransformer(nn.Layer):
    __shared__ = ['num_classes', 'hidden_dim', 'eval_size']

    def __init__(self,
                 num_classes=80,
                 hidden_dim=256,
                 num_queries=300,
                 position_embed_type='sine',
                 backbone_feat_channels=[512, 1024, 2048],
                 feat_strides=[8, 16, 32],
                 num_levels=3,
                 num_decoder_points=4,
                 nhead=8,
                 num_decoder_layers=6,
                 dim_feedforward=1024,
                 dropout=0.,
                 activation="relu",
                 num_denoising=100,
                 label_noise_ratio=0.5,
                 box_noise_scale=1.0,
                 learnt_init_query=True,
                 query_pos_head_inv_sig=False,
                 eval_size=None,
                 eval_idx=-1,
                 eps=1e-2):
        super(RTDETRTransformer, self).__init__()
        assert position_embed_type in ['sine', 'learned'], \
            f'ValueError: position_embed_type not supported {position_embed_type}!'
        assert len(backbone_feat_channels) <= num_levels
        assert len(feat_strides) == len(backbone_feat_channels)
        for _ in range(num_levels - len(feat_strides)):
            feat_strides.append(feat_strides[-1] * 2)

        self.hidden_dim = hidden_dim
        self.nhead = nhead
        self.feat_strides = feat_strides
        self.num_levels = num_levels
        self.num_classes = num_classes
        self.num_queries = num_queries
        self.eps = eps
        self.num_decoder_layers = num_decoder_layers
        self.eval_size = eval_size

        # backbone feature projection
        self._build_input_proj_layer(backbone_feat_channels)

        # Transformer module
        decoder_layer = TransformerDecoderLayer(
            hidden_dim, nhead, dim_feedforward, dropout, activation, num_levels,
            num_decoder_points)
        self.decoder = TransformerDecoder(hidden_dim, decoder_layer,
                                          num_decoder_layers, eval_idx)

        # denoising part
        self.denoising_class_embed = nn.Embedding(
            num_classes,
            hidden_dim,
            weight_attr=ParamAttr(initializer=nn.initializer.Normal()))
        self.num_denoising = num_denoising
        self.label_noise_ratio = label_noise_ratio
        self.box_noise_scale = box_noise_scale

        # decoder embedding
        self.learnt_init_query = learnt_init_query
        if learnt_init_query:
            self.tgt_embed = nn.Embedding(num_queries, hidden_dim)
        self.query_pos_head = MLP(4, 2 * hidden_dim, hidden_dim, num_layers=2)
        self.query_pos_head_inv_sig = query_pos_head_inv_sig

        # encoder head
        self.enc_output = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(
                hidden_dim,
                weight_attr=ParamAttr(regularizer=L2Decay(0.0)),
                bias_attr=ParamAttr(regularizer=L2Decay(0.0))))
        self.enc_score_head = nn.Linear(hidden_dim, num_classes)
        self.enc_bbox_head = MLP(hidden_dim, hidden_dim, 4, num_layers=3)

        # decoder head
        self.dec_score_head = nn.LayerList([
            nn.Linear(hidden_dim, num_classes)
            for _ in range(num_decoder_layers)
        ])
        self.dec_bbox_head = nn.LayerList([
            MLP(hidden_dim, hidden_dim, 4, num_layers=3)
            for _ in range(num_decoder_layers)
        ])

        self._reset_parameters()

    def forward(self, feats, pad_mask=None, gt_meta=None, is_teacher=False):
        # input projection and embedding
        (memory, spatial_shapes,
         level_start_index) = self._get_encoder_input(feats)

        # prepare denoising training
        if self.training:
            denoising_class, denoising_bbox_unact, attn_mask, dn_meta = \
                get_contrastive_denoising_training_group(gt_meta,
                                            self.num_classes,
                                            self.num_queries,
                                            self.denoising_class_embed.weight,
                                            self.num_denoising,
                                            self.label_noise_ratio,
                                            self.box_noise_scale)
        else:
            denoising_class, denoising_bbox_unact, attn_mask, dn_meta = None, None, None, None

        target, init_ref_points_unact, enc_topk_bboxes, enc_topk_logits = \
            self._get_decoder_input(
            memory, spatial_shapes, denoising_class, denoising_bbox_unact,is_teacher)

        # decoder
        out_bboxes, out_logits = self.decoder(
            target,
            init_ref_points_unact,
            memory,
            spatial_shapes,
            level_start_index,
            self.dec_bbox_head,
            self.dec_score_head,
            self.query_pos_head,
            attn_mask=attn_mask,
            memory_mask=None,
            query_pos_head_inv_sig=self.query_pos_head_inv_sig)
        return (out_bboxes, out_logits, enc_topk_bboxes, enc_topk_logits,
                dn_meta)

Q.E.D.

