Rajan Sharma
8 min read · Jul 24, 2023


Clothes-Changing Person Re-identification with RGB Modality Only

What is Person Re-identification?

Person Re-identification aims to retrieve a target person from surveillance videos captured at different locations and times.

But what happens if a person changes their clothes 🤔? Are we still able to re-identify them in that case? Well, that's what we are going to discuss in this blog. So keep reading 😊.

Introduction:

Person Re-identification aims to retrieve a target person from surveillance videos captured at different locations and times. Most existing works assume that pedestrians do not change their clothes within a short period of time. However, if we want to re-identify a pedestrian over a long period, the clothes-changing problem cannot be avoided. Besides, the clothes-changing problem also exists in some short-term real-world scenarios; for example, criminal suspects usually change their clothes to avoid being identified and tracked.

The key to addressing clothes-changing person re-identification is extracting clothes-irrelevant features, such as face, hairstyle, body shape, and gait. Most current works focus on modeling body shape from multi-modality information (e.g., silhouettes and sketches) but do not make full use of the clothes-irrelevant information already present in the original RGB images.

So in this blog we will see how the authors of this paper make full use of clothes-irrelevant features to tackle the clothes-changing person re-identification problem.

Method:

To better mine the clothes-irrelevant information in the RGB modality, the authors propose a Clothes-based Adversarial Loss (CAL). Specifically, they add a clothes classifier after the backbone of the re-identification model and define CAL as a multi-positive-class classification loss, where all clothes classes belonging to the same identity are mutually positive classes.

During training, minimizing CAL forces the backbone of the re-identification model to learn clothes-irrelevant features by penalizing the predictive power of the model w.r.t. different clothes of the same identity. With backpropagation, the learned feature maps highlight more clothes-irrelevant cues, e.g., hairstyle and body shape, compared with feature maps trained only with the identification loss.

As we can see, Figure 1(b) highlights only the face as a clothes-irrelevant feature, while Figure 1(c) highlights more clothes-irrelevant features, e.g., face, hairstyle, and body shape. The difference between Figure 1(b) and Figure 1(c) lies in the loss functions: the feature maps in Figure 1(b) are learned only with the identification loss, while those in Figure 1(c) are learned with the identification loss plus the proposed CAL.

Figure 2 shows the architecture of the proposed method. In this framework, g_θ denotes the backbone with parameters θ, and C_ID(·) denotes the identity classifier with parameters φ.

Given a sample x_i, its identity label is denoted y_i^ID and its clothes label y_i^C. Note that clothes classes are defined as fine-grained identity classes: all samples of the same identity are divided into different clothes classes according to their clothes. The total number of clothes classes is the sum of the numbers of suits over all persons. Annotating such clothes labels is easy, since they only need to be consistent among samples of the same person; different persons never share a clothes label, even if they wear the same clothes.
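To make this concrete, here is a tiny, illustrative Python sketch (not the authors' code) of how such global clothes labels could be constructed from per-person outfit annotations; the sample data below is made up:

```python
# Illustrative clothes-label construction. Each sample is annotated with a
# person id and a per-person outfit index.
samples = [
    ("person_0", 0), ("person_0", 1),                    # person_0 owns 2 suits
    ("person_1", 0), ("person_1", 1), ("person_1", 2),   # person_1 owns 3 suits
]

# Map each (person_id, outfit_index) pair to a global clothes class, so that
# different persons never share a clothes label even if the clothes look alike.
clothes_class = {}
for pid, outfit in samples:
    clothes_class.setdefault((pid, outfit), len(clothes_class))

print(len(clothes_class))  # 5 = sum of the numbers of suits over all persons
```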

Like existing re-id methods, an identification loss L_ID is defined as the cross entropy between the predicted identity C_ID(g_θ(x_i)) and the identity label y_i^ID, and the re-id model is trained by minimizing L_ID. Beyond the identity classifier and this widely used identification loss, the authors introduce two more losses: a clothes classification loss L_C to train an additional clothes classifier, and the proposed Clothes-based Adversarial Loss (CAL) L_CA, which forces the backbone to decouple clothes-irrelevant features. Training then alternates between two steps.
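Before walking through the two steps, here is a minimal PyTorch sketch of the pieces involved. The class and attribute names, the feature dimension, and the linear heads are assumptions layered on the ResNet-50 backbone mentioned in the implementation details:

```python
import torch
import torch.nn as nn
import torchvision

class CALReID(nn.Module):
    """Backbone g_theta with an identity head C_ID and a clothes head.
    A minimal sketch under assumed dimensions; not the authors' code."""
    def __init__(self, num_ids: int, num_clothes: int, feat_dim: int = 2048):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # g_theta
        self.id_classifier = nn.Linear(feat_dim, num_ids)             # C_ID
        self.clothes_classifier = nn.Linear(feat_dim, num_clothes)    # trained with L_C

    def forward(self, x):
        f = self.backbone(x).flatten(1)   # feature f_i
        return f, self.id_classifier(f), self.clothes_classifier(f)
```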

In the first step, optimize the clothes classifier by minimizing the clothes classification loss L_C (the cross entropy between the predicted clothes class and the clothes label y_i^C).

In the second step, fix the parameters of the clothes classifier and force the backbone to learn clothes-irrelevant features. To this end, we should penalize the predictive power of the re-identification model w.r.t. clothes. A naive idea is to define L_CA as the opposite of L_C, such that the trained clothes classifier cannot distinguish any kind of clothes in the training set; this resembles the widely used min-max optimization. However, since clothes classes are defined as fine-grained identity classes, penalizing the predictive power of the re-identification model w.r.t. all kinds of clothes would also reduce its predictive power w.r.t. identity, which is harmful to re-identification. What we actually want is for the trained clothes classifier to be unable to distinguish samples with the same identity but different clothes. So L_CA should be a multi-positive-class classification loss, where all clothes classes belonging to the same identity are mutually positive classes. For example, given a sample x_i, all clothes classes belonging to its identity class y_i^ID are defined as its positive clothes classes. Therefore, L_CA can be formulated as:

As you can see from the equation (shown as an image in the original post), the positive class with the same clothes and the positive classes with different clothes have equal weight, i.e., 1/K.
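Since the formula itself appears as an image, here is a minimal PyTorch sketch of eq. (4) as described: a multi-positive-class cross entropy where each of the K positive clothes classes gets weight 1/K. Any temperature or scaling factors the paper may use are omitted here:

```python
import torch
import torch.nn.functional as F

def cal_loss_eq4(clothes_logits: torch.Tensor, positive_mask: torch.Tensor) -> torch.Tensor:
    """Multi-positive-class classification loss as described for eq. (4).

    clothes_logits: (B, num_clothes) logits from the *fixed* clothes classifier.
    positive_mask:  (B, num_clothes) boolean mask, True for every clothes class
                    belonging to the sample's identity (its K positive classes).
    """
    log_probs = F.log_softmax(clothes_logits, dim=1)
    k = positive_mask.sum(dim=1, keepdim=True).clamp(min=1)  # K per sample
    weights = positive_mask.float() / k                      # equal weight 1/K
    return -(weights * log_probs).sum(dim=1).mean()
```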

In a long-term person re-id system, clothes-consistent re-id and clothes-changing re-id are equally important. When we maximize the dot product between f_i and the proxy of a positive class with different clothes, the accuracy of clothes-changing re-id improves, but the accuracy of clothes-consistent re-id may drop. To improve the clothes-changing ability of the model without heavily reducing clothes-consistent accuracy, eq. (4) above can be replaced by a reweighted version (shown as an image in the original post).
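The replacement equation is likewise an image in the original post, so its exact weights are not reproduced here. The sketch below only illustrates the stated idea of balancing the same-clothes positive class against the K−1 different-clothes positives instead of using a uniform 1/K; the split hyperparameter `w_same` is an assumption, not the paper's:

```python
def cal_loss_reweighted(clothes_logits, positive_mask, same_clothes_onehot,
                        w_same=0.5):
    """Illustrative reweighting of eq. (4). The paper's exact weighting is not
    reproduced in this post, so w_same is an assumed hyperparameter.

    same_clothes_onehot: (B, num_clothes) one-hot of each sample's own clothes
    class; the remaining positives share the identity but differ in clothes.
    """
    log_probs = F.log_softmax(clothes_logits, dim=1)
    same = same_clothes_onehot.float()
    diff = positive_mask.float() - same                   # K-1 different-clothes positives
    k_minus_1 = diff.sum(dim=1, keepdim=True).clamp(min=1)
    weights = w_same * same + (1.0 - w_same) * diff / k_minus_1
    return -(weights * log_probs).sum(dim=1).mean()
```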

Note that L_ID and L_CA have some affinity in learning clothes-irrelevant features: the objective of L_CA is to pull features of the same identity closer, which is similar to L_ID. However, when trained with L_ID alone, the model tends to fit easy samples (same clothes) in the early stage of optimization and only gradually learns to distinguish hard samples (same identity, different clothes), whereas L_CA forces the model to handle the hard samples directly. Minimizing L_CA from the start may therefore lead to a local optimum. Instead, L_CA is added to training after the first reduction of the learning rate in the experiments.
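Putting the two steps and this schedule together, a simplified training loop might look like the following. This is illustrative, not the authors' code: `loader` and the optimizers are assumed, the 50-epoch start point comes from the implementation details below, and the main `optimizer` is assumed to cover only the backbone and identity classifier so that the clothes head stays fixed in step 2:

```python
# Illustrative alternating optimization, reusing the sketches above.
CAL_START_EPOCH = 50  # L_CA joins after the first learning-rate reduction

for epoch in range(150):
    for x, y_id, y_clothes, positive_mask in loader:       # assumed dataloader
        f, id_logits, clothes_logits = model(x)

        # Step 1: optimize the clothes classifier with L_C (backbone detached).
        loss_c = F.cross_entropy(model.clothes_classifier(f.detach()), y_clothes)
        clothes_optimizer.zero_grad()
        loss_c.backward()
        clothes_optimizer.step()

        # Step 2: optimize backbone + identity classifier; clothes head fixed
        # because it is excluded from `optimizer`.
        loss = F.cross_entropy(id_logits, y_id)                        # L_ID
        if epoch >= CAL_START_EPOCH:
            loss = loss + cal_loss_eq4(clothes_logits, positive_mask)  # L_CA
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # see the implementation-details sketch below
```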

Dataset:

All existing publicly available video person re-id datasets (e.g., PRID, iLIDS-VID, MARS) do not involve clothes changes. Meanwhile, existing publicly available clothes-changing person re-id datasets (e.g., Real28 & VC-Clothes, LTCC, PRCC) only contain still images and no sequence data. However, clothes-changing video re-id is closer to real-world re-id scenarios, and the abundant appearance information and additional temporal information in video samples are helpful for clothes-changing re-id.

To provide a publicly available benchmark, the authors constructed a clothes-changing video person re-id (CCVID) dataset from the raw data of a gait recognition dataset, FVG. The FVG dataset contains 2,856 sequences from 226 identities, and each identity has 2 to 5 different suits of clothes.

Implementation Details:

They use ResNet-50 as the backbone of the re-id model. For the video-based dataset, i.e., CCVID, they use spatial max pooling and temporal average pooling to aggregate the output feature map of the backbone, and then use BatchNorm to normalize the video feature. Since different video samples have different frame lengths while the inputs during training must have equal length (and each frame should ideally be sampled with equal probability), for each original video they randomly sample 8 frames with a stride of 4 to form a video clip. Each input frame is resized to 256×128, and horizontal flipping is the only data augmentation. Due to GPU memory limits, the batch size is set to 32, with each batch containing 8 persons and 4 video clips per person. The model is trained with Adam for 150 epochs, and L_CA is added to training after the 50th epoch. The learning rate is initialized to 3.5e-4 and divided by 10 every 40 epochs.
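A compact sketch of these implementation pieces (illustrative; the handling of videos shorter than the sampled span is an assumption, and the backbone is assumed to emit per-frame feature maps of shape (B, T, C, H, W)):

```python
import random
import torch
import torch.nn as nn

def sample_clip(num_frames: int, clip_len: int = 8, stride: int = 4):
    """Randomly sample 8 frames with a stride of 4 from an original video.
    Clamping indices for short videos is an assumption, not the paper's rule."""
    span = (clip_len - 1) * stride + 1
    start = random.randint(0, max(num_frames - span, 0))
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

class VideoHead(nn.Module):
    """Spatial max pooling + temporal average pooling + BatchNorm."""
    def __init__(self, feat_dim: int = 2048):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)

    def forward(self, fmap):                 # fmap: (B, T, C, H, W)
        f = fmap.amax(dim=(-2, -1))          # spatial max pooling  -> (B, T, C)
        f = f.mean(dim=1)                    # temporal avg pooling -> (B, C)
        return self.bn(f)

# Adam with the stated schedule: lr 3.5e-4, divided by 10 every 40 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)
```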

Previous Works:

1. **Learning 3D Shape Feature for Texture-Insensitive Person Re-identification**:

- A ReID framework that extracts texture-insensitive 3D shape embeddings from a 2D image.

- 3D body reconstruction is used as an auxiliary task and regularization, called 3D Shape Learning (3DSL).

- The 3D-reconstruction-based regularization forces the ReID model to decouple 3D shape information from visual texture.

- The goal is to acquire discriminative 3D shape ReID features.

- To cope with the lack of 3D ground truths, an adversarial self-supervised projection (ASSP) model is proposed, which performs 3D reconstruction without ground truth.

2. **Fine-Grained Shape-Appearance Mutual Learning for Cloth-Changing Person Re-identification**:

- A two-stream framework is proposed that learns fine-grained, discriminative body-shape knowledge in a shape stream (using a discriminative body-shape contour mask).

- This knowledge is transferred to an appearance stream to complement the clothes-unrelated knowledge in the appearance features.

3. **Person Re-identification by Contour Sketch under Moderate Clothing Change**:

- Assumption: a person only changes their clothes moderately, as a first attempt at solving this problem.

- A person wears clothes of a similar thickness, so the shape of a person would not change significantly when the weather does not change substantially within a short period of time.

- Performs cross-clothes person re-id based on a contour sketch of the person image, to take advantage of the shape of the human body.

- To select more reliable and discriminative curve patterns on a body contour sketch, a learning-based spatial polar transformation layer in the deep neural network transforms contour sketch images so that reliable, discriminative CNN features can be extracted in a polar coordinate space.

- An angle-specific extractor is applied in the following layers to extract more fine-grained, discriminative angle-specific features.

- A multi-stream network is developed to aggregate multi-granularity features for better re-identification.

- Uses the PRCC dataset.

4. **Learning Robust Global Representations by Penalizing Local Predictive Power**:

- PAR (Patch-wise Adversarial Regularization) is a method for training robust convolutional networks by penalizing the predictive power of the local representations learned by earlier layers.

- The method consists of a patch-wise classifier applied at each spatial location of the low-level representation.

- It uses the reverse-gradient technique: global adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer, and negative cross entropy is used to regularize the early layers (a minimal sketch of a gradient reversal layer follows below).
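Since the gradient reversal layer is the core mechanism referenced above, here is a generic PyTorch sketch of it (the classic construction popularized by domain-adversarial training, not PAR's exact implementation):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity in the forward pass, negated
    (optionally scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient; None matches the lambd argument.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: feed early-layer features through grad_reverse before an auxiliary
# classifier, so minimizing its loss *penalizes* the predictive power of
# those local representations.
```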

Conclusions:

The authors propose the Clothes-based Adversarial Loss (CAL) for clothes-changing person re-identification. During training, CAL forces the backbone of the re-identification model to learn clothes-irrelevant features by penalizing its predictive power w.r.t. clothes. As a result, the learned backbone can better mine clothes-irrelevant information from the original RGB modality and is more robust to clothes changes.

References:

Clothes-Changing Person Re-identification with RGB Modality Only: https://arxiv.org/pdf/2204.06890.pdf
