PhysFlow: Skin tone transfer for remote heart rate estimation through conditional normalizing flows

BMVC 2024

Joaquim Comas1, Antonia Alomar1, Adrià Ruiz2, Federico Sukno1
1Department of Information and Communication Technologies, Pompeu Fabra University, Barcelona, Spain; 2Seedtag, Madrid, Spain

Example of skin tone augmentation in the UCLA-rPPG dataset. The generated video preserves the same temporal consistency and pulsatile information as the original video while changing the facial skin tone.

Abstract

In recent years, deep learning methods have shown impressive results for camera-based remote physiological signal estimation, clearly surpassing traditional methods. However, the performance and generalization ability of Deep Neural Networks heavily depend on rich training data truly representing the different factors of variation encountered in real applications. Unfortunately, many current remote photoplethysmography (rPPG) datasets lack diversity, particularly in darker skin tones, leading to biased performance of existing rPPG approaches. To mitigate this bias, we introduce PhysFlow, a novel method for augmenting skin diversity in remote heart rate estimation using conditional normalizing flows. PhysFlow adopts end-to-end training optimization, enabling simultaneous training of supervised rPPG approaches on both original and generated data. Additionally, we condition our model using CIELAB color space skin features directly extracted from the facial videos, without the need for skin-tone labels. We validate PhysFlow on publicly available datasets, UCLA-rPPG and MMPD, demonstrating reduced heart rate error, particularly in dark skin tones. Furthermore, we demonstrate its versatility and adaptability across different data-driven rPPG methods.

Skin tone representation

Most previous studies have used the Fitzpatrick scale to evaluate or categorize skin tone, dividing it into six levels, from I (lightest) to VI (darkest). In contrast, our skin tone transfer method employs a two-dimensional representation in the CIELAB color space, which offers three key advantages. First, it eliminates the need for manual annotations, allowing it to be applied to unlabeled data. Second, it simplifies the collection and annotation process for new rPPG datasets. Finally, it accounts for variations in hue as well as lightness, providing a more nuanced representation of skin tone.
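As a rough illustration of how such label-free skin features could be obtained, the sketch below converts sRGB skin pixels to CIELAB and summarizes them by mean lightness and hue angle. The function names and the exact normalization are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1] to CIELAB (D65 white point)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Linearize sRGB (inverse gamma)
    linear = np.where(rgb <= 0.04045, rgb / 12.92,
                      ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ (standard sRGB/D65 matrix)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = linear @ M.T
    # Normalize by the D65 reference white
    xyz = xyz / np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > (6 / 29) ** 3, np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def skin_tone_features(skin_pixels):
    """Mean lightness (normalized to [0, 1]) and hue angle in degrees
    computed from a set of segmented skin RGB pixels."""
    lab = srgb_to_lab(skin_pixels)
    luminance = lab[:, 0].mean() / 100.0
    hue = np.degrees(np.arctan2(lab[:, 2].mean(), lab[:, 1].mean()))
    return luminance, hue
```

In practice, the pixels would come from a facial skin segmentation mask; here any (N, 3) array of sRGB values in [0, 1] will do.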

Framework

A 3D-CNN autoencoder (AE) encodes the entangled facial video content into a latent embedding. This embedding is then processed by conditional normalizing flows (c-CNFs) to disentangle the skin tone content. Simultaneously, the rPPG model is iteratively trained on both original and skin tone-augmented data.
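To illustrate the conditioning mechanism, the toy NumPy sketch below implements a single conditional affine coupling layer (RealNVP-style), where a scalar skin-tone condition modulates half of a latent vector and the transform stays exactly invertible. This is a hedged sketch under our own assumptions, not the paper's c-CNF architecture, which operates on 3D-CNN video embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conditioner: a tiny random MLP mapping (z_a, skin condition c)
# to a per-dimension log-scale and shift for the other half of z.
W1 = rng.normal(0.0, 0.1, (3, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.1, (8, 4)); b2 = np.zeros(4)

def conditioner(z_a, c):
    h = np.tanh(np.concatenate([z_a, [c]]) @ W1 + b1)
    out = h @ W2 + b2
    log_s, t = out[:2], out[2:]
    return log_s, t

def coupling_forward(z, c):
    """Transform the second half of z given the first half and c.
    Returns the transformed latent and the log-det of the Jacobian."""
    z_a, z_b = z[:2], z[2:]
    log_s, t = conditioner(z_a, c)
    return np.concatenate([z_a, z_b * np.exp(log_s) + t]), log_s.sum()

def coupling_inverse(y, c):
    """Exact inverse: recover z from the transformed latent y."""
    y_a, y_b = y[:2], y[2:]
    log_s, t = conditioner(y_a, c)
    return np.concatenate([y_a, (y_b - t) * np.exp(-log_s)])
```

Because the conditioner only ever sees the untransformed half, inverting the layer needs no iterative solve: changing the condition `c` (e.g. a target luminance) re-steers the transformed half while the mapping remains bijective.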

Qualitative Results

Visual skin tone diversity data augmentation. We show how PhysFlow is capable of transferring skin tone while preserving the pulsatile wave and the corresponding heart rate in the frequency spectrum in both the source and the augmented video.
PhysFlow skin tone diversity data augmentation on the same subject, varying the luminance target value from dark to light skin (0.25 to 0.65).

Quantitative Results

Our cross-dataset experiments on the MMPD dataset using three different data-driven models demonstrate the capability of PhysFlow for skin tone diversity augmentation in any supervised rPPG method. Our approach significantly reduces heart rate estimation error, particularly in underrepresented skin tone categories, favoring equitable performance across different skin tones.

PhysFlow cross-evaluation on dark skin types of the MMPD dataset (errors in beats per minute).

Skin tone disentanglement

t-SNE visualization of the conditioned latent space for different skin tone variations using the luminance component. To demonstrate the ability of the c-CNF to disentangle skin tone information from other content, we select six facial video sequences and condition them to specific skin tones. We adjust the luminance term while maintaining a constant hue value, as luminance is the most relevant factor for skin type determination. For visualization purposes, we vary the luminance value from 0.20 to 0.50, covering a range from darker to lighter skin tones. On the left side, the latent space visualization depicts each cluster of points representing the facial video sequence of one of the six selected subjects. In this visualization, we can observe how PhysFlow disentangles the skin tone information along the horizontal dimension (dimension 1) after conditioning each facial video to a specific luminance value.

BibTeX

@article{comas2024physflow,
  title={PhysFlow: Skin tone transfer for remote heart rate estimation through conditional normalizing flows},
  author={Comas, Joaquim and Alomar, Antonia and Ruiz, Adria and Sukno, Federico},
  journal={arXiv preprint arXiv:2407.21519},
  year={2024}
}

Acknowledgements

This work is partly supported by the eSCANFace project (PID2020-114083GB-I00) funded by the Spanish Ministry of Science and Innovation.