Remote photoplethysmography (rPPG) captures cardiac signals from facial videos and is gaining attention for its diverse applications. While deep learning has advanced rPPG estimation, it relies on large, diverse datasets for effective generalization. In contrast, handcrafted methods exploit physiological priors, generalizing better to unseen scenarios such as motion while remaining computationally efficient. However, their linear assumptions limit performance in complex conditions, where deep learning excels at extracting pulsatile information. This motivates hybrid approaches that combine the strengths of both. To this end, we present BeatFormer, a lightweight spectral attention model for rPPG estimation that integrates zoomed orthonormal complex attention with frequency-domain energy measurement, yielding a highly efficient model. Additionally, we introduce Spectral Contrastive Learning (SCL), which allows BeatFormer to be trained without any PPG or HR labels. We validate BeatFormer on the PURE, UBFC-rPPG, and MMPD datasets, demonstrating its robustness and performance, particularly in cross-dataset evaluations under motion scenarios.
First, RGB traces are segmented with overlap and transformed into the frequency domain using the Chirp-Z Transform (CZT). The zoomed spectrum is then processed by BeatFormer to separate pulsatile information from distortions, incorporating orthonormal regularization and energy-based weighting. The filtered frequency features are converted back to the temporal domain using the Inverse Chirp-Z Transform (ICZT), followed by an overlap-add operation to reconstruct the rPPG signal. To train BeatFormer, spectral contrastive learning (SCL) is applied, leveraging meaningful frequency-domain transformations to enforce explicit priors, enabling training without PPG or HR labels.
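The spectral zoom step can be sketched with SciPy's Chirp-Z Transform, which evaluates the spectrum only over the heart-rate band instead of the full FFT range. The frame rate, band edges (0.7–3 Hz, i.e., 42–180 bpm), and bin count below are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from scipy.signal import czt

fs = 30.0                      # video frame rate (illustrative)
t = np.arange(0, 10, 1 / fs)   # 10 s RGB trace
rng = np.random.default_rng(0)
hr_hz = 1.2                    # 72 bpm pulse component
x = np.sin(2 * np.pi * hr_hz * t) + 0.1 * rng.standard_normal(t.size)

# Zoom the spectrum onto the HR band [f1, f2] with m bins:
# the CZT evaluates z_k = a * w^{-k}, walking the unit circle from f1 to f2.
f1, f2, m = 0.7, 3.0, 256
w = np.exp(-2j * np.pi * (f2 - f1) / (m * fs))
a = np.exp(2j * np.pi * f1 / fs)
X = czt(x, m=m, w=w, a=a)

freqs = f1 + np.arange(m) * (f2 - f1) / m
peak_hz = freqs[np.argmax(np.abs(X))]   # dominant frequency within the band
```

Compared with zero-padding an FFT, the CZT concentrates all `m` bins inside the band of interest, giving a fine frequency grid over plausible heart rates at low cost.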
Spectral contrastive learning (SCL) leverages meaningful frequency-domain video transformations, based on the following assumptions:
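As a generic illustration of a frequency-domain transformation of the kind such objectives can exploit (a sketch, not necessarily the exact transformations used by SCL): temporally resampling a trace at c× speed scales its dominant frequency, and hence the implied HR, by c, which yields views with a known spectral relationship:

```python
import numpy as np

def speed_resample(x, factor):
    """Resample a trace as if played at `factor`x speed (linear interpolation).

    A pulse at f Hz in `x` appears at factor * f Hz in the output, so
    frequency-scaled views have a known spectral relationship that a
    spectral contrastive objective can enforce.
    """
    n = int(x.size / factor)
    src = np.arange(n) * factor          # fractional source indices
    return np.interp(src, np.arange(x.size), x)

fs = 30.0
t = np.arange(0, 20, 1 / fs)
x = np.sin(2 * np.pi * 1.2 * t)          # 72 bpm pulse

y = speed_resample(x, 1.5)               # implied HR scaled to 108 bpm
spec = np.abs(np.fft.rfft(y))
f = np.fft.rfftfreq(y.size, 1 / fs)
peak_hz = f[np.argmax(spec)]             # ~1.8 Hz
```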
Across all MMPD motion splits, both versions of BeatFormer (supervised and spectral contrastive learning) achieve significantly lower errors than the current state of the art, particularly deep learning approaches, highlighting stronger motion robustness.
Mean absolute error comparison on MMPD motion scenarios (beats per minute).
Inference examples for cross-dataset MMPD subjects from models trained on the PURE dataset. BeatFormer-SL (yellow), BeatFormer-SCL (magenta), and the PPG ground truth (black).
@article{comas2025beatformer,
title={BeatFormer: Efficient motion-robust remote heart rate estimation through unsupervised spectral zoomed attention filters},
author={Comas, Joaquim and Sukno, Federico},
journal={ICCVW},
year={2025}
}