FESNet: Spotting Facial Expressions Using Local Spatial Discrepancy and Multi-Scale Temporal Aggregation

Bohao Zhang

Chongqing Key Laboratory of Precision Optics Chongqing Institute of East China Normal University Chongqing 401120, China & School of Computer Science and Technology East China Normal University Shanghai 200333, China
Jiale Lu

Chongqing Key Laboratory of Precision Optics Chongqing Institute of East China Normal University Chongqing 401120, China & School of Computer Science and Technology East China Normal University Shanghai 200333, China
Changbo Wang

Chongqing Key Laboratory of Precision Optics Chongqing Institute of East China Normal University Chongqing 401120, China & School of Computer Science and Technology East China Normal University Shanghai 200333, China
Gaoqi He

Chongqing Key Laboratory of Precision Optics Chongqing Institute of East China Normal University Chongqing 401120, China & School of Computer Science and Technology East China Normal University Shanghai 200333, China

keywords: Facial expression analysis, micro-expression spotting, video understanding, convolutional neural networks

Facial expression (FE) spotting aims to split long videos into intervals of neutral expression, macro-expression, or micro-expression. Recent works mainly rely on feature descriptors or optical flow, and they struggle both to capture subtle facial motion and to aggregate temporal information efficiently. This paper proposes a novel end-to-end network, named FESNet (Facial Expression Spotting Network), to address these challenges. The main idea is to model subtle facial motion as local spatial discrepancy and to incorporate temporal correlation through multi-scale temporal convolution. FESNet comprises a local spatial discrepancy module (LSDM) and a multi-scale temporal aggregation module (MTAM). The LSDM first extracts static spatial features from each frame by residual convolution and learns the inner spatial correlation by multi-head attention. It then models the subtle facial motion of an expression as the discrepancy between the first frame and the current frame of the input interval, and produces frame-wise spatial proposals. Taking the local spatial discrepancy features and proposals as input, the MTAM incorporates temporal correlation by multi-scale temporal convolution and performs cascade refinement to make the final prediction. Furthermore, this paper proposes a smooth loss to ensure the temporal consistency of the cascade-refined proposals from the MTAM. Comprehensive experiments show that FESNet achieves competitive performance compared to state-of-the-art methods.
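The three ideas named in the abstract (discrepancy against the first frame, multi-scale temporal aggregation, and a temporal smoothness penalty) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the fixed moving-average kernels standing in for learned temporal convolutions, the kernel sizes, and the squared-difference form of the smoothness term are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def local_spatial_discrepancy(features):
    """Subtract the first frame's features from every frame.

    features: array of shape (T, D), one D-dimensional feature per frame.
    Hypothetical stand-in for the LSDM discrepancy step.
    """
    return features - features[0:1]


def multi_scale_temporal_aggregation(x, kernel_sizes=(3, 5, 7)):
    """Average several temporal smoothings of x, one per kernel size.

    x: array of shape (T, D). Fixed moving-average kernels are an
    assumption here; the paper uses learned temporal convolutions.
    """
    outputs = []
    for k in kernel_sizes:
        pad = k // 2
        # Edge padding keeps the output length equal to T for odd k.
        padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
        kernel = np.ones(k) / k
        smoothed = np.stack(
            [np.convolve(padded[:, d], kernel, mode="valid")
             for d in range(x.shape[1])],
            axis=1,
        )
        outputs.append(smoothed)
    return np.mean(outputs, axis=0)


def smooth_loss(probs):
    """Mean squared difference between predictions of adjacent frames.

    probs: array of shape (T, C) of frame-wise class scores.
    The exact form of the paper's smooth loss is assumed, not quoted.
    """
    diffs = probs[1:] - probs[:-1]
    return float(np.mean(diffs ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_features = rng.normal(size=(16, 8))   # 16 frames, 8-dim features
    discrepancy = local_spatial_discrepancy(frame_features)
    aggregated = multi_scale_temporal_aggregation(discrepancy)
    print(aggregated.shape, smooth_loss(aggregated))
```

In this sketch the first frame of the interval acts as the reference, so its discrepancy is zero by construction, and the smoothness term is zero only when adjacent frames receive identical predictions.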

reference: Vol. 43, 2024, No. 2, pp. 458–481

doi: 10.31577/cai_2024_2_458

Computing and Informatics

formerly Computers and Artificial Intelligence
