This study proposes a model that generates facial expressions from speech text. Whereas previous research has focused on generating facial animation from audio, this study generates expressions directly from text. The model outputs Action Units (AUs) defined by the Facial Action Coding System (FACS). To reduce computational cost and improve scalability, the proposed architecture uses only the encoder component of the Transformer, omitting the decoder.
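As a minimal sketch of such an encoder-only design, the following PyTorch module maps a token sequence to one AU-intensity vector per token. All hyperparameters (embedding size, number of AUs, layer counts) and module names here are illustrative assumptions, not values from this study.

```python
# Sketch: encoder-only Transformer regressing AU intensities per text token.
# Hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn

class TextToAU(nn.Module):
    def __init__(self, vocab_size, num_aus=17, d_model=256,
                 nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        # Encoder only: no decoder stack, which reduces parameter count
        # and requires a single forward pass per input window.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Regress one AU-intensity vector at every token position.
        self.au_head = nn.Linear(d_model, num_aus)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        h = self.encoder(x)          # (batch, seq_len, d_model)
        return self.au_head(h)       # (batch, seq_len, num_aus)
```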
The model is trained with a sliding-window approach, so it generates an expression for each token in temporal order. The training dataset was constructed by collecting publicly available videos from the web, detecting facial expressions in the video frames, and transcribing the accompanying speech.
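The sliding-window scheme might be realized as below: a hypothetical helper that slices an aligned token/AU sequence into fixed-length training windows. The names, window size, and stride are assumptions for illustration, not details from this study.

```python
# Hypothetical sliding-window slicing over an aligned (tokens, AU targets)
# sequence; window_size and stride are illustrative assumptions.
def sliding_windows(token_ids, au_targets, window_size=32, stride=1):
    """Yield (input window, target window) pairs so the model learns
    to predict an AU vector for every token inside each window."""
    assert len(token_ids) == len(au_targets)
    for start in range(0, len(token_ids) - window_size + 1, stride):
        end = start + window_size
        yield token_ids[start:end], au_targets[start:end]
```

During training, each window would be fed through the encoder and the per-token AU predictions compared against the detected AU targets with a regression loss such as mean squared error.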