Proposal of a FACS Generation Model Based on Utterance Content Using a Language Model, 小橋龍人, 宇治川遥祐, 高汐一紀 (Keio Univ.), IEICE Technical Report, vol. 124, no. 143, August 2024

This study proposes a model that generates facial expressions from speech text. Whereas previous research has focused on generating facial animation from audio, this study generates expressions directly from text. The model outputs Action Units (AUs) defined by the Facial Action Coding System (FACS). To reduce computational cost and improve scalability, the proposed architecture uses only the encoder component of the Transformer, omitting the decoder.
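For concreteness, the following is a minimal sketch of such an encoder-only text-to-AU model in PyTorch. The report does not specify hyperparameters, so the vocabulary size, model width, maximum sequence length, and number of AUs below are all illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

NUM_AUS = 17  # assumption: number of FACS Action Units predicted per token

class TextToAUEncoder(nn.Module):
    """Encoder-only Transformer mapping a token sequence to per-token AU values."""

    def __init__(self, vocab_size=32000, d_model=256, nhead=4, num_layers=4, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, NUM_AUS)  # per-token AU intensities

    def forward(self, token_ids):
        x = self.embed(token_ids) + self.pos[:, : token_ids.size(1)]
        h = self.encoder(x)   # no decoder: the encoder output feeds the head directly
        return self.head(h)   # shape (batch, seq_len, NUM_AUS)
```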
The model is trained with a sliding-window approach, enabling it to generate an expression for each token in temporal order (see the sketches below). The training dataset was constructed by collecting publicly available videos from the web, detecting facial expressions in the frames, and transcribing the speech content.
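A sketch of how sliding-window training samples might be built, assuming each transcript has been tokenized and aligned with a per-token AU target sequence; the window length and stride are illustrative choices, not values from the report.

```python
def sliding_windows(token_ids, au_targets, window=32, stride=1):
    """Yield (tokens, AUs) pairs so the model learns per-token expressions in order."""
    assert len(token_ids) == len(au_targets)
    for start in range(0, max(1, len(token_ids) - window + 1), stride):
        yield token_ids[start:start + window], au_targets[start:start + window]
```

The report does not name the tools used for dataset construction. As one plausible realization only, the sketch below pairs OpenFace (a common open-source AU estimator) with Whisper for transcription; both tools, the binary path, and the alignment step are assumptions for illustration.

```python
import subprocess
import whisper  # pip install openai-whisper

def build_sample(video_path: str, openface_bin: str = "FeatureExtraction"):
    # 1) Estimate per-frame AU intensities with OpenFace (writes a CSV to its
    #    output directory); OpenFace is an assumed tool, not named in the report.
    subprocess.run([openface_bin, "-f", video_path, "-aus"], check=True)
    # 2) Transcribe the speech with word-level timestamps for later alignment.
    model = whisper.load_model("base")
    result = model.transcribe(video_path, word_timestamps=True)
    return result["segments"]  # align words with AU frames downstream
```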