Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

1University of Illinois Urbana-Champaign, 2Georgia Institute of Technology

Abstract

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling social interactions.

New Social Benchmarks

We introduce three new tasks to model multi-party social interactions: speaking target identification, pronoun coreference resolution, and mentioned player prediction. These tasks are challenging because they require understanding the fine-grained interplay of verbal and non-verbal cues exchanged between multiple people. To address these challenges, we highlight the need for densely aligned language-visual representations and introduce a novel baseline that leverages them, as illustrated below.
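For intuition, each task can be viewed as predicting a player-related label from an utterance and the synchronized visual scene. The schematic instances below are purely illustrative; the field names, sample utterances, and labels are our own assumptions, not the released annotation schema.

```python
# Purely illustrative task instances; field names and values are assumptions,
# not the actual annotation format of the benchmark.
speaking_target_example = {
    "utterance": "You were acting strange last round.",
    "speaker": "Player2",
    "label": "Player4",   # which listener the speaker is addressing
}

pronoun_coreference_example = {
    "utterance": "I think he is lying.",
    "pronoun": "he",
    "label": "Player3",   # which player the pronoun refers to
}

mentioned_player_example = {
    "utterance": "The quiet one has barely voted all game.",
    "label": "Player5",   # which player is being mentioned without a name
}
```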

Proposed Baseline Model

Our proposed baseline model leverages densely aligned representations to capture the fine-grained dynamics between multiple people. The language-visual alignment (grey) matches language and visual cues to identify individuals in both domains. The visual interaction modeling (green & purple) extracts visual relationships between the speaker and listeners from non-verbal cues, while the conversation context modeling (red) incorporates the surrounding language context from utterances. By fusing these aligned multimodal representations (blue), the model can correctly predict the referents for each task.
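For concreteness, the sketch below shows one possible way to wire these components together in PyTorch. All module names, feature dimensions, pooling choices, and the concatenation-based fusion are our own illustrative assumptions under the description above, not the authors' exact architecture.

```python
# A minimal PyTorch sketch of the described pipeline (illustrative only).
import torch
import torch.nn as nn

class DenselyAlignedBaseline(nn.Module):
    def __init__(self, d_lang=768, d_vis=512, d_model=512, n_players=6):
        super().__init__()
        # Language-visual alignment: project utterance and per-player visual
        # features (synchronized to the utterance) into a shared space.
        self.lang_proj = nn.Linear(d_lang, d_model)
        self.vis_proj = nn.Linear(d_vis, d_model)
        # Visual interaction modeling: relate the speaker to the listeners.
        self.visual_interaction = nn.TransformerEncoderLayer(
            d_model, nhead=8, batch_first=True)
        # Conversation context modeling: encode surrounding utterances.
        self.context_encoder = nn.TransformerEncoderLayer(
            d_model, nhead=8, batch_first=True)
        # Fusion of the aligned multimodal representations + referent classifier.
        self.fusion = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, n_players)

    def forward(self, utterance_feats, player_visual_feats):
        # utterance_feats:     (B, T, d_lang) current and surrounding utterances
        # player_visual_feats: (B, P, d_vis)  visual features aligned per player
        lang = self.context_encoder(self.lang_proj(utterance_feats))
        vis = self.visual_interaction(self.vis_proj(player_visual_feats))
        # Pool each stream and fuse (mean pooling is an assumption here).
        fused = self.fusion(torch.cat([lang.mean(dim=1), vis.mean(dim=1)], dim=-1))
        return self.classifier(fused)  # logits over candidate referent players
```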

Qualitative Results

Player numbers (Player#) are assigned in ascending order from left to right in the visual scenes shown here.

BibTeX

@inproceedings{lee2024modeling,
  title={Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations},
  author={Lee, Sangmin and Lai, Bolin and Ryan, Fiona and Boote, Bikram and Rehg, James M},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}