Current methods largely ignore inter-speaker dependencies when
classifying emotions in conversations. In this paper, we address the task of recognizing utterance-level
emotions in dyadic conversational videos. We propose a deep neural framework, termed
Conversational Memory Network, which leverages contextual information from the
conversation history. The framework takes a multimodal approach comprising audio, visual …