Human perception of the surrounding environment, acquisition of information and communication are all multimodal. Going beyond research that has made breakthroughs in single modalities over the past decade, it is of great interest to explore multimodal intelligence that involves multiple modalities of information, e.g., vision, speech and language. Moreover, the rise of multimodal social networks, short-video applications, video conferencing, live video streaming and digital humans strongly demands the development of multimodal intelligence and offers fertile ground for multimodal analysis. This workshop aims to provide a forum for the community to exchange ideas and the latest advances in multimodal-assisted natural language processing, including emotion recognition, machine translation, information extraction, cross-modal analysis and more.
◇ 14:00-14:05, Opening
◇ 14:05-14:40, Multimodal Emotion Recognition, Qin Jin (Renmin University of China)
◇ 14:40-15:15, Multimodal Neural Machine Translation, Jinsong Su (Xiamen University)
◇ 15:15-15:50, Vision to Language: from Independency, Interaction, to Symbiosis, Ting Yao (JD AI Research)
◇ 15:50-16:25, Multimodal Information Extraction for Social Media Posts, Jianfei Yu (Nanjing University of Science and Technology)
◇ 16:25-17:00, Speech and NLP Research at Huawei Noah’s Ark Lab, Yichun Yin (Huawei Noah’s Ark Lab)
Time: 14:05-14:40, 15th October, 2021
Title: Multimodal Emotion Recognition
Abstract: Automatic emotion recognition is an indispensable ability of intelligent human-computer interaction systems. The behavioral signals of human emotion expression are multimodal, including voice, facial expression, body language, bio-signals, etc. Our research therefore focuses on robust emotion recognition via multimodal fusion. This talk will present our recent work on multimodal emotion recognition from different perspectives.
Speaker: Qin Jin is a full professor in the School of Information at Renmin University of China (RUC), where she leads the AI·M3 lab. She received her Ph.D. degree from Carnegie Mellon University in 2007. Before joining RUC in 2013, she was a research faculty member (2007-2012) at Carnegie Mellon University and a research scientist (2012) at IBM China Research Lab. Her research interests are in intelligent multimedia computing and human-computer interaction. Her team's recent work on video understanding and multimodal affective analysis has won various awards in international challenge evaluations, including the CVPR ActivityNet Dense Video Captioning Challenge, the NIST TRECVID VTT evaluation, and the ACM Multimedia Audio-Visual Emotion Challenge.
Time: 14:40-15:15, 15th October, 2021
Title: Multimodal Neural Machine Translation
Abstract: With the rapid development of deep learning, research on neural machine translation (NMT) has shifted from traditional purely textual NMT to multimodal NMT. In this talk, I will first give the task definition of multimodal NMT, then summarize the typical studies in this area, and finally look ahead to future work.
Speaker: Jinsong Su is a Professor at the School of Informatics, Xiamen University. He received his Ph.D. from the Chinese Academy of Sciences in 2011. His research interests mainly include natural language processing, neural machine translation and text generation. He has published more than 70 CCF-A/B papers, won the Hanwang Youth Innovation Excellence Award, and is supported by the Fujian Outstanding Youth Fund.
Time: 15:15-15:50, 15th October, 2021
Title: Vision to Language: from Independency, Interaction, to Symbiosis
Abstract: Vision and language are two fundamental capabilities of human intelligence. Humans routinely perform tasks through the interaction between vision and language, which supports the uniquely human capacity to talk about what we see. This motivates researchers to expand the horizons of such cross-modal analysis. In particular, vision to language has probably been one of the hottest topics in the past five years, with significant growth in both the volume of publications and the breadth of applications. In this talk, we look into the problem of vision to language from three different perspectives: 1) Independency – aiming for a thorough image/video understanding for language generation; 2) Interaction – exploring the (1st, 2nd, …) interactions across vision and language inputs; 3) Symbiosis – learning a universal encoder-decoder structure for vision-language tasks. Moreover, we will also discuss real-world deployments and services of vision to language.
Speaker: Ting Yao is currently a Principal Researcher in the Vision and Multimedia Lab at JD AI Research, Beijing, China. His research interests include video understanding, vision and language, and deep learning. Prior to joining JD.com, he was a Researcher with Microsoft Research Asia, Beijing, China. Ting is the lead architect of several top-performing multimedia analytics systems in international benchmark competitions, such as the ActivityNet Large Scale Activity Recognition Challenge 2016-2019, the Visual Domain Adaptation Challenge 2017-2019, and the COCO Image Captioning Challenge. He is also the lead organizer of the MSR Video to Language Challenge at ACM Multimedia 2016 & 2017, and built MSR-VTT, a large-scale video-to-text dataset that is widely used worldwide. His work has led to many awards, including the ACM SIGMM Outstanding Ph.D. Thesis Award 2015, the ACM SIGMM Rising Star Award 2019, and the IEEE TCMC Rising Star Award 2019. He is an Associate Editor of IEEE Transactions on Multimedia.
Time: 15:50-16:25, 15th October, 2021
Title: Multimodal Information Extraction for Social Media Posts
Abstract: Recent years have witnessed the explosive growth of multimodal user-generated content on social media platforms such as Facebook, Instagram, Twitter and Snapchat. The analysis of social media posts now has to take into consideration not only text but also other modalities of data such as images and videos. In this talk, I will first share some recent progress in multimodal information extraction from social media posts, then report our recent work on multimodal named entity recognition and entity-level sentiment analysis. Finally, I will discuss several limitations of current studies and point out some promising future directions.
Speaker: Jianfei Yu is an Associate Professor at the School of Computer Science and Engineering, Nanjing University of Science and Technology (NJUST). He obtained his Ph.D. from Singapore Management University (SMU) in 2018 and worked at SMU as a Research Scientist before joining NJUST. His research focuses on deep learning and transfer learning for natural language processing tasks, including information extraction, sentiment analysis, question answering and social media analytics. He served on the Virtual Infrastructure Committee of ACL 2021 and gave a half-day tutorial on Fine-Grained Opinion Mining at IJCAI 2019.
Time: 16:25-17:00, 15th October, 2021
Title: Speech and NLP Research at Huawei Noah’s Ark Lab
Abstract: The Noah's Ark Lab is the AI research center of Huawei Technologies. Its mission is to make significant contributions to both the company and society through innovation in AI. In this talk, I will introduce some of the important research work of Noah's Ark Lab in speech and NLP, covering topics such as speech processing, pretrained language models and multimodal processing.
Speaker: Yichun Yin is a researcher at Huawei Noah's Ark Lab. He received his Ph.D. from the School of EECS at Peking University in 2018, where he was advised by Ming Zhang. Before that, he was an undergraduate student at the University of Science and Technology Beijing.