Speaker: Yixin Cao
Topic: From Evaluation To Evolvement: Auto-benchmarking of Large Language Models
Abstract: In recent years, the field of artificial intelligence (AI) has witnessed extraordinary achievements with large language models (LLMs). These models have demonstrated remarkable capabilities across a wide range of tasks, thanks to their massive training corpora. However, the development of model evaluation methods has lagged behind. On one hand, while new evaluation datasets are continuously being created, LLMs, following the scaling laws, often quickly outgrow these benchmarks, making the creation of new datasets a time-consuming and labor-intensive task. On the other hand, the risk of data leakage poses a challenge to current evaluation methods, as models may have already been exposed to test data during pre-training. To address these challenges, this presentation will introduce an innovative solution: an automated evaluation framework designed specifically for LLMs. The framework includes automatic dataset generation and a multi-agent-based automated measurement system. We will also analyze the robustness and reliability of this automated evaluation framework, and demonstrate how the results of automated evaluations can be leveraged to continuously optimize model training (i.e., model evolution). Finally, moving beyond mere scores, we endeavor to demystify the inner workings of vision-language models (VLMs), thereby enhancing transparency and understanding in this domain.
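To make the framework's two components concrete, here is a minimal sketch of the loop the abstract describes: generating a fresh test item from a seed document, then scoring an answer with several judge agents. Everything below is an illustrative assumption; chat() stands in for whichever LLM API is used, and none of the function names come from the actual framework.

```python
# Minimal sketch of an auto-benchmarking loop, assuming a generic
# chat(prompt) -> str wrapper around any LLM API (hypothetical names).

def chat(prompt: str) -> str:
    raise NotImplementedError("wire this to an LLM API of your choice")

def generate_item(seed_doc: str) -> dict:
    """Ask a generator model to produce a fresh test question from a seed
    document, so the item is unlikely to appear in pre-training data."""
    question = chat(f"Write one exam question grounded in this text:\n{seed_doc}")
    reference = chat(f"Answer using only this text:\n{seed_doc}\nQ: {question}")
    return {"question": question, "reference": reference}

def multi_agent_score(question: str, reference: str, answer: str,
                      n_judges: int = 3) -> float:
    """Average several judge agents' scores to reduce single-judge bias."""
    scores = []
    for i in range(n_judges):
        verdict = chat(
            f"You are judge #{i}. Score the answer from 0 to 10.\n"
            f"Q: {question}\nReference: {reference}\nAnswer: {answer}\n"
            "Reply with a single number."
        )
        scores.append(float(verdict.strip()))
    return sum(scores) / len(scores)
```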
Speaker: Pan Zhou
Topic: Watch Out the AI-Powered Judgment: Benchmarking MLLM-as-a-Judge and Unveiling LLM Vulnerabilities
Abstract: LLM-as-a-Judge leverages large language models (LLMs) to select the best response from a set of candidates for a given question, with applications in LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. Building on this, our work explores the extension of this concept to multimodal large language models (MLLMs), introducing MLLM-as-a-Judge as a novel benchmark to evaluate MLLMs’ judgment capabilities across multimodal tasks, including Scoring Evaluation, Pair Comparison, and Batch Ranking. While MLLMs exhibit human-like discernment in Pair Comparison tasks, we identify significant misalignments with human preferences in Scoring Evaluation and Batch Ranking due to biases, hallucinations, and inconsistencies. Furthermore, we investigate the vulnerabilities of the LLM-as-a-Judge, proposing JudgeDeceiver, an optimization-based prompt injection attack. JudgeDeceiver enables attackers to manipulate LLM-as-a-Judge to select specific, attacker-controlled responses, regardless of other candidates' quality. Our extensive evaluation shows JudgeDeceiver’s superiority over existing prompt injection and jailbreak attacks, and highlights the insufficiency of current defenses. This work underscores the need for both improving judgment capabilities and enhancing security mechanisms in LLMs and MLLMs acting as evaluators.
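For orientation, a hedged sketch of the three judgment settings the benchmark covers is given below. The judge() function is a placeholder for an (M)LLM call, and the prompt templates are illustrative, not the paper's actual ones.

```python
# Illustrative sketch of the three MLLM-as-a-Judge settings (assumed names).

from typing import List

def judge(prompt: str) -> str:
    raise NotImplementedError("wire this to an (M)LLM API")

def scoring_evaluation(question: str, response: str) -> float:
    """Grade one response on an absolute scale."""
    return float(judge(f"Score 1-10.\nQ: {question}\nA: {response}\nNumber only:"))

def pair_comparison(question: str, a: str, b: str) -> str:
    """Pick the better of two responses; the setting where the paper finds
    MLLM judgments closest to human preference."""
    return judge(f"Q: {question}\n(A) {a}\n(B) {b}\nReply 'A' or 'B':").strip()

def batch_ranking(question: str, responses: List[str]) -> List[int]:
    """Rank a whole candidate set at once; more exposed to position bias."""
    listing = "\n".join(f"[{i}] {r}" for i, r in enumerate(responses))
    order = judge(f"Q: {question}\n{listing}\nBest-to-worst indices, comma-separated:")
    return [int(x) for x in order.split(",")]
```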
Short Bio: Pan Zhou is currently a full professor and PhD advisor with the School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST), Wuhan, P.R. China. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology (Georgia Tech), Atlanta, USA, in 2011. He received his B.S. degree in the Advanced Class of HUST and an M.S. degree from HUST, Wuhan, China, in 2006 and 2008, respectively. His research interests include big data analytics and its security, and AI security. He received the “Rising Star in Science and Technology of HUST” award in 2017. He is currently an associate editor of two JCR Q1 journals: IEEE Transactions on Network Science and Engineering and the Elsevier Alexandria Engineering Journal. He has published over 200 papers at international conferences and in journals, including ACM CCS, USENIX Security, NeurIPS, ICML, IEEE CVPR, IEEE ICCV, ICLR, IEEE INFOCOM, IEEE ICDE, VLDB, IEEE/ACM TON, IEEE TIT, IEEE TIFS, and IEEE TKDE. His Google Scholar citation count exceeds 14,000. He has been selected among the World's Top 2% Scientists by Elsevier and The World's Best Computer Scientists by Research.com. He has 5 oral CCF-A papers from ICML/CVPR/ACM MM/ACL, 6 ESI Highly Cited Papers, and 3 ESI 1‰ Hot Papers. He has received four Best Paper Awards, from IEEE TETCI, IEEE TSE, IEEE ICIP, and IEEE ACTAI. He won the First Prize of the Hubei Science and Technology Progress Award in 2022 and the "Special Prize" in the National "Challenge Cup" Competition in 2023. He has served as a chair and TPC member for multiple international conferences, including IJCAI, AAAI, and CVPR.
Speaker: Yankai Lin
Topic: Tool Learning with Foundation Models
Abstract: In recent years, large models have demonstrated remarkable application value in various fields such as natural language processing, computer vision, and biology. Through large-scale pre-training, these models have acquired extraordinary abilities in understanding, reasoning, planning, and decision-making within complex interactive environments. This, in turn, showcases their immense potential in utilizing tools to solve complex tasks in real-world scenarios. This report focuses on tool learning with foundation models, introducing how large models can comprehend and use various tools to accomplish tasks. It covers the unified tool learning framework, its major challenges, recent important work, and future directions in this area.
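As a rough illustration of the tool-learning loop the report describes, the sketch below has a model choose between calling a registered tool and answering directly. The llm() wrapper, the JSON protocol, and the toy tool registry are assumptions for this example, not the unified framework's actual interface.

```python
# Hedged sketch of a tool-use loop: the model reads the task and history,
# either calls a tool or emits a final answer (all names hypothetical).

import json

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to an LLM API")

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
    "search": lambda query: f"(top result for: {query})",              # stub
}

def solve_with_tools(task: str, max_steps: int = 5) -> str:
    history = f"Task: {task}"
    for _ in range(max_steps):
        decision = llm(
            history
            + "\nEither call a tool as JSON {\"tool\": name, \"input\": str}"
            + " or answer as {\"answer\": str}."
        )
        step = json.loads(decision)
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])   # execute the chosen tool
        history += f"\nTool {step['tool']} -> {result}"  # feed result back
    return "No answer within step budget."
```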
Short Bio: Yankai Lin is an Assistant Professor at Renmin University of China and a member of the Science Committee of CCF NOI. He has published over 40 papers at top international conferences ranked CCF-A/B. As of July 2024, his work has been cited 14,109 times on Google Scholar, with an h-index of 44. His research areas include tool learning with foundation models and knowledge-driven artificial intelligence technologies. His achievements have been recognized with the First Prize for Natural Science of the Ministry of Education (third contributor) and a Leading Scientific Achievement Award at the 2022 World Internet Conference (third contributor). From 2020 to 2023, he was consecutively named among Elsevier's Highly Cited Chinese Researchers and Stanford University's World's Top 2% Scientists.
Speaker: Chen Qian
Topic: A Preliminary Exploration of the Scaling Laws for Large Model Agent Collaboration
Abstract: Contemporary large model-driven group collaboration aims to create a virtual team operated by multiple agents that, when human users present specific task requirements, autonomously generates comprehensive solutions through interactive collaboration. This approach achieves an efficient and economical reasoning process, opening new possibilities for automating the resolution of complex problems. Related technologies are expected to free human labor from traditionally heavy workloads, realizing the promising vision of "agents assisting human work". Building on the key technologies of large-model multi-agent collaboration, this report will introduce advances in interaction, collaboration, and evolution, and preliminarily explore the scaling laws of collaboration to guide the construction of efficient multi-agent systems.
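The shape of such a system can be suggested with a small sketch: agents in different roles repeatedly transform a shared artifact, and the number of roles and interaction rounds are exactly the knobs a collaboration scaling law would characterize. The roles and the chat() wrapper below are illustrative assumptions, not the actual design of ChatDev or related frameworks.

```python
# Hedged sketch of role-based multi-agent collaboration (hypothetical names).

def chat(system: str, message: str) -> str:
    raise NotImplementedError("wire this to an LLM API")

ROLES = [
    ("Analyst", "Refine the requirement into a concrete spec."),
    ("Programmer", "Write code satisfying the spec."),
    ("Reviewer", "Point out defects and return a corrected version."),
]

def collaborate(requirement: str, rounds: int = 2) -> str:
    artifact = requirement
    for _ in range(rounds):               # rounds and number of roles are the
        for role, instruction in ROLES:   # quantities a collaboration scaling
            artifact = chat(              # law would relate to solution quality
                system=f"You are the {role}. {instruction}",
                message=artifact,
            )
    return artifact
```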
Short Bio: Chen Qian holds a Ph.D. from the School of Software at Tsinghua University and currently serves as a postdoctoral researcher at the Tsinghua Natural Language Processing Lab (THUNLP) and a Shuimu Scholar at Tsinghua University. His primary research interests include pre-trained models, autonomous agents, and collective intelligence, under the supervision of Professors Maosong Sun and Zhiyuan Liu. He has published several first-author papers at international conferences and in journals spanning artificial intelligence, information management, and software engineering, such as ACL, SIGIR, ICLR, AAAI, and CIKM. In the field of collective intelligence, he has led the release of large language model-driven collaboration frameworks including ChatDev, Co-Learning, and MacNet.
Speaker: Hangyu Mao
Topic: From Reinforcement Learning-Based Agents to Large Language Model-Based Agents
Abstract: LLMs are among the most popular research topics in artificial intelligence, and AI Agents represent one of the most promising applications of LLMs. This talk will first review research on decision-making agents based on reinforcement learning, then introduce AI Agent research that leverages large language models, and finally share some research insights.
Short Bio: Hangyu Mao is the head of the Agent/RAG Group for KwaiYii Large Language Model at Kuaishou Technology. He has published more than 30 papers in CCF-A/B conferences and journals, and holds several US and Huawei high-potential patents. His research has been successfully applied in enterprise scenarios, delivering substantial benefits. He has won several awards, including the NeurIPS Reinforcement Learning Competition Championship, the China Computer Federation "Outstanding Doctoral Dissertation Award in Multi-Agent Research", and the Huawei "Innovative Pioneer President Award".
Speaker: Chen Gao
Topic: EmbodiedCity: Large Language Model Agent in Open-world Urban Environments
Abstract: Embodied intelligence is considered one of the most promising directions for artificial general intelligence, with human-like abilities to interact with the world. However, most existing work focuses on bounded indoor environments, with limited literature on open-world scenarios. To address this, we release a new benchmark platform, named EmbodiedCity, for embodied intelligence in urban environments. The platform includes a simulator and datasets covering representative tasks for embodied intelligence in an urban environment.
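As a purely hypothetical illustration of how an agent might drive such a platform, the sketch below wires a policy to a stubbed observation-action loop; the class and method names are invented for this example and are not EmbodiedCity's actual API.

```python
# Illustrative stand-in for an urban embodied-AI simulator loop
# (all names hypothetical; not EmbodiedCity's real interface).

class UrbanSimulator:
    """Stub exposing the usual reset/step cycle of an embodied environment."""
    def reset(self, task: str) -> str:
        return f"Initial street-level observation for task: {task}"
    def step(self, action: str) -> tuple[str, bool]:
        return f"Observation after '{action}'", action == "stop"

def run_episode(task: str, policy, max_steps: int = 20) -> None:
    sim = UrbanSimulator()
    obs = sim.reset(task)
    for _ in range(max_steps):
        action = policy(obs)          # e.g., an LLM mapping an observation to
        obs, done = sim.step(action)  # a navigation or interaction command
        if done:
            break

# Example: run_episode("find the nearest pharmacy", policy=lambda obs: "stop")
```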
Short Bio: Chen Gao is now a Faculty Member (Research-track AP) of BNRist, Tsinghua University. He obtained his Ph.D. Degree and Bachelor's Degree from the Department of Electronic Engineering, Tsinghua University in 2021 and 2016, respectively. His research primarily focuses on data mining, large language models, embodied agents, etc., with over 60 papers in top-tier venues (50+ CCF-A), attracting over 3,500 citations. He received the ACL 2024 Outstanding Paper Award, the SIGIR 2020 Best Short Paper Honorable Mention Award, and the Tsinghua University Outstanding Doctoral Dissertation Award, and was selected by Baidu Scholar among the Top 100 Chinese Rising Stars in Artificial Intelligence in 2021. He was a finalist for the 2021 China Computer Federation (CCF) Outstanding Doctoral Dissertation Award, and is on the Stanford/Elsevier Top 2% Scientists List 2024.