◆ October 9, 2015
8:30-9:00, Opening Ceremony, Group Photo
Lecture 1: Search and Discovery for Big Data, by Xueqi Cheng, Institute of Computing Technology, Chinese Academy of Sciences
Lesson 1: 09:00-09:50, Top-k learning-to-rank and short-text topic modeling
Lesson 2: 10:10-11:00, Multi-context recommendation and social network analysis
Lesson 3: 11:20-12:00, Scalable influence maximization, popularity prediction and collective behavior analysis
Lecture 2: Key Technologies behind a Live Social Observatory System, by Tat-Seng Chua, National University of Singapore
Lesson 1: 14:00-14:50, Live social observatory system developed at NExT
Lesson 2: 15:10-16:00, Key research efforts to tackle the challenges in social media analysis
Lesson 3: 16:20-17:00, Summarization and QA
◆ October 10, 2015
Lecture 3: Construction and Mining of Text-Rich Heterogeneous Information Networks, by Jiawei Han, University of Illinois at Urbana-Champaign
Lesson 1: 09:00-09:50, Phrase mining and concept discovery from massive text data
Lesson 2: 10:10-11:00, Entity recognition and typing for construction of text-rich heterogeneous information networks
Lesson 3: 11:20-12:00, Mining text-rich heterogeneous information networks
Lecture 4: Big Data Intensive Computation: Concepts, Research Issues and Some Solutions, by Jianzhong Li, Harbin Institute of Technology
Lesson 1: 14:00-14:50, Big Data Intensive Computation: Concepts
Lesson 2: 15:10-16:00, Big Data Intensive Computation: Research Issues
Lesson 3: 16:20-17:00, Some Solutions from Harbin Institute of Technology
◆ October 11, 2015
Lecture 5: Learning to Process Natural Language in Big Data Environment, by Hang Li, Noah’s Ark Lab, Huawei Technologies
Lesson 1: 09:00-09:50, Deep Learning for Natural Language Processing: Essentials
Lesson 2: 10:10-11:00, Deep Learning for Natural Language Processing: Applications
Lesson 3: 11:20-12:00, Summarization and QA
Lecture 6: Deep Learning for Chinese Information Processing, by Chao Liu, Chief Scientist and the General Manager of the Data Science Department, Sogou
Lesson 1: 14:00-14:50, Basic embedding techniques
Lesson 2: 15:10-16:00, Feed-forward and recurrent neural networks for NLP applications
Lesson 3: 16:20-17:00, Self-built computing platform
17:00-17:20, Closing Ceremony
◇ Title: Search and Discovery for Big Data
◇ ABSTRACT: Big data opens a new era for data-driven scientific discovery and data-driven services. It is revolutionizing the paradigm of science and poses several critical issues for modern research. Firstly, the rate of data accumulation outpaces the rate of improvement in computational power. We have to develop faster algorithms or design new methods that work on only part of the whole data. Secondly, we lack a theory of data complexity that could guide us in designing good methods that balance data complexity and task quality. Finally, prediction-oriented tasks gradually dominate multiple scientific fields, requiring the capability to model complex interaction behavior or the mechanisms underlying complex systems. These issues motivate us to rethink our current research, fostering several new ideas and methods. In this talk, I will introduce our recent work on Web search and data mining. Specifically, this talk will cover top-k learning-to-rank, short-text topic modeling, multi-context recommendation, social network analysis, scalable influence maximization, popularity prediction, and collective behavior analysis. For each of these research efforts, I will particularly discuss how big data transforms our research and how we could respond to the challenges raised by big data.
◇ Short Bio: Dr. Xueqi Cheng is a professor in the Institute of Computing Technology, Chinese Academy of Sciences (CAS), and the director of the CAS Key Laboratory of Network Data Science and Technology. His main research areas include Web search and data mining, data science, big data systems, and social media analytics.
He is the general secretary of the CCF Task Force on Big Data and the vice-chair of the CIPS Task Force on Chinese Information Retrieval. He is an associate editor of IEEE Transactions on Big Data and an editorial board member of Journal of Computer Science and Technology and Chinese Journal of Computer. He was the general co-chair of WSDM’15, Steering Committee co-chair of IEEE Conference on Big Data, and a PC member of more than 20 conferences, including ACM SIGIR, WWW, ACM CIKM, ACL, IEEE ICDM, IJCAI, and ACM WSDM.
He has more than 100 publications and received the Best Paper Award at ACM CIKM’11 and the Best Student Paper Award at ACM SIGIR’12. He is the principal investigator of more than 10 major research projects funded by NSFC and MOST. He was awarded the NSFC Distinguished Youth Scientist (2014), the National Prize for Progress in Science and Technology (2012), and the China Youth Science and Technology Award (2011), among others.
◇ Title: Key Technologies Behind A Live Social Observatory System
◇ ABSTRACT: Given the popularity of social networks, users are sharing information on multiple aspects of their lives on a wide variety of social networks. For any given topic, there is a wide variety of both social and non-social information from multiple sources. The challenges in social media analysis are manifold. The first and most fundamental problem is the ability to gather “representative” data about the topic from multiple sources. This is particularly challenging for hot topics with many live data streams. Since the key difference between social media and Web retrieval is the presence of a huge amount of noise in social media data streams, the second challenge is the removal of noise, both in data and user accounts. With social media content becoming increasingly multimedia, the third challenge is how to infer user signals from non-textual content. The fourth challenge is the detection and tracking of sub-topics, along with insights on users’ sentiments, interests and demographics. Finally, most organizations would like to know which social media posts or sub-events related to them are likely to become viral, and what actions they can take.
This tutorial is divided into 3 parts. The first part describes a live social observatory system that we have developed at NExT, a joint Center between National University of Singapore and Tsinghua University. The second part details the key research efforts to tackle the above five challenges, including our research to transform unstructured social media data into descriptive, predictive and prescriptive analytics. The third part looks into the future by examining our achievements in the last 5 years. In the coming years, the social media networks will evolve from mere communication tools to co-creation and co-invention platforms with more live data streams; and from more analysis towards more predictive and prescriptive analytics.
◇ Short Bio: Dr Chua is the KITHCT Chair Professor at the School of Computing, National University of Singapore. He was the Acting and Founding Dean of the School during 1998-2000. Dr Chua's main research interest is in multimedia information retrieval and social media analysis. In particular, his research focuses on the extraction, retrieval and question-answering (QA) of text, video and live media arising from the Web and social networks. He is the Director of a multi-million-dollar joint Center between NUS and Tsinghua University in China to develop technologies for live media search. The project will gather, mine, search and organize user-generated content within the cities of Beijing and Singapore. His group participated regularly in TREC-QA and TRECVID evaluations in the early 2000s.
Dr Chua is active in the international research community. He has organized and served as program committee member of numerous international conferences in the areas of computer graphics, multimedia and text processing. He has served as conference co-chair of ACM Multimedia 2005, ACM CIVR (now ACM ICMR) 2005, ACM SIGIR 2008, and ACM Web Science 2015. He serves on the editorial boards of ACM Transactions on Information Systems (ACM), Foundations and Trends in Information Retrieval (NOW), The Visual Computer (Springer Verlag), and Multimedia Tools and Applications (Kluwer). He is the Chair of the steering committees of the ICMR (International Conference on Multimedia Retrieval) and Multimedia Modeling conference series, and a member of the International Review Panel of a large-scale research project in Europe. He is the co-Founder of two technology startup companies and an independent Director of a publicly listed company in Singapore.
◇ Title: Construction and Mining of Text-Rich Heterogeneous Information Networks
◇ ABSTRACT: Massive amounts of data are natural language text-based, unstructured, noisy, untrustworthy, but are interconnected, potentially forming gigantic, interconnected information networks. If such text-rich data can be processed and organized into multiple typed, semi-structured heterogeneous information networks, organized knowledge can be mined from such networks. Most real world applications that handle big data, including interconnected social networks, medical information systems, online e-commerce systems, or Web-based forum and data systems, can be structured into typed, heterogeneous social and information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, medication, and links such as visits, diagnosis, and treatments are intertwined together, providing rich information and forming heterogeneous information networks. Effective analysis of large-scale, text-rich heterogeneous information networks poses an interesting but critical challenge.
In this talk, we present an overview of recent studies on construction and mining of text-rich heterogeneous information networks. We show that relatively structured heterogeneous information networks can be constructed from unstructured, interconnected text data, and that such relatively structured, heterogeneous networks bring tremendous benefits for data mining. Departing from many existing network models that view data as homogeneous graphs or networks, the text-based, semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data. This heterogeneous network modeling will lead to the discovery of a set of new principles and methodologies for mining text-rich, interconnected data. We will also point out some promising research directions and argue that construction and mining of text-rich heterogeneous information networks could be a key to information management and mining.
◇ Short Bio: Jiawei Han is Abel Bliss Professor of Computer Science at the University of Illinois at Urbana-Champaign. His research covers data mining, information network analysis, database systems, and data warehousing, with over 700 journal and conference publications. He has chaired or served on many program committees of international conferences, including PC co-chair for KDD, SDM, and ICDM conferences, and Americas Coordinator for VLDB conferences. He also served as the founding Editor-In-Chief of ACM Transactions on Knowledge Discovery from Data and is serving as the Director of the Information Network Academic Research Center supported by the U.S. Army Research Lab, and Director of KnowEnG, a BD2K (Big Data to Knowledge) center supported by NIH. He is a Fellow of ACM and Fellow of IEEE. He received the 2004 ACM SIGKDD Innovations Award, the 2005 IEEE Computer Society Technical Achievement Award, the 2009 IEEE Computer Society Wallace McDowell Award, and the 2011 Daniel C. Drucker Eminent Faculty Award at UIUC. His book "Data Mining: Concepts and Techniques" has been widely used as a textbook worldwide.
◇ Title: Big Data Intensive Computation: Concepts, Research Issues and Some Solutions
◇ ABSTRACT: In recent years, big data intensive computation has become an important research area in response to the rapid growth of big data and the need for high-performance analysis of big data. Big data intensive computation differs from conventional computation in that it acquires and maintains continually changing big data and performs large-scale computations over that data. Big data intensive computation opens up new opportunities to achieve great advances in science, biology and health care, industry and business efficiencies, and so on. This talk will discuss the concepts, motivation, challenges and research issues of DISCS. Some solutions from Harbin Institute of Technology are also presented.
◇ Short Bio: Jianzhong Li is a professor in the Department of Computer Science and Engineering at Harbin Institute of Technology, China. He worked in the Department of Computer Science at Lawrence Berkeley National Laboratory in the USA, as a scientist, from 1986 to 1987 and from 1992 to 1993. He was also a visiting professor at the University of Minnesota at Minneapolis, Minnesota, USA, from 1991 to 1992 and from 1998 to 1999. His research interests include massive data intensive computing and wireless sensor networks. He has published more than 200 papers in refereed journals and conference proceedings, such as VLDB Journal, Algorithmica, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Parallel and Distributed Systems, SIGMOD, SIGKDD, VLDB, ICDE, and INFOCOM. His papers have been cited more than 12000 times. He has been involved in the program committees of major computer science and technology conferences, including SIGMOD, VLDB, ICDE, INFOCOM, ICDCS, and WWW. He has also served on the editorial boards of distinguished journals, including IEEE Transactions on Knowledge and Data Engineering, and refereed papers for various journals and proceedings.
◇ Title: Learning to Process Natural Language in Big Data Environment
◇ ABSTRACT: With big data and deep learning (DL), natural language processing (NLP) is entering a new era. In fact, among many machine learning techniques, DL stands out as the most effective and promising approach to learning complicated models for building intelligent systems. The combination of big data and DL provides tremendous new opportunities for making breakthroughs in NLP. Indeed, NLP has seen significant progress in recent years, with advanced deep learning methods developed and big data utilized. In this lecture, I will give a survey of deep learning for natural language processing (DL4NLP), including some of the work done at Huawei Noah’s Ark Lab. I will particularly focus on four major applications in NLP, namely search, question answering, natural language dialogue, and machine translation. I will conclude my lecture by summarizing the challenges and opportunities in research on DL4NLP.
◇ Short Bio: Hang Li is director of the Noah’s Ark Lab of Huawei Technologies. His research areas include information retrieval, natural language processing, statistical machine learning, and data mining. He graduated from Kyoto University in 1988 and earned his PhD from the University of Tokyo in 1998. He worked at the NEC lab in Japan from 1991 to 2001, and at Microsoft Research Asia from 2001 to 2012. He joined Huawei Technologies in 2012. Hang has published three technical books and more than 100 scientific papers in top international journals and conferences, including SIGIR, WWW, WSDM, ACL, EMNLP, ICML, NIPS, and SIGKDD. Papers by him and his colleagues received the SIGKDD’08 best application paper award, the SIGIR’08 best student paper award, and the ACL’12 best student paper award. Hang worked on the development of several products such as Microsoft SQL Server 2005, Microsoft Office 2007 and Office 2010, Microsoft Live Search 2008, Microsoft Bing 2009, and Bing 2010. He has more than 35 granted US patents. Hang has also been very active in the research communities and is serving top international conferences as PC chair, Senior PC member, or PC member, including SIGIR, WWW, WSDM, ACL, EMNLP, NIPS, SIGKDD, and ICDM, and top international journals as associate editor, including CL, IRJ, TIST, JASIST, and JCST.
◇ Title: Deep Learning for Chinese Information Processing
◇ ABSTRACT: This tutorial surveys recent developments in Chinese Information Processing, as powered by the advancement of deep learning techniques. We focus on the paradigm shift from traditional statistical models to uniform deep neural networks, and demonstrate the resulting differences in various applications. We start by explaining five basic embedding techniques that map words, characters, and radicals into numerical vectors, and illustrate how to build feed-forward and recurrent neural networks on top of these vectors for different applications, such as Chinese word segmentation, part-of-speech and entity tagging, machine translation, and search ranking. Finally, we briefly review our self-built computing platform consisting of hundreds of GPUs, which powers all the work discussed above.
◇ Short Bio: Chao Liu is the Chief Scientist and the General Manager of the Data Science Department at Sogou Inc. Dr. Liu was elected to China's "Top 1000 Young Talents", the highest honor for young overseas researchers returning to China. Before coming back to China, he was a researcher and manager of the Data Intelligence Group in Microsoft Research at Redmond. His research has focused on Web search/ads and data mining, with about 40 conference/journal publications and many research results transferred to the Microsoft Bing, Tencent Soso, and Sogou search engines. Dr. Liu has been on the program and organizing committees of many conferences, including SIGIR, SIGKDD, and WWW, and actively campaigns for mutually beneficial ties between academia and industry. Dr. Liu earned his PhD in Computer Science from the University of Illinois at Urbana-Champaign in 2007, and his B.S. in Computer Science from Peking University in 2003.