The 2019 Language and Intelligence Challenge is hosted by the Chinese Information Processing Society of China (CIPS) and the China Computer Federation (CCF), and organized jointly by Baidu Inc., the Committee on Evaluation of CIPS (CIPS CE), and the Technical Committee on Chinese Information Technology of CCF (CCF TCCI). Official registration opens on February 25, 2019. The winning teams will share a total bonus of 270,000 RMB, and the competition forum and award ceremony will be held at the Fourth Language & Intelligence Summit. All researchers and developers are welcome to join this competition.
The Language & Intelligence Summit, initiated by the Chinese Information Processing Society of China and the China Computer Federation, was held successfully three times from 2016 to 2018, bringing together researchers and experts from both academia and industry to discuss new technologies and developments in the field of language and intelligence. The Fourth Language & Intelligence Summit will be held in Beijing on August 24, 2019, featuring the latest developments and innovations in language and intelligence, as well as the 2019 Language and Intelligence Challenge.
Language is the most important medium of human communication. Building machines' ability to understand natural language and to interact in it is a key challenge on the way to artificial general intelligence (AGI). This year's challenge comprises three tasks: machine reading comprehension, knowledge-driven dialogue, and information extraction. Machine reading comprehension requires machines to read text and then answer questions about it; it tests machines' natural language understanding ability. Knowledge-driven dialogue is a human-machine dialogue task in which machines converse with humans based on a given knowledge graph; it tests machines' ability to conduct human-like conversations. Information extraction refers to the automatic extraction of knowledge, such as entities, their attributes, and relations, from natural language text, which enables machines to automatically construct knowledge graphs from massive text collections. These three tasks are leading-edge topics in natural language processing and artificial intelligence and are of great significance to search, recommendation, intelligent interaction, and other applications. We will provide large-scale Chinese datasets for all three tasks, and we hope this will advance the development of language understanding and artificial intelligence.
◇ Task1 - Machine Reading Comprehension
Given a question q and a set of documents D = d1, d2, ..., dn, a participating MRC system is expected to output an answer a that best answers q based on the evidence in D.
◇ Input/Output
Input: Question q and its corresponding evidence document set D.
Output: An answer a that best answers q according to the document set D.
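For concreteness, a single instance might look as follows. This is only a hypothetical sketch; the field names and structure here are illustrative assumptions, not the official DuReader release format.

```python
# A hypothetical MRC instance. Field names are illustrative only and
# do not reflect the official DuReader data format.
sample = {
    "question": "中国的首都是哪里？",
    "documents": [
        {"title": "北京", "paragraphs": ["北京是中华人民共和国的首都……"]},
        # ... up to 5 evidence documents per question
    ],
    "answer": "北京。",  # the system must produce a free-form answer like this
}
```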
◇ Dataset
This task is an extension of the 2018 NLP Challenge on Machine Reading Comprehension. The dataset contains about 280k questions sampled from real, anonymized user queries issued to Baidu Search. Each question has 5 corresponding evidence documents and human-generated answers. The dataset is divided into a training set (about 270k questions), a development set (about 3k questions), and a test set (about 7k questions). The training set is the same as in the 2018 competition; it has been released as part of DuReader and is available for free download. The development and test sets consist of complex questions that the winning systems in the 2018 competition failed to answer correctly. On these complex questions, MRC systems still perform substantially worse than humans, which makes them a real challenge for participating systems.
The new development and test sets will be available for download after the registration deadline.
◇ Evaluation Metrics
ROUGE-L and BLEU-4 are adopted as the basic evaluation metrics to measure the performance of participating systems. Results on the whole test set will be used for the final evaluation.
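To illustrate how the primary metric works, below is a minimal Python sketch of LCS-based ROUGE-L. It is not the official scoring script: the tokenization (characters vs. words) and the recall weight beta are assumptions, and the official evaluation may differ in both.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based ROUGE-L F-score for one candidate/reference pair.

    `candidate` and `reference` are token lists (e.g., Chinese characters);
    `beta` weights recall over precision. Both choices are assumptions.
    """
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(candidate), lcs / len(reference)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

# Example: character-level scoring of a short Chinese answer.
print(rouge_l(list("北京是中国的首都"), list("中国的首都是北京")))  # 0.625
```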
◇ Baseline Systems
Two open-source baseline systems will be provided; for details, please refer to the source code and the paper.
◇ Task2 - Knowledge-driven Dialogue
Given a dialogue goal g and a set of topic-related background knowledge M = f1, f2, ..., fn, a participating system is expected to output an utterance ut for the current conversation H = u1, u2, ..., ut-1 that keeps the conversation coherent and informative under the guidance of the given goal. During the dialogue, a participating system is required to proactively lead the conversation from one topic to another. The dialogue goal g is given in the form START -> TOPIC_A -> TOPIC_B, which means the machine should lead the conversation from any start state to topic A and then to topic B. The given background knowledge includes knowledge related to topic A, knowledge related to topic B, and the relations between these two topics.
◇ Input/Output
Input: Dialogue goal g, background knowledge M, and dialogue history H.
Output: An utterance ut appropriate for the current conversation H and the given goal g.
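A hypothetical instance of this task might look as follows; the structure and field names are illustrative assumptions, not the official data format.

```python
# A hypothetical knowledge-driven dialogue instance. Structure and field
# names are illustrative assumptions, not the official data format.
sample = {
    # Dialogue goal g: lead the chat from a cold start to topic A, then topic B.
    "goal": ["START", "泰坦尼克号", "莱昂纳多·迪卡普里奥"],
    # Background knowledge M: {entity, property, value} facts about A, B,
    # and the relation that links them.
    "knowledge": [
        ("泰坦尼克号", "主演", "莱昂纳多·迪卡普里奥"),
        ("泰坦尼克号", "评分", "9.4"),
        ("莱昂纳多·迪卡普里奥", "代表作", "荒野猎人"),
    ],
    # Dialogue history H = u1, ..., u(t-1); the system must produce ut.
    "history": ["你最近看电影了吗？", "看了一部经典老片。"],
}
```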
◇ Dataset
The background knowledge provided in the dataset is collected from the domain of movies and film stars, including information such as box office figures, directors, and reviews, organized as {entity, property, value} triples. The topics given in a dialogue goal are entities, i.e., movies or stars.
The dataset includes 100,000 training samples, 10,000 development samples, and 10,000 test samples.
◇ Evaluation Metrics
1. Automatic Evaluation Metrics
F1: character-level F-score of the output responses against the golden responses; the main metric for models.
BLEU: word-level precision of the output responses against the golden responses; an auxiliary metric for models.
DISTINCT: diversity of the output responses; an auxiliary metric for models.
Based on the evaluation results, we will rank all participating models on the leaderboard.
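To make the automatic metrics above concrete, here is a minimal sketch of the character-level F1 and DISTINCT-n computations. It is an illustration only, not the official evaluation script, which may normalize or tokenize responses differently.

```python
from collections import Counter

def char_f1(pred, gold):
    """Character-level F1 of a predicted response against a golden one.

    Overlap is counted on character multisets; the official script may
    normalize or tokenize differently.
    """
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def distinct_n(responses, n=2):
    """DISTINCT-n: unique n-grams divided by total n-grams across responses."""
    ngrams = []
    for resp in responses:
        ngrams.extend(tuple(resp[i:i + n]) for i in range(len(resp) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(char_f1("我也喜欢这部电影", "我很喜欢这部电影"))  # 0.875
print(distinct_n(["好的好的", "好的呀"], n=2))          # 0.6
```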
2. Human Evaluation
The top 10 performing models on the leaderboard will be further evaluated by humans on criteria including fluency, consistency, and proactivity.
The final rankings and winners will be determined based on the human evaluation results.
◇ Baseline Systems
Open-source baseline systems will be provided. For their implementation, please refer to updates on the official website.
◇ Task3 - Information Extraction
Given a sentence sent and a list of pre-defined schemas, each of which defines a relation P and the classes of its corresponding subject S and object O, for example (S_TYPE: Person, P: wife, O_TYPE: Person) or (S_TYPE: Company, P: founder, O_TYPE: Person), a participating IE system is expected to output all correct triples [(S1, P1, O1), (S2, P2, O2), ...] mentioned in sent under the constraints of the given schemas.
◇ Input/Output
Input: A list of pre-defined schemas and a sentence sent.
Output: Triples mentioned in sent under the constraints of the given schemas.
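For concreteness, a hypothetical instance might look as follows. The representation below is illustrative only, not the official SKE data format.

```python
# Hypothetical IE instance; the representation is illustrative only.
schemas = [
    {"subject_type": "人物", "predicate": "妻子", "object_type": "人物"},
    {"subject_type": "公司", "predicate": "创始人", "object_type": "人物"},
]
sent = "小明是小红的丈夫，也是ABC公司的创始人。"
# Expected output: every (S, P, O) in `sent` that satisfies some schema.
expected_triples = [
    ("小明", "妻子", "小红"),       # matches (人物, 妻子, 人物)
    ("ABC公司", "创始人", "小明"),  # matches (公司, 创始人, 人物)
]
```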
◇ Dataset
The SKE dataset used in this competition is the largest schema-based Chinese information extraction dataset in the industry, containing more than 430,000 SPO triples in over 210,000 real-world Chinese sentences, constrained by a pre-specified schema with 50 predicate types. All sentences in the SKE dataset are extracted from Baidu Baike and Baidu News Feeds. The dataset is divided into a training set (170k sentences), a development set (20k sentences), and a test set (20k sentences). The training and development sets are to be used for training and are available for free download. The test set is split into two parts: test set 1 is available to participants for self-verification, while test set 2 will be released one week before the end of the competition and used for the final evaluation.
◇ Evaluation Metrics
Precision, recall, and F1 score are used as the basic evaluation metrics to measure the performance of participating systems. The triples predicted by participating systems are exactly matched against the gold triples annotated in the test set. (To account for aliases, the alias dictionary of the Baidu Knowledge Graph will be used in the evaluation.)
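The scoring thus reduces to set comparison of exact-matched triples. Below is a minimal sketch under the simplifying assumption that aliases have already been resolved; the official evaluation additionally consults the Baidu Knowledge Graph alias dictionary for that step.

```python
def spo_prf(pred_triples, gold_triples):
    """Exact-match precision/recall/F1 over (S, P, O) triple sets.

    Assumes entity aliases are already resolved; the official evaluation
    uses the Baidu Knowledge Graph alias dictionary for that step.
    """
    pred, gold = set(pred_triples), set(gold_triples)
    tp = len(pred & gold)  # true positives: predicted triples that are gold
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

pred = [("小明", "妻子", "小红"), ("小明", "国籍", "美国")]
gold = [("小明", "妻子", "小红"), ("小明", "国籍", "中国")]
print(spo_prf(pred, gold))  # (0.5, 0.5, 0.5)
```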
◇ Baseline Systems
An open source information extraction baseline system will be released on the official website.
This competition will award one First Prize, two Second Prizes, and two Third Prizes for each task. Winners will receive award certificates issued by CIPS and CCF. The prizes and travel grants for attending the competition forum and award ceremony will be sponsored by Baidu.
◇ First Prize: 30,000 RMB + award certificate
◇ Second Prize: 20,000 RMB + award certificate
◇ Third Prize: 10,000 RMB + award certificate
◇ Feb 25, 2019: Registration opens (competition platform opens; sample data released);
◇ Mar 31, 2019: Registration deadline (full training data and the first batch of test data released to registered teams);
◇ May 13, 2019: Release of the final test datasets;
◇ May 20, 2019: Results submission deadline;
◇ May 31, 2019: Notification of winners (system reports and papers accepted);
◇ Jul 31, 2019: Camera-ready submission deadline;
◇ Aug 24, 2019: Competition forum and award ceremony at the Language & Intelligence Summit;
◇ Oct 2019: Oral presentations at the NLPCC 2019 workshop;
◇ Hosts
● Chinese Information Processing Society of China (CIPS)
● China Computer Federation (CCF)
◇ Organizers
● Baidu Inc.
● Committee on Evaluation of CIPS (CIPS CE)
● Technical Committee on Chinese Information Technology of CCF (CCF TCCI)
◇ Steering committee
● Le Sun, Institute of Software, Chinese Academy of Sciences
● Ming Zhou, Microsoft Research Asia
● Erhong Yang, Beijing Language and Culture University
● Dongyan Zhao, Peking University
● Hua Wu, Baidu Inc.
◇ Organizing committee
● Quan Wang, Baidu Inc.
● Weiwei Sun, Peking University
● Xianpei Han, Institute of Software, Chinese Academy of Sciences
● Nan Duan, Microsoft Research Asia
● Jing Liu, Baidu Inc.
● Wenquan Wu, Baidu Inc.
● Yabin Shi, Baidu Inc.
Official registration: Registration opened on Feb 25, 2019 and will close on Mar 31, 2019. Every member of a registered team that submits valid results will receive a customized competition T-shirt (one per person across the three tasks).
Registration Website: http://lic2019.ccf.org.cn/
If you have any question or suggestion, please feel free to contact us.
Contact email: lic2019@126.com