Best Papers:
Dec. 5, 2016(10:50–11:50)
Yunan International Conference Center(云安国际会议厅,2号厅),
Chair: Nianwen XUE
10:50-11:20 | Zhicheng He, Jie Liu, Caihua Liu, Yuan Wang, Airu Yin and Yalou Huang. ABSTRACT: Non-negative Matrix Factorization (NMF) can learn interpretable parts-based representations of natural data, and is widely applied in data mining and machine learning. However, NMF does not always achieve good performance, as the non-negative constraint leads learned features to be non-orthogonal and to overlap in semantics. How to improve the semantic independence of latent features without decreasing the interpretability of NMF is still an open research problem. In this paper, we put forward dropout NMF and its extension, sequential NMF, to enhance the semantic independence of NMF. Dropout NMF prevents the co-adaptation of latent features to reduce ambiguity, while sequential NMF can further promote the independence of individual latent features. The proposed algorithms differ from traditional regularized and weighted methods, because they require no prior knowledge and bring in no extra constraints or transformations. Extensive experiments on document clustering show that our algorithms outperform baseline methods and can be seamlessly applied to NMF-based models. |
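As a rough illustration of the dropout idea applied to NMF (a minimal sketch under our own assumptions, not the authors' code; the masking scheme and all names are illustrative), a random binary mask over latent features can be woven into standard multiplicative NMF updates so that dropped features sit out an iteration rather than co-adapt:

```python
import numpy as np

def dropout_nmf(V, k, p_keep=0.8, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 0.1   # basis (parts) matrix
    H = rng.random((k, n)) + 0.1   # coefficient matrix
    for _ in range(n_iter):
        keep = rng.random(k) < p_keep          # dropout mask over latent features
        Wm, Hm = W * keep, H * keep[:, None]   # dropped features contribute 0
        # Multiplicative updates (Lee & Seung), applied to kept features only,
        # so rows/columns of dropped features are left untouched this round.
        H = np.where(keep[:, None], H * (Wm.T @ V) / (Wm.T @ Wm @ H + eps), H)
        W = np.where(keep, W * (V @ Hm.T) / (W @ Hm @ Hm.T + eps), W)
    return W, H

# Toy usage: factorize a small non-negative "document-term" matrix.
V = np.abs(np.random.rand(20, 30))
W, H = dropout_nmf(V, k=5)
print(np.linalg.norm(V - W @ H))  # reconstruction error
```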
11:20-11:50 | Tao Ge, Lei Cui, Heng Ji, Baobao Chang and Zhifang Sui. ABSTRACT: We study an open text mining problem -- discovering concept-level event associations from a text stream. We investigate the importance and challenge of this task and propose a novel solution using event sequential patterns. The proposed approach can discover important event associations that are expressed implicitly. The discovered event associations are general and useful as knowledge for applications such as event prediction. |
Fundamentals:
Dec. 5, 2016(13:30–15:10)
No.1, 2nd Floor, Yunan Auditorium(云安会堂2楼1号), Chair: Guangyou ZHOU
13:30-13:50 | Junjie Yu, Wenliang Chen, Zhenghua Li and Min Zhang. ABSTRACT: In this paper, we present an approach to building dependency parsers for resource-poor languages without any annotated resources on the target side. Compared with previous studies, our approach requires fewer human-annotated resources. In our approach, we first train a POS tagger and a parser on the source treebank. Then, they are used to parse the source sentences in bilingual data. We obtain auto-parsed sentences (with POS tags and dependencies) on the target side by projection techniques. Based on the fully projected sentences, we can train a base POS tagger and a base parser on the target side. However, most sentence pairs are not fully projected, so we obtain many partially projected sentences. To make full use of partially projected sentences, we implement a learning algorithm to train POS taggers, which leads to better parsing performance. We further exploit a set of features from large-scale monolingual data to help parsing. Finally, we evaluate our proposed approach on the Google Universal Treebank (v2.0, standard). The experimental results show that the proposed approach can significantly improve parsing performance. |
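The projection step at the heart of this approach can be pictured as follows (an illustrative sketch, not the paper's implementation; the function and data layout are assumptions): POS tags travel from parsed source tokens to target tokens along word-alignment links, and a sentence counts as "fully projected" only when every target token receives a tag.

```python
# Project POS tags from a parsed source sentence onto its target translation
# through word-alignment links (illustrative helper, names are assumptions).
def project_pos(src_tags, alignment, tgt_len):
    """src_tags: POS tags of source tokens.
    alignment: list of (src_idx, tgt_idx) word-alignment links.
    Returns projected tags (None where nothing projects)."""
    tgt_tags = [None] * tgt_len
    for s, t in alignment:
        tgt_tags[t] = src_tags[s]
    return tgt_tags

tags = project_pos(["PRON", "VERB", "NOUN"], [(0, 0), (1, 2), (2, 1)], 3)
fully_projected = all(t is not None for t in tags)  # usable for base training
```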
13:50-14:10 | Taizhong Wu, Jian Liu, Xuri Tang, Min Gu, Yanhui Gu, Junsheng Zhou and Weiguang Qu. ABSTRACT: The development of society and technology generates more Nominal Compounds to represent new concepts in various domains. Earlier literature in linguistic studies has gathered and established several syntactic categories of Nominal Compounds, which can be used for automatic syntactic categorization of these compounds. This paper focuses on Nominal Compounds of head-modifier construction, because experiments show that most Nominal Compounds are head-modifier constructions. Based on the combination of templates and word similarity, this paper proposes an algorithm for automatic semantic interpretation which improves the recall ratio while maintaining the precision ratio. The results of syntactic categorization and automatic semantic interpretation of the Nominal Compounds are also applied in dependency parsing and machine translation. |
14:10-14:30 | Yatian Shen, Jifan Chen and Xuanjing Huang. ABSTRACT: Semantic interaction between text segments, which has been proven to be very useful for detecting paraphrase relations, is often ignored in the study of paraphrase identification. In this paper, we adopt a neural network model for paraphrase identification, called the bidirectional Long Short-Term Memory-Gated Relevance Network (Bi-LSTM+GRN). In this model, a gated relevance network is used to capture the semantic interaction between text segments, and a pooling layer then aggregates these interactions to select the most informative ones. Experiments on the Microsoft Research Paraphrase Corpus (MSRP) benchmark dataset show that this model achieves better performance than hand-crafted feature based approaches as well as previous neural network models. |
14:30-14:50 | Yang Du, Hua Yuan and Yu Qian. ABSTRACT: The discovery of new words is of great significance for natural language processing of the Chinese language. In recent years, word vectors trained with neural network language models have been shown to capture semantic relationships well. Accordingly, we apply word vectors to Chinese new word discovery for the first time. In particular, we propose a new unsupervised method of new word discovery based on the classical n-gram method, which trains word vectors from massive text and then prunes candidate words according to their word vectors. Compared to unsupervised methods such as mutual information and adjacent entropy, the experimental results show substantial improvements. |
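For context, the classical n-gram statistics this method builds on can be sketched as follows (illustrative only; names, thresholds and the toy text are assumptions): candidate character pairs are scored by pointwise mutual information, and low-scoring candidates would then be pruned, e.g. by word vectors as the paper proposes.

```python
import math
from collections import Counter

def bigram_pmi(chars):
    """Score adjacent character pairs by pointwise mutual information."""
    uni, bi = Counter(chars), Counter(zip(chars, chars[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {
        (a, b): math.log((c / n_bi) / ((uni[a] / n_uni) * (uni[b] / n_uni)))
        for (a, b), c in bi.items()
    }

text = list("深度学习深度网络学习模型")
scores = bigram_pmi(text)  # high-PMI pairs are new-word candidates
```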
14:50-15:10 | Minghua Nuo, Congjun Long and Huidan Liu. ABSTRACT: This paper presents an identification framework for extracting Tibetan multi-word expressions (MWEs). The framework includes two phases. In the first phase, sentences are segmented and high-frequency word-based n-grams are extracted using Nagao's N-gram statistical algorithm and the Statistical Substring Reduction Algorithm. In the second phase, Tibetan MWEs are identified by the proposed framework, which is based on the combination of context analysis and language model-based analysis. Context analysis, two-word coupling degree and Tibetan syllable inside-word probability are the three strategies in the Tibetan MWE identification framework. In the experimental part, we evaluate the effectiveness of the three strategies on small test data, and evaluate the results of different granularities for context analysis. On the small test corpus, an F-score above 75% has been achieved when words are segmented in pre-processing. On a larger corpus, P@N (N = 800) exceeds 85%, indicating that the identification framework can work well on larger corpora. The experimental results reach acceptable performance for Tibetan MWEs. |
Machine Translation I:
Dec. 5, 2016(13:30–15:10)
No.2, 2nd Floor,Yunan Auditorium(云安会堂2楼2号),
Chair: Dongdong ZHANG
13:30-13:50 | Qiang Li, Dongdong Zhang, Mu Li, Tong Xiao and Jingbo Zhu. ABSTRACT: Word deletion (WD) problems have a critical impact on the adequacy of translation and can lead to poor comprehension of lexical meaning in the translation result. This paper studies in detail how the word deletion problem can be handled in statistical machine translation (SMT). We classify this problem into desired and undesired word deletion based on spurious and meaningful words. Consequently, we propose four effective models to handle undesired word deletion. To evaluate word deletion problems, we develop an automatic evaluation metric that highly correlates with human judgement. Translation systems are simultaneously tuned for the proposed evaluation metric and BLEU using minimum error rate training (MERT). The experimental results demonstrate that our methods achieve significant improvements on word deletion problems in Chinese-to-English translation tasks. |
13:50-14:10 | Guoping Huang, Jiajun Zhang, Yu Zhou and Chengqing Zong. ABSTRACT: Terms extensively exist in specific domains, and term translation plays a critical role in domain-specific statistical machine translation (SMT) tasks. However, it is challenging to extract term translation knowledge from parallel sentences because of error propagation in the SMT training pipeline. In this paper, we propose a simple, straightforward and effective model to mitigate the error propagation and improve the quality of term translation. The proposed model goes from an initial weak monolingual detection of terms based on naturally annotated resources (e.g. Wikipedia) to a stronger bilingual joint detection of terms, and allows the word alignment to interact with term detection. Extensive experiments show that our method substantially boosts the performance of bilingual term detection by more than 8 points of absolute F-score. The term translation quality is substantially improved by more than 3.66% in accuracy, and sentence translation quality is significantly improved by 0.38 absolute BLEU points, compared with a strong baseline, i.e. well-tuned Moses. |
14:10-14:30 | Shaohui Kuang and Deyi Xiong. ABSTRACT: Neural machine translation (NMT) is an emerging machine translation paradigm that translates texts with an encoder-decoder neural architecture. Very recent studies find that translation quality drops significantly when NMT translates long sentences. In this paper, we propose a novel method to deal with this issue by segmenting long sentences into several clauses. We introduce a split and reordering model to collectively detect the optimal sequence of segmentation points for a long source sentence. Each segmented clause is translated by the NMT system independently into a target clause. The translated target clauses are then concatenated without reordering to form the final translation of the long sentence. On NIST Chinese-English translation tasks, our segmentation method achieves a substantial improvement of 2.94 BLEU points over the NMT baseline on translating long sentences with more than 30 words, and 5.43 BLEU points on sentences of over 40 words. |
14:30-14:50 | Longhua Qian, Jiaxin Liu, Guodong Zhou and Qiaoming Zhu. ABSTRACT: Active learning is an effective machine learning paradigm which can significantly reduce the amount of labor for manually annotating NLP corpora while achieving competitive performance. Previous studies on active learning focus on corpora in a single language or in two languages translated from each other. This paper proposes a Bilingual Parallel Active Learning paradigm (BPAL), where an instance-level parallel Chinese and English corpus adapted from OntoNotes is augmented for relation extraction, and both the seeds and the jointly selected unlabeled instances at each iteration are parallel between the two languages in order to enhance active learning. Experimental results on the task of relation classification on this corpus demonstrate that BPAL can significantly outperform monolingual active learning. Moreover, the success of BPAL suggests a new way of annotating parallel corpora for NLP tasks in order to induce two high-performance classifiers in two languages respectively. |
14:50-15:10 | Qingsong Ma. ABSTRACT: We propose a novel metric for machine translation evaluation based on neural networks. In the training phase, we maximize the distance between the similarity scores of high- and low-quality hypotheses. Then, the trained neural network is used to evaluate new hypotheses in the testing phase. The proposed metric can efficiently incorporate lexical and syntactic metrics as features in the network and is thus able to capture different levels of linguistic information. Experiments on WMT-14 show that state-of-the-art performance is achieved in two out of five language pairs at the system level and in one at the segment level. Competitive results are also achieved in the remaining language pairs. |
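The training objective described above resembles a pairwise ranking loss. A hedged sketch (our own reading, not the authors' network; scores and margin are illustrative): a hinge loss pushes the score of a high-quality hypothesis above that of a low-quality one by a margin.

```python
import numpy as np

def pairwise_hinge(score_good, score_bad, margin=1.0):
    """Zero loss once the good hypothesis outscores the bad one by `margin`."""
    return np.maximum(0.0, margin - (score_good - score_bad))

# Toy batch: network scores for (high-quality, low-quality) hypothesis pairs.
good = np.array([0.9, 0.4, 0.7])
bad = np.array([0.2, 0.5, 0.6])
loss = pairwise_hinge(good, bad).mean()  # quantity minimized during training
```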
Student Workshop:
Dec. 5, 2016(13:30–15:10)
No.4, 2nd Floor,Yunan Auditorium(云安会堂2楼4号),
Chair: Yajuan LYU
13:30-15:10 | Invited Talk: How to Write an NLP Paper. Speaker: Yang Liu, Tsinghua University.
Panel: Career Selection for NLP Students: Academia vs. Industry
Evaluation Workshop:
Dec. 5, 2016(13:30–15:30)
No.6, 2nd Floor,Yunan Auditorium(云安会堂2楼6号),
Chair: Xiaojun WAN
13:30-13:35 | Overview of Baidu Cup: Entity Search. Ke Sun
13:35-13:47 | Le Li, Junyi Xu, Weidong Xiao, Shengze Hu and Haiming Tong. ABSTRACT: Entity search, which aims to retrieve entities matching a query, has received broad attention and research. Conventional methods focus on entity search over a local dataset, e.g. the INEX Wikipedia test collection, where the descriptions of entities are given and the relationships between entities are also known. In this paper, we propose an entity search method to handle real-world queries, for which the related descriptions of entities must be crawled and relations between entities constructed manually. By mining historical query records and offline data, our method builds an entity relationship network to model the similarity of entities, and converts the entity search problem into a within-network classification problem, which admits many novel solutions. We then use the entity-relationship-based approach as an offline solution and an external-knowledge-based approach as an online solution to build an ensemble classifier for the entity search problem. Comprehensive experiments on a real-world dataset demonstrate that our method can deal with the entity search task effectively and obtain satisfactory performance. |
13:47-13:59 | Kerui Min, Chenghao Dong, Shiyuan Cai and Jianhao Chen. ABSTRACT: As a sub-field of information retrieval, entity search, which answers users' queries with entities, is very useful across various vertical domains. In this paper we propose a flexible and sentiment-aware framework for entity search. Our approach achieved an average MAP score of 0.7044 in the NLPCC Baidu Challenge 2016 competition, obtaining 3rd place among 174 teams. |
13:59-14:11 | Teng Wang, Xun Ma, Pengyan Sun and Zhian Dong. ABSTRACT: Given a query, the goal is to search for entities that conform to the description facts in a given set. To this end, this paper proposes a matching method based on classification and semantic extension. The algorithm first classifies the query string into three categories and extracts keywords for each category of query. The keywords are then extended into a matching word set based on the word2vec word-vector model. Finally, we calculate the score of every entity with a weighted matching method and rank the results by score. In experiments, the method achieves an accuracy of 63.2%, shows good applicability, and to a certain extent reduces the retrieval failure rate caused by colloquial and diverse queries. |
14:11-14:23 | A Deep Feature Fusion Ranking Method for Entity Search. Lei Yu
14:23-14:28 | Overview of Sports News Generation from Live Webcast Scripts. Xiaojun Wan
14:28-14:40 | Tang Renjun, Zhang Ke, Na Shenruoyang, Yang Minghao, Zhou Hui, Zhu Qingjie, Zhan Yongsong and Tao Jianhua. ABSTRACT: Automatically generating sports news from webcast scripts poses two challenges: (1) finding hot events and sentences accurately; (2) organizing the selected sentences with high readability. This paper proposes a framework to generate sports news automatically. First, to obtain accurate hot events and sentences, we design a neural network to predict the probability that each statement in a live webcast script appears in the written news, where the inputs of the neural network are weighted word vectors obtained from a football keyword dictionary, and the outputs are the similarities between statements in training live webcast scripts and sentences in training news. In this way, the "good" sentences selected from the webcast contribute to the semi-finished sports news. To make the generated news as similar to human writing as possible, we adopt idioms that often appear in football games to describe or summarize the game's developments and turns between the selected sentences, producing the final sports news. The proposed framework is validated on the training and test datasets provided by the "Sports News Generation from Live Webcast Scripts" task of NLPCC 2016; the experiments show that the proposed method performs well. |
14:40-14:52 | Liya Zhu, Wenchao Wang, Yujing Chen, Xueqiang Lv and Jianshe Zhou. ABSTRACT: To enable automatic generation of sports news, in this paper we propose an extraction method to extract summary sentences from live sports text. After analyzing the characteristics of live sports text, we treat the extraction of summary sentences as a sequence tagging problem and use Conditional Random Fields (CRFs) as the extraction model. First, we expand the words correlated with keywords using word2vec. Then, we select positively correlated words, negatively correlated words, time, and the window of score changes as features to train the model and extract summary sentences. This method gets good results on the ROUGE-1, ROUGE-2 and ROUGE-SU4 evaluation indicators, showing that it has a meaningful influence on automatic summarization and automatic generation of sports news. |
14:52-14:57 | Overview of Open Domain Chinese Question Answering. Nan Duan
14:57-15:09 | Yuxuan Lai, Yang Lin, Jiahao Chen, Yansong Feng and Dongyan Zhao. ABSTRACT: Aiming at the task of open-domain question answering over a knowledge base in NLP&CC 2016, we propose an SPE (subject predicate extraction) algorithm which can automatically extract a subject-predicate pair from a simple question and translate it into a KB query. A novel method based on word-vector similarity and predicate attention is used to score candidate predicates after a simple topic entity linking step. Our approach achieved an F1-score of 82.47% on test data, which obtained first place in the contest of NLP&CC 2016 Shared Task 2 (KBQA sub-task). Furthermore, a series of experiments and a comprehensive error analysis reveal the properties and defects of the new dataset. |
15:09-15:21 | Jian Fu and Xipeng Qiu. ABSTRACT: Document-based Question Answering aims to compute the similarity or relevance between two texts: question and answer. It is a typical and core task and is considered a touchstone of natural language understanding. In this article, we present a convolutional neural network based architecture to learn feature representations of each question-answer pair and compute its match score. By taking the interaction and attention between question and answer into consideration, as well as word overlap indices, the empirical study on the Chinese Open-Domain Question Answering (DBQA) Task (document-based) demonstrates the efficacy of the proposed model, which achieves the best result on the NLPCC-ICCPOL 2016 Shared Task on DBQA. |
NLP Applications I:
Dec. 5, 2016(15:30–17:10)
No.1, 2nd Floor,Yunan Auditorium(云安会堂2楼1号),
Chair: Sujian LI
15:30-15:50 | Li Dong, Zhongqing Wang and Deyi Xiong. ABSTRACT: Stock market prediction has always been a hotspot of financial research. In the past, technical indicators were typically used to predict the future value of a stock index. In recent years, it has been found that public emotion can also reflect future changes of the stock market, and using it for prediction can bring relatively good results. In this paper, we propose a new method that uses Support Vector Regression (SVR) to build a prediction model combining text information extracted from Sina microblogs (word information, sentiment-word information and sentiment classification results) with daily technical indicators. The sentiment words are identified with a sentiment lexicon, and the public emotion is extracted from microblogs by two separate methods: the sentiment lexicon and an SVM. The experimental results show that our method indeed obtains relatively good prediction results. Keywords: stock market prediction; sentiment analysis; support vector regression; SVR |
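As a hedged sketch of the modeling setup in this abstract (feature names, dimensions and the toy target are illustrative assumptions, not the authors' features), microblog sentiment features can be concatenated with daily technical indicators and fed to scikit-learn's SVR:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_days = 100
sentiment = rng.random((n_days, 3))   # e.g. word, sentiment-word, class features
technical = rng.random((n_days, 4))   # e.g. moving average, RSI, volume, momentum
X = np.hstack([sentiment, technical])
y = rng.random(n_days)                # next-day index value (toy target)

model = SVR(kernel="rbf", C=1.0).fit(X[:-10], y[:-10])
pred = model.predict(X[-10:])         # predict the held-out last 10 days
```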
15:50-16:10 | Xueqiang Lv, Yujing Chen, Jianshe Zhou and Ning Li. ABSTRACT: With the development of natural language processing and artificial intelligence, automatic writing by computers is a major trend. By observing the characteristics of NBA news and live text broadcasts, this paper proposes a method for automatic NBA news writing based on a score-difference function. The method first constructs a score-difference function from the score gap between the two teams, and proposes a data slicing algorithm based on the properties of this function as well as a data synthesis algorithm based on it. Next, the live data slices are classified, and an NBA report template library is built according to the data categories and historical NBA news reports. Centering on the performance of teams and players, the information in the data slices is filled into the constructed templates to obtain an automatically generated NBA news article. Four metrics are proposed to measure the quality of automatic NBA news writing. Experiments show that the method is effective and feasible, writes quickly, and can provide useful assistance to sports news writers. |
16:10-16:30 | Wenhao Ying, Xinyan Xiao, Sujian Li, Yajuan Lv and Zhifang Sui. ABSTRACT: In search services, providing simple answers to users' queries can help users get information quickly. To deal with this task, this paper introduces a feature-based query-focused summarization method to extract a simple answer for a query. A convolutional neural network (CNN) is used to learn the semantic representation of a sentence and evaluate the similarity between a candidate answer sentence and a query. The neural network is then trained under a max-margin learning framework. The experiments verify that our approach to query-focused summarization generates simple versions of the answers in Baidu Knows with good quality. Substituting the CNN-based semantic similarity for the bag-of-words-based one improves the answer summaries further. |
16:30-16:50 | Chao Lv, Lili Yao, Yansong Feng and Dongyan Zhao. ABSTRACT: Collaborative filtering (CF) has been widely employed within recommender systems in many real-world situations. The basic assumption of CF is that items liked by the same user are similar and users who like the same items share similar interests. But this is not always true, since a user's interests change over time. It is more reasonable to assume that if two items are liked by the same user in the same time period, there is a strong possibility that they are similar, but the possibility shrinks if the user likes them in different time periods. In this paper, we propose a long-short interest model (LSIM) based on this new assumption to improve collaborative filtering. Specifically, we introduce a neural network based language model to extract sequential features of a user's preferences over time. We then integrate the sequential features to solve the rating prediction task in a feature-based collaborative filtering framework. Experimental results on three MovieLens datasets demonstrate that our approach achieves state-of-the-art performance. |
16:50-17:10 | Liang Wang, Sujian Li, Xinyan Xiao and Yajuan Lv. ABSTRACT: Topic segmentation plays an important role in discourse analysis and document understanding. Previous work mainly focuses on unsupervised methods for topic segmentation. In this paper, we propose to use a bidirectional long short-term memory (BLSTM) model, along with a convolutional neural network (CNN), for learning paragraph representations. Besides, we present a novel algorithm based on frequent subsequence mining to automatically discover high-quality cue phrases from documents. Experiments show that our proposed model achieves much better performance than strong baselines, and our mined cue phrases are reasonable and effective. This is also the first work that investigates topic segmentation for web documents. |
Machine Translation II:
Dec. 5, 2016(15:30–17:10)
No.2, 2nd Floor,Yunan Auditorium(云安会堂2楼2号),
Chair: Tiejun ZHAO
15:30-15:50 | Jinying Kong, Yating Yang, Xi Zhou, Lei Wang and Xiao Li. ABSTRACT: The problem of rare and unknown words is an important issue in Uyghur-Chinese machine translation, especially when using neural machine translation models. We propose a novel way to deal with rare and unknown words. Based on neural machine translation using pointers over the input sequence, our approach, which consists of a pre-process and a post-process, can be used with any neural machine translation model. The pre-process modifies the Uyghur-Chinese corpus to extend the ability of the pointer network, and the post-process retranslates the raw translation with a phrase-based machine translation model or a word list. Experiments show that a neural machine translation model using the proposed approach obtains a higher BLEU score than the phrase-based model on Uyghur-Chinese MT. |
15:50-16:10 | Nan Wang, Jin'an Xu, Fang Ming, Yufeng Chen and Yujie Zhang. ABSTRACT: The voices of different languages usually have different syntactic structures, which lowers translation quality in machine translation. To resolve this problem, this paper proposes an approach that integrates voice features into hierarchical phrase-based (HPB) models. In our method, the corpus is first classified into three categories on the Japanese side: passive voice, potential voice and others. Second, passive and potential sentences are further classified into several groups according to characteristics of the English side, and bilingual features are extracted to train maximum entropy rule classification models. Finally, the voice features are integrated into the log-linear model to improve the translation model and raise the accuracy of rule selection when the decoder translates passive and potential voice. In a Japanese-to-English translation task, large-scale experiments show that the proposed approach achieves better performance than the baseline: it improves long-distance reordering as well as the translation quality of passive and potential voice sentences. |
16:10-16:30 | Enting Gao and Xiangyu Duan. ABSTRACT: Transliteration rewrites a source-language word into the target language according to its pronunciation. Machine transliteration methods are mainly divided into traditional statistical methods and the currently popular deep neural network based methods. This paper presents a comparative study of these two approaches, using two representative systems for each. Experimental results show that the statistical methods and the deep neural network methods achieve comparable transliteration quality on evaluation metrics, but the systems produce inconsistent outputs on specific instances. We therefore apply system combination to balance the strengths of the individual systems; experimental results show that system combination significantly improves transliteration quality over single systems. |
16:30-16:50 | Lilin Zhang, Maoxi Li, Wenyan Xiao, Jianyi Wan and Mingwen Wang. ABSTRACT: Using paraphrase knowledge to enhance the matching of synonyms and near-synonyms between machine translations and human reference translations is a key problem in automatic MT evaluation. Existing methods extract paraphrases from general-domain corpora and then apply them to domain-specific MT evaluation tasks, which leads to paraphrase matching bias. To address this problem, this paper proposes extracting paraphrases related to the test domain to improve automatic MT evaluation. We first cluster the general monolingual training corpus and filter it with an improved M-L method to obtain a domain-specific training corpus, then extract a domain-specific paraphrase table from it with a Markov network model, and finally apply this paraphrase table in automatic MT evaluation to improve the matching precision of synonyms and near-synonyms. Experimental results on the WMT'14 Metrics task and WMT'15 Metrics task datasets show that paraphrases extracted with domain knowledge improve the correlation of the automatic metrics METEOR and TER with human judgments. |
16:50-17:10 | Wenhe Feng, Yi Yang, Yancui Li and Han Ren. ABSTRACT: This paper annotates the English units corresponding to Chinese clauses in Chinese-English translation and analyzes them statistically. First, based on Chinese clause segmentation, we segment the English target text into corresponding units (clauses) to obtain a Chinese-to-English clause-aligned parallel corpus. Then, we annotate the grammatical properties of the English corresponding clauses in the corpus. Finally, statistical analysis of the annotated corpus reveals the distribution of grammatical properties of English corresponding clauses: there are more clauses (1631, 74.41%) than sentences (561, 25.59%); more major clauses (1719, 78.42%) than subordinate clauses (473, 21.58%); among subordinate clauses, more adverbial clauses (392, 82.88%) than attributive clauses (81, 17.12%) and more non-defining clauses (358, 75.69%) than restrictive relative clauses (115, 24.31%); and more simple clauses (1142, 52.1%) than coordinate clauses (1050, 47.9%). |
Student Workshop:
Dec. 5, 2016(15:30–17:10)
No.4, 2nd Floor,Yunan Auditorium(云安会堂2楼4号),
Chair: Yajuan LYU
16:00-16:06 | Dongxu Zhang, Tianyi Luo and Dong Wang. ABSTRACT: Bayesian models and neural models have demonstrated their respective advantages in topic modeling. Motivated by the dark knowledge transfer approach proposed by Hinton et al., we present a novel method that combines the advantages of the two model families. In particular, we present a transfer learning method that uses LDA to supervise the training of a deep neural network (DNN), so that the DNN can approximate the LDA inference with less computation. Our experimental results show that through transfer learning, a simple DNN can approximate the topic distribution produced by LDA quite well, and delivers performance competitive with LDA on document classification, with much faster computation. |
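The transfer idea can be pictured with a minimal sketch (assumptions throughout: a single softmax layer stands in for the DNN, and toy data replaces real LDA output): the student is trained with cross-entropy against LDA's per-document topic distributions as soft targets.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def distill_step(X, P_lda, W, lr=0.1):
    """X: bag-of-words counts (docs x vocab); P_lda: LDA topic dists (docs x topics)."""
    Q = softmax(X @ W)                 # student's predicted topic distribution
    grad = X.T @ (Q - P_lda) / len(X)  # gradient of soft-target cross-entropy
    return W - lr * grad

rng = np.random.default_rng(0)
X = rng.random((32, 100))              # toy documents
P_lda = softmax(rng.random((32, 10)))  # toy "teacher" distributions from LDA
W = np.zeros((100, 10))
for _ in range(200):
    W = distill_step(X, P_lda, W)      # student now mimics the LDA teacher
```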
16:06-16:12 | Dongxu Zhang and Dong Wang. ABSTRACT: Convolutional neural networks (CNNs) have delivered competitive performance on relation classification, without tedious feature engineering. A particular shortcoming of CNNs, however, is that they are less powerful in modeling long-span relations. This paper presents a model based on recurrent neural networks (RNNs) and compares the capabilities of CNNs and RNNs on the relation classification task. We conducted a thorough comparative study on two databases: one is the popular SemEval-2010 Task 8 dataset, and the other is the KBP37 dataset we designed based on MIML-RE, with the goal of learning and testing complex relations. The experimental results strongly indicate that even with a simple RNN structure, the model can deliver much better performance than CNNs, particularly for long-span relations. |
16:12-16:18 | Shuangshuang Zhou, Jin'an Xu, Yufeng Chen and Yujie Zhang. ABSTRACT: The formation rules of microblog new words are extremely complex and highly unconstrained, and the results extracted by the traditional C/NC-value method suffer from relatively low accuracy on new-word boundaries and low detection accuracy for low-frequency new words. To solve these problems, this paper proposes a microblog new word extraction method that integrates manual heuristic rules, an improved C/NC-value algorithm and a conditional random field (CRF) model. On one hand, the heuristic rules classify and summarize microblog new words, covering their POS patterns, character types and ideographic symbols, and were manually derived by observing a large number of microblog documents. On the other hand, the improved C/NC-value method reconstructs the NC-value objective function by incorporating word frequency, branch entropy, mutual information and other statistical features, and a CRF model is used to train and detect new words, so as to improve the accuracy of new-word boundary identification and the detection of low-frequency new words. Experimental results show that, compared with traditional methods, the proposed method effectively improves the F-value of microblog new word detection. |
16:18-16:24 | Shuowang Zhang, Chunping Ouyang, Xiaohua Yang, Yongbin Liu and Zhiming Liu. ABSTRACT: Word semantic similarity computation is widely used in natural language processing, text systems and translation systems. Addressing the problem that, against the background of rapidly expanding web knowledge, the word descriptions in the HowNet data dictionary do not match people's perception of words, we propose an improved HowNet-based method for word semantic similarity computation, and introduce a search-engine-based method to revise the computed similarities. Experiments show that the correlation coefficient increases by 17% compared with the purely HowNet-based method, demonstrating that the proposed method conforms better to people's subjective perception of words. |
16:24-16:30 | Leyuan Qu, Yanlu Xie and Jinsong Zhang. ABSTRACT: With the growing demand for second-language acquisition, computer-assisted pronunciation training has become a research hotspot. Based on a binary-classification detection framework, this paper improves an insufficient-aspiration error detection system in two respects: segment boundary determination and feature selection. First, segment boundaries are obtained by automatic alignment instead of manual annotation. Second, given the segment boundaries, methods based on landmarks, automatic clustering and curve-fitting coefficients are used to optimize the selection of MFCC parameters. Experimental results show that the proposed improvements achieve higher detection accuracy while reducing the false acceptance rate. Moreover, the landmark-based method reaches the detection performance obtained with manually annotated boundaries, achieving a detection accuracy of 90.4%. |
16:30-16:36 | Bingling Zhou and Dekuan Xu. ABSTRACT: Using the main parameters of complex networks, namely clustering coefficient, diameter, radius, number of shortest paths, average path length, average number of neighbors, number of nodes and network density, this paper investigates the similarities and differences between the character co-occurrence networks of the first eighty chapters and the last forty chapters of 《红楼梦》 (Dream of the Red Chamber), and then performs cluster analysis with the main parameters to examine the authorship of the two parts. The data suggest that the first eighty chapters and the last forty chapters were very likely written by different authors. |
16:36-16:42 | Review Opinion Extraction Based on Semantic Similarity (基于语义相似度的评论观点抽取). Weikang Rui; Kai Liu
16:42-16:48 | Lei Cao, Weibin Yin, Qinyao Sun, Zhi Wang, Chongchong Yu and Daowei Li. ABSTRACT: The purpose of establishing an endangered-language spoken corpus is to preserve the endangered language in full, especially its vitality and the local culture, for study and research. The preservation of an endangered-language spoken corpus includes original voice files, International Phonetic Alphabet annotation and Chinese translation annotation. The paper takes the Lizu language as an example and systematically studies the establishment of endangered-language spoken corpora. In addition, automatic word segmentation and keyword extraction for the Lizu annotation corpus are implemented, serving as an example for the subsequent establishment of a universal endangered-language corpus. |
16:48-16:54 | Yatu Ji, Yila Su and Baoyuan Dou. ABSTRACT: Chinese-Mongolian machine translation plays a crucial role in Chinese-Mongolian information exchange. Considering the morphological characteristics of Mongolian, where affixes attach to roots and stems, and its subject-object-verb (SOV) word order, this paper builds a word tokenization model. Using the open-source Moses decoder combined with a Chinese-Mongolian reordering model, we propose a Chinese-Mongolian statistical machine translation method at the granularity of Mongolian stems and affixes. To address the agglutinative nature of Mongolian, stems and affixes are segmented to refine the granularity of Mongolian words, modeling the translation process on smaller language units and reducing the data sparseness of Mongolian words; a hidden Markov model (HMM) is also incorporated into the Mongolian language model. Experimental results on an open corpus show that stem-affix segmentation can effectively increase the alignment distance, and translation performance is significantly improved compared with a word-granularity translation model. |
Evaluation Workshop:
Dec. 5, 2016(15:30–17:10)
No.6, 2nd Floor,Yunan Auditorium(云安会堂2楼6号),
Chair: Xiaojun WAN
15:40-15:45 | Overview of Chinese Word Segmentation for Micro-blog Texts. Xipeng Qiu
15:45-15:57 | Qingrong Xia, Zhenghua Li, Jiayuan Chao and Min Zhang. ABSTRACT: This paper describes our system designed for the NLPCC 2016 shared task on word segmentation of micro-blog texts (i.e., Weibo). We treat word segmentation as a character-wise sequence labeling problem, and explore two directions to enhance our CRF-based baseline. First, we employ a large-scale external lexicon to construct extra lexicon features in the model, which proves extremely useful. Second, we exploit two heterogeneous datasets, i.e., Penn Chinese Treebank 7 (CTB7) and People's Daily (PD), to help word segmentation on Weibo. We adopt two mainstream approaches, i.e., the guide-feature based approach and the recently proposed coupled sequence labeling approach. We combine the above techniques in different ways and obtain four well-performing models. Finally, we merge the outputs of the four models and obtain the final results via Viterbi-based re-decoding. On the Weibo test data, our proposed approach outperforms the baseline by 95.63 − 94.24 = 1.39% in terms of F1 score. Our final system ranks first among the five participants in the open track in terms of F1 score, and is also the best among all 28 submissions. All code, experiment configurations, and the external lexicon are released at http://hlt.suda.edu.cn/~zhli. |
15:57-16:09 | Qianrong Zhou, Long Ma, Zhenyu Zheng, Yue Wang and Xiaojie Wang. ABSTRACT: In this paper, we present a Long Short-Term Memory (LSTM) based model for the task of Chinese Weibo word segmentation. The model adopts an LSTM layer to capture long-range dependencies in a sentence and learn the underlying patterns. In order to infer the optimal tag path, we introduce a transition score matrix for jumping between the tags of successive characters. Integrated with some unsupervised features, the performance of the model is further improved. Finally, our model achieves a weighted F1-score of 0.8044 on the close track and 0.8298 on the semi-open track. |
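Decoding with a transition score matrix, as described above, is standard Viterbi search. A hedged sketch (not the authors' code; the tag set and scores are illustrative), e.g. over BMES segmentation tags:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (seq_len, n_tags) per-character scores from the LSTM.
    transitions: (n_tags, n_tags) score of moving from tag i to tag j."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        total = score[:, None] + transitions + emissions[t]  # all (i -> j) paths
        back[t] = total.argmax(axis=0)   # best predecessor for each tag j
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]  # optimal tag sequence, e.g. over tags {B, M, E, S}

tags = viterbi(np.random.rand(6, 4), np.random.rand(4, 4))
```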
16:09-16:14 | Overview of Chinese Word Similarity Measurement. Yunfang Wu
16:14-16:26 | Shaoru Guo, Yong Guan, Ru Li and Qi Zhang. ABSTRACT: Chinese word similarity computation is a fundamental task for natural language processing. This paper presents a method to calculate the similarity between Chinese words based on a combination strategy. We apply Baidubaike to train a Word2Vector model, and then integrate different methods, a semantic dictionary-based method, a Word2Vector-based method and a Chinese FrameNet (CFN)-based method, to calculate the semantic similarity between Chinese words. The semantic dictionary-based method uses dictionaries such as HowNet, DaCilin, Tongyici Cilin (Extended) and Antonym. The experiments are performed on 500 pairs of words, and the Spearman correlation coefficient on the test data is 0.524, which shows that the proposed method is feasible and effective. |
16:26-16:38 | Jiahuan Pei, Cong Zhang, Degen Huang and Jianjun Ma. ABSTRACT: Large corpus-based embedding methods have received increasing attention for their flexibility and effectiveness in many NLP tasks including Word Similarity (WS). However, these approaches rely on high-quality corpora and neglect the human intelligence contained in semantic resources such as Tongyici Cilin and HowNet. This paper proposes a novel framework for measuring Chinese word similarity by combining word embeddings and Tongyici Cilin. We also utilize retrieval techniques to extend the contexts of word pairs and calculate similarity scores to weakly supervise the selection of a better result. In the Chinese Lexical Similarity Computation (CLSC) shared task, we ranked second with Spearman/Pearson rank correlation coefficients of 0.457/0.455. After the submission, we boosted the embedding model by merging an English model into the Chinese one and learning the co-occurrence sequence via LSTM networks. Our final results are 0.541/0.514, which outperform the state-of-the-art performance to the best of our knowledge. |
16:38-16:43 | Overview of Stance Detection in Chinese Microblogs. Ruifeng Xu
16:43-16:55 | Jiaming Xu, Suncong Zheng, Jing Shi, Yiqun Yao and Bo Xu. ABSTRACT: Stance detection is the task of automatically determining the author's favorability towards a given target. However, the target may not be explicitly mentioned in the text, and someone may even cite positive opinions to argue against the target, which makes the task more difficult. In this paper, we describe an ensemble framework which integrates various feature sets and classification methods, and does not rely on any handcrafted templates or rules for stance detection. We submitted our solution to the NLPCC 2016 shared task: Detecting Stance in Chinese Weibo (Task A), a supervised task over five targets. The official results show that the solution of our team "CBrain" achieves one 1st place and one 2nd place on these targets, and the overall ranking is 4th out of 16 teams. Our code is available at https://github.com/jacoxu/2016NLPCC_Stance_Detection. |
16:55-17:07 | Liran Liu, Shi Feng, Daling Wang and Yifei Zhang. ABSTRACT: Nowadays, more and more people are willing to express their opinions and attitudes on microblog platforms. Stance detection refers to the task of judging whether the author of a text is in favor of or against a given target. Most of the existing literature addresses debates or online conversations, which have adequate context for inferring the authors' stances. For the microblog stance detection task, however, we have to figure out the stance of the author based only on a unique, isolated microblog, which sets new obstacles for this task. In this paper, we conduct a comprehensive empirical study of microblog stance detection using supervised and semi-supervised machine learning methods. Different unbalanced-data processing strategies and classifiers, such as linear SVM, Naive Bayes and Random Forest, are compared using the NLPCC 2016 Stance Detection Evaluation Task dataset. Experimental results show that the method based on ensemble learning and SMOTE2 unbalanced processing with sentiment-word features outperforms the best submission result in the NLPCC 2016 Evaluation Task. |
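A minimal sketch of the unbalanced-data pipeline described above (assumptions: imbalanced-learn's standard SMOTE rather than the paper's SMOTE2 variant, generic toy features in place of sentiment-word features, and imblearn installed): oversample the minority stance class, then train a random-forest ensemble.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 20))           # toy microblog feature vectors
y = np.array([0] * 250 + [1] * 50)  # imbalanced FAVOR/AGAINST labels

# Synthetically oversample the minority class before fitting the ensemble.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print(clf.predict(X[:5]))
```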
Semantics:
Dec. 6, 2016(13:30–15:30)
No.1, 2nd Floor,Yunan Auditorium(云安会堂2楼1号),
Chair: Chin-Yew LIN
13:30-13:50 | Qianlong Du, Chengqing Zong and Keh-Yih Su. ABSTRACT: A novel word sense disambiguation (WSD) discriminative model is proposed in this paper to handle long-distance sense dependency and multi-reference lexicon dependency (i.e., the sense of a lexicon might depend on several other non-local lexicons under the same subtree) within the sentence. Many WSD systems only adopt local context to independently decide the sense of each lexicon in a sentence. However, the sense of a target word actually also depends on structure-related senses/lexicons that might be far away from it. Therefore, we propose a supervised approach which integrates structural context (for long-distance sense dependency and multi-reference lexicon dependency) with local context (for local dependency) to handle the problems mentioned above. As a result, the sense of each word is decided not only based on the local lexicons, but also based on various reference senses/lexicons (possibly non-local) specified by all its associated syntactic subtrees. Experimental results show that the proposed approach significantly outperforms other state-of-the-art WSD systems. |
13:50-14:10 | Liwei Chen, Yansong Feng and Dongyan Zhao. ABSTRACT: Word sense representation is important in information retrieval (IR) tasks. Existing lexical databases, e.g., WordNet, and automated word sense representation approaches often use only one view to represent a word, and may not work well in tasks which are sensitive to contexts, e.g., query rewriting. In this paper, we propose a new framework to represent a word sense simultaneously in two views, an explanation view and a context view. We further propose a novel method to automatically learn such representations from large-scale query logs. Experimental results show that our new sense representations can better handle word substitutions in a query rewriting task. |
14:10-14:30 | An Yang and Sujian Li. ABSTRACT: For domain-specific texts, word sense disambiguation (WSD) may help computers comprehend and mine domain-related knowledge. However, the lack of annotated corpora makes the task much more difficult. This paper presents an effective domain-specific WSD method which uses domain keywords and a word vector model built from unlabeled data. |
14:30-14:50 | Xuanyi Liao and Guang Chen. ABSTRACT: This paper presents an approach to analyzing change in word meaning based on word embeddings, a more general way to quantify words than earlier methods. By analyzing similar words and clustering them in different periods, semantic change can be detected. We analyze the trend of semantic change with the density clustering method DBSCAN. Statistics and data visualization are also included to make the results clearer. Words like 'gay' and 'mouse' are traced as cases to show that this approach works. We also compare context words and similar words for semantic representation. |
14:50-15:10 | Qi Li, Tianshi Li and Baobao Chang. ABSTRACT: Word embeddings play a significant role in many modern NLP systems. Since learning one representation per word is problematic for polysemous and homonymous words, researchers have proposed to use one embedding per word sense. These approaches mainly train word sense embeddings on a corpus. In this paper, we propose to use word sense definitions to learn one embedding per word sense. Experimental results on word similarity tasks and a word sense disambiguation task show that word sense embeddings produced by our approach are of high quality. |
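One simple way to realize "one embedding per sense from definitions" (an illustrative sketch under our own assumptions, not necessarily the authors' model; the word vectors here are random stand-ins) is to average the vectors of the content words in each sense's dictionary definition:

```python
import numpy as np

def sense_embedding(definition_tokens, word_vecs, dim=50):
    """Average the vectors of known definition words; zero vector if none known."""
    vecs = [word_vecs[w] for w in definition_tokens if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

word_vecs = {w: np.random.rand(50) for w in ["small", "rodent", "pointing", "device"]}
mouse_animal = sense_embedding(["small", "rodent"], word_vecs)
mouse_device = sense_embedding(["pointing", "device"], word_vecs)
# Cosine similarity between a context vector and each sense vector then picks
# the intended sense for disambiguation.
```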
Social Media:
Dec. 6, 2016(13:30–15:30)
No.2, 2nd Floor,Yunan Auditorium(云安会堂2楼2号),
Chair: Daling WANG
13:30-13:50 | Yang Liu, Xuan Chen, Sujian Li and Liang Wang. ABSTRACT: On the Twitter platform, an effective followee recommendation system helps connect users in a satisfactory manner. Topological relations and tweet content are the two main factors considered in a followee recommendation system. However, how to combine these two kinds of information in a uniform framework is still an open problem. In this paper, we propose to combine deep learning techniques and collaborative information to explore the user representations latent behind topology and content. Over the two kinds of user representations (i.e., topology representation and content representation), we design an adaptive layer to dynamically leverage the contributions of topology and content to followee recommendation. Experiments on a real-world Twitter dataset show that our proposed model provides more satisfying recommendation results than state-of-the-art methods. |
13:50-14:10 | Jie Jiang and Rui Xia. ABSTRACT: Microblog sentiment analysis has been a hot research topic in natural language processing in recent years. Mainstream text sentiment analysis approaches fall into rule-based methods and machine learning methods. On the one hand, each of these two families has its own shortcomings, and a fusion algorithm is needed to combine them effectively; on the other hand, there is little work on semantic rule methods for Chinese microblogs and on fusing them with machine learning. In view of this, this paper proposes a microblog sentiment classification method that fuses machine learning with semantic rules: the diverse sentiment information obtained by the semantic rule method is transformed, expanded and embedded into the feature space of the machine learning model, and classifier ensembling effectively improves the performance of microblog sentiment classification. |
14:10-14:30 | Liang Wang, Qi Li, Xuan Chen and Sujian Li. ABSTRACT: The demographic attributes gender and age play an important role in social media applications. Previous studies on gender and age prediction mostly explore efficient features, which is labor intensive. In this paper, we propose to use a multi-task convolutional neural network (MTCNN) model to predict gender and age simultaneously on Chinese microblogs. With MTCNN, we can effectively reduce the burden of feature engineering and explore common and unique representations for both tasks. Experimental results show that our method significantly outperforms state-of-the-art baselines. |
14:30-14:50 | Chuanming Yu, Bolin Feng, Yuheng Zuo, Baiyun Chen and Lu An. ABSTRACT: In the era of online shopping, user reviews have become an important factor influencing consumers' purchase decisions. Some unscrupulous merchants and individuals fabricate fake reviews to maximize their profits, misleading consumers' purchase decisions; this makes identifying fake product reviews a highly valuable research area. Unlike traditional methods that study review text content, this paper starts from the combination of the content and behavioral features of review stakeholders and proposes an Individual-Group-Merchant Relation Model (IGMRM). To verify the model's effectiveness, 97,804 reviews from 9,558 distinct IPs across 93 shops were selected as sample data. Experimental results show that IGMRM achieves F-values of 82.62%, 59.26% and 95.12% in identifying fake reviewers, shops with reputation manipulation, and fake reviewer groups, respectively, outperforming traditional methods on these tasks. |
14:50-15:10 | Beibei Gu, Zhunchen Luo and Xin Wang. ABSTRACT: Twitter is an important source of information for users because of its giant user base and rapid information diffusion, which also make it hard to track topics in oceans of tweets. This situation motivates the task of finding information feeders, a finer-grained user group than domain experts. Information feeders are topic tracers who share interest in a certain topic and provide related and follow-up information. In this study, we explore a wide range of features to find Twitter users who will tweet more about a topic after a given time point, within a machine learning framework. The features are mainly extracted from a user's history tweets, since we believe a user's tweeting decisions depend mostly on their past activities. We consider four feature families: activeness, timeliness, interaction and user profile. Our results show that activeness in a user's history data is most useful. Besides that, we conclude that people who have social influence and respond quickly to a topic are more likely to post more topic-related tweets. |
Discourse:
Dec. 6, 2016(13:30–15:30)
No.4, 2nd Floor,Yunan Auditorium(云安会堂2楼4号),
Chair: Nianwen XUE
13:30-13:50 | Haoran Li, Jiajun Zhang, Yu Zhou and Chengqing Zong. ABSTRACT: Discourse relations between two text segments play an important role in many natural language processing (NLP) tasks. Connectives strongly indicate the sense of discourse relations, while in fact there are no connectives in a large proportion of discourse relations, i.e., implicit discourse relations. The key to implicit relation prediction is to correctly model the semantics of the two discourse arguments as well as the contextual interaction between them. To achieve this goal, we propose a multi-view framework that consists of two hierarchies. The first is the model hierarchy, where we propose a neural network based method considering different views. The second is the feature hierarchy, where we learn multi-level distributed representations. We have conducted experiments on the standard benchmark dataset, and the results show that our proposed method achieves the best performance in most cases compared with several methods. |
13:50-14:10 | Fang Kong, Hongling Wang and Guodong Zhou. ABSTRACT: Discourse parsing is a challenging task and plays a critical role in discourse analysis. Since the release of the Rhetorical Structure Theory Discourse Treebank (RST-DT) and the Penn Discourse Treebank (PDTB), research on English discourse parsing has attracted increasing attention and achieved considerable success in recent years. At the same time, some preliminary research on certain subtasks of discourse parsing for other languages, such as Chinese, has been conducted. In this paper, the Connective-driven Dependency Treebank (CDTB) corpus is introduced. Then an end-to-end Chinese discourse parser that parses free texts into the Connective-driven Dependency Tree (CDT) style is presented. The parser consists of multiple components, including an elementary discourse unit detector, a discourse relation recognizer, a discourse parse tree generator and an attribution labeler. In particular, the attribution labeler determines two attributions (sense and centering) for every non-terminal node in the discourse parse trees. Effective feature sets are proposed for each component. Comprehensive experiments are conducted on the Connective-driven Dependency Treebank (CDTB) corpus, with an overall F1 score of 20.0%. |
14:10-14:30 | Yanyan Jia, Yansong Feng, Bingfeng Luo, Yuan Ye, Tianyang Liu and Dongyan Zhao. ABSTRACT: Discourse parsing aims to identify the relationships between different discourse units, where most previous work focuses on recovering the constituency structure among discourse units with carefully designed features. In this paper, we propose to exploit Long Short-Term Memory (LSTM) networks to properly represent discourse units, while using as little feature engineering as possible. Our transition-based parsing model features a multilayer stack LSTM framework to discover the dependency structures among different units. Experiments on the RST Discourse Treebank show that our model can outperform traditional feature-based systems in terms of dependency structures, without complicated feature design. When evaluated on discourse constituency, our parser also achieves promising performance compared to state-of-the-art constituency discourse parsers. |
14:30-14:50 | Ziyi Yang, Zhengxian Gong, Fandong Kong and Guodong Zhou. ABSTRACT: In this paper, a bilingual approach based on a comparable corpus is proposed to better detect and resolve Chinese zero pronouns. Building on previous work, the concept of an English equivalent sentence is first defined. The equivalent sentence is then employed to redefine the distance between sentences and to extract bilingual word alignment features. In this way, both the zero pronoun detection and the resolution of the baseline system are improved from a bilingual perspective. Experiments conducted on the OntoNotes 5.0 corpus show that our proposed approach significantly outperforms the state-of-the-art system. |
14:50-15:10 | Xiaohan She, Ping Jian, Pengcheng Zhang and Heyan Huang. ABSTRACT: This paper presents a mutual learning method using hierarchical deep semantics for the classification of implicit discourse relations in English. In the absence of explicit discourse markers, traditional discourse techniques mainly concentrate on discrete linguistic features in this task, which often leads to a data sparseness problem. To relieve this problem, we propose a mutual learning neural model which makes use of multilevel semantic information, including the distribution of implicit discourse relations, the semantics of arguments and the co-occurrence of words. During training, the predicted target of the model, i.e., the probability of the discourse relation type, and the distributed representations of the semantic components are learnt jointly and optimized mutually. The results of both binary and multiclass identification show that this method outperforms previous work, since the mutual learning strategy can efficiently distinguish the Expansion type from the others. |
ICCPOL Panel:
Dec. 6, 2016(13:30–15:30)
No.6, 2nd Floor,Yunan Auditorium(云安会堂2楼6号),
Chair: Bill WANG
13:30-15:30 |
Knowledge Acquisition:
Dec. 6, 2016(15:30–16:50)
No.1, 2nd Floor,Yunan Auditorium(云安会堂2楼1号),
Chair: Guilin QI
15:30-15:50 | Chuanhai Dong, Jiajun Zhang, Chengqing Zong, Masanori Hattori and Hui Di. ABSTRACT: State-of-the-art systems for Chinese Named Entity Recognition (CNER) require large amounts of hand-crafted features and domain-specific knowledge to achieve high performance. In this paper, we apply a bidirectional LSTM-CRF neural network that utilizes both character-level and radical-level representations. We are the first to use a character-based BLSTM-CRF neural architecture for CNER. By contrasting the results of different variants of LSTM blocks, we find the most suitable LSTM block for CNER. We are also the first to investigate Chinese radical-level representations in a BLSTM-CRF architecture, and obtain better performance without carefully designed features. We evaluate our system on the third SIGHAN Bakeoff MSRA dataset for the simplified Chinese NER task and achieve state-of-the-art performance of 90.95% F1. |
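A hedged PyTorch skeleton of the character-plus-radical architecture described above (an illustrative sketch, not the authors' implementation; for brevity the CRF layer is omitted and the model only produces the per-character emission scores that a CRF would decode):

```python
import torch
import torch.nn as nn

class CharRadicalBLSTM(nn.Module):
    def __init__(self, n_chars, n_radicals, n_tags, emb=64, hidden=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)     # character-level lookup
        self.rad_emb = nn.Embedding(n_radicals, emb)   # radical-level lookup
        self.blstm = nn.LSTM(2 * emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)       # emission score per tag

    def forward(self, chars, radicals):
        # Concatenate character and radical embeddings per position.
        x = torch.cat([self.char_emb(chars), self.rad_emb(radicals)], dim=-1)
        h, _ = self.blstm(x)
        return self.out(h)  # (batch, seq_len, n_tags); a CRF would decode these

model = CharRadicalBLSTM(n_chars=5000, n_radicals=300, n_tags=9)
chars = torch.randint(0, 5000, (2, 10))    # toy batch of 2 ten-character sentences
radicals = torch.randint(0, 300, (2, 10))  # radical id for each character
scores = model(chars, radicals)
```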
15:50-16:10 | Haihua Xie, Xiaoqing Lu, Zhi Tang and Xiaojun Huang. ABSTRACT: Entity mixture in a knowledge base refers to the situation where some attributes of an entity are mistaken for another entity's, and it often occurs among homonymous entities which share the same value of the attribute "Name". Eliminating entity mixture is critical to ensure data accuracy and validity for knowledge-based services. However, current research on entity disambiguation mainly focuses on determining the identity of entities mentioned in text during information extraction for building a knowledge base, while little work has been done on verifying the information in an already built knowledge base. In this paper, we propose a generic method to detect mixed homonymous entities in a knowledge base using hierarchical clustering. The principle of our methodology for differentiating entities is to detect the inconsistency of their attributes based on an analysis of the appearance distribution of their attribute values in documents of a common corpus. An experiment on a dataset of industry applications demonstrates the workflow of performing the clustering and detecting mixed entities in a knowledge base using our methodology. |
16:10-16:30 | Bingfeng Luo, Yansong Feng, Zheng Wang and Dongyan Zhao ABSTRACT: In this paper, we deal with the task of extracting first-order temporal facts from free text. This task is a subtask of relation extraction, and it aims at extracting relations between entities and time. Currently, the field of relation extraction mainly focuses on extracting relations between entities. However, we observe that the multi-granular nature of time expressions can help us divide a dataset constructed by distant supervision into reliable and less reliable subsets, which helps improve the extraction of relations between entity and time. We accordingly contribute the first dataset for the first-order temporal fact extraction task built with distant supervision. To fully utilize both the reliable and the less reliable data, we propose curriculum learning to rearrange the training procedure, label dropout to make the model more conservative about less reliable data, and instance attention to help the model distinguish important instances from unimportant ones. Experiments show that these methods help the model outperform both a model trained purely on the reliable subset and a model trained on all subsets mixed together.
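One plausible reading of the label dropout idea is sketched below: during training, supervision from the less reliable subset is randomly masked so the model is updated less aggressively on noisy labels. The function name, tuple layout and drop rate are our own illustrative assumptions, not the paper's specification.

    # Sketch of label dropout over a distantly supervised batch.
    import random

    def label_dropout(batch, drop_rate=0.3):
        """batch: list of (features, label, reliable_flag) tuples."""
        kept = []
        for features, label, reliable in batch:
            if not reliable and random.random() < drop_rate:
                continue  # skip the noisy supervision signal this step
            kept.append((features, label))
        return kept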
16:30-16:50 | Tingming Lu, Man Zhu and Zhiqiang Gao ABSTRACT: Annotated named entity corpora play a significant role in many natural language processing applications. However, annotation by humans is time-consuming and costly. In this paper, we propose a high-recall pre-annotator that combines multiple existing named entity taggers based on ensemble learning, to reduce the number of annotations that humans have to add. In addition, annotations are categorized into normal annotations and candidate annotations based on their estimated confidence, to reduce the number of human corrective actions as well as the total annotation time. The experimental results show that our approach outperforms the baseline methods in reducing annotation time without loss in annotation performance (in terms of F-measure).
QA and Events:
Dec. 6, 2016(15:30–16:50)
No.2, 2nd Floor, Yunan Auditorium (云安会堂2楼2号),
Chair: Muyun YANG
15:30-15:50 | Ying Zeng, Honghui Yang, Yansong Feng, Zheng Wang and Dongyan Zhao ABSTRACT: Chinese event extraction is a challenging task in information extraction. Previous approaches depend heavily on sophisticated feature engineering and complicated natural language processing (NLP) tools. In this paper, we first identify the language-specific issues in Chinese event extraction, and then propose a convolutional bidirectional LSTM neural network that combines LSTM and CNN to capture both sentence-level and lexical information without any hand-crafted features. Experiments on the ACE 2005 dataset show that our approach achieves competitive performance in both trigger labeling and argument role labeling.
15:50-16:10 | Zhengkuan Zhang, Weiran Xu and Qianqian Chen ABSTRACT: Traditional approaches to ACE event extraction are either joint models with elaborately designed features, which may suffer from generalization and data-sparsity problems, or word-embedding models based on a two-stage, multi-class classification architecture, which suffer from error propagation since event triggers and arguments are predicted in isolation. This paper proposes a novel event extraction method that not only extracts triggers and arguments simultaneously, but also adopts a framework based on convolutional neural networks (CNNs) to extract features automatically. However, plain CNNs only capture contiguous sentence-level features, so we propose skip-window convolutional neural networks (S-CNNs) to extract global structured features, which effectively capture the global dependencies of every token in the sentence. The experimental results show that our approach outperforms other state-of-the-art methods.
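The abstract does not spell out the skip-window operator; one plausible reading, sketched below, convolves over token windows that skip a fixed number of intermediate tokens before max-pooling. PyTorch is assumed, and all names, dimensions and the skip size are our own placeholders.

    # Sketch of a skip-window convolution over non-contiguous k-grams.
    import torch
    import torch.nn as nn

    class SkipWindowConv(nn.Module):
        def __init__(self, dim=100, k=3, skip=2, n_filters=64):
            super().__init__()
            self.k, self.skip = k, skip
            self.proj = nn.Linear(k * dim, n_filters)

        def forward(self, x):                      # x: (batch, seq_len, dim)
            b, n, d = x.size()
            feats = []
            for start in range(n - (self.k - 1) * (self.skip + 1)):
                # Indices of a k-gram whose tokens are `skip` apart.
                idx = [start + j * (self.skip + 1) for j in range(self.k)]
                window = x[:, idx, :].reshape(b, -1)
                feats.append(torch.relu(self.proj(window)))
            # Max-pool the filter responses over all skip-windows.
            return torch.stack(feats, dim=1).max(dim=1).values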
16:10-16:30 | Zhao Yan, Nan Duan, Ming Zhou, Zhoujun Li and Jianshe Zhou ABSTRACT: We present an open-domain topic prediction model for the answer selection task. Different from previous unsupervised topic modeling methods, we automatically extract high quality and large scale
16:30-16:50 | Zhiwen Xie, Zhao Zeng, Guangyou Zhou and Tingting He ABSTRACT: This paper focuses on the task of knowledge-based question answering (KBQA). KBQA aims to match questions with the structured semantics in a knowledge base. In this paper, we propose a two-stage method. First, we propose a topic entity extraction model (TEEM) to extract topic entities in questions, which does not rely on hand-crafted features or linguistic tools. We extract topic entities in questions with the TEEM and then search the knowledge base for the triples related to the topic entities, which serve as the candidate knowledge triples. Then, we apply Deep Structured Semantic Models based on a convolutional neural network and a bidirectional long short-term memory network to match questions with the predicates in the candidate knowledge triples. To obtain a better training dataset, we use an iterative approach to retrieve the knowledge triples from the knowledge base. The evaluation result shows that our system achieves an Average F1 measure of 79.57% on the test dataset.
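A hedged sketch of the second, matching stage: the question and a candidate predicate are encoded by a shared encoder and ranked by cosine similarity, in the spirit of Deep Structured Semantic Models. The BiLSTM encoder, mean pooling and all sizes are our own assumptions, not the paper's exact configuration.

    # Sketch: shared-encoder question/predicate matching with cosine scoring.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Matcher(nn.Module):
        def __init__(self, vocab, dim=100, hidden=128):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.enc = nn.LSTM(dim, hidden // 2, bidirectional=True,
                               batch_first=True)

        def encode(self, ids):                    # ids: (batch, seq_len)
            h, _ = self.enc(self.emb(ids))
            return h.mean(dim=1)                  # mean-pooled sentence vector

        def score(self, question_ids, predicate_ids):
            q = self.encode(question_ids)
            p = self.encode(predicate_ids)
            return F.cosine_similarity(q, p, dim=-1)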
NLP Applications II:
Dec. 6, 2016(15:30–16:50)
No.4, 2nd Floor, Yunan Auditorium (云安会堂2楼4号),
Chair: Zhi TANG
15:30-15:50 | Boli Wang, Xiaodong Shi and Jinsong Su ABSTRACT: Most ancient Chinese texts are not segmented into sentences. Automatic sentence segmentation using natural language processing techniques not only lowers the difficulty of reading classical Chinese for modern readers, but is also a necessary preprocessing step for research such as word segmentation of ancient texts. This paper proposes an automatic sentence segmentation method for classical Chinese based on recurrent neural networks. The method uses a GRU-based bidirectional recurrent neural network for segmentation, and further introduces a decoding algorithm over the output probabilities of the network to improve segmentation accuracy. Experimental results show that the method achieves a higher sentence segmentation F1 score than traditional methods.
15:50-16:10 | Jianguo Xiao and Qingsheng Li ABSTRACT: This paper proposes a glyph generation model based on the structure and style of Chinese characters. The model abstracts a Chinese glyph into two patterns, character structure and character style, and within the structure abstracts strokes into continuous stroke elements; stroke style is reconstructed by building stroke-element vectors, radius vectors, chord vectors and yoke vectors from the feature points of the stroke elements, dynamically generating glyphs usable for personalized TrueType Chinese font design. The model enables web storage of Chinese glyphs and client-side output of characteristic glyphs, and overcomes the shortcomings of glyph design for modern Chinese caused by the huge number of characters. It provides an effective strategy and method for cloud storage of personalized Chinese character information and for cloud glyph services, and lays the foundation for designing deeper Chinese character information services.
16:10-16:30 | Leyuan Qu, Yanlu Xie and Jinsong Zhang ABSTRACT: To improve the detection of pronunciation error tendencies (PETs) in computer-assisted pronunciation training (CAPT) systems and to ensure the accuracy and effectiveness of feedback, we propose a method based on log-likelihood-ratio articulatory features for PET detection; accurate PET detection also provides learners with corrective information about the place and manner of articulation. The experiment consists of two steps: (a) multiple DNN-based articulatory feature extractors generate frame-level log-likelihood ratios; (b) the articulatory features composed of these log-likelihood ratios are used for PET detection. The results show that articulatory features outperform common acoustic features (MFCC, PLP and fBank) for PET detection, and that combining articulatory features with MFCC features further improves performance, reaching a false acceptance rate of 5.0%, a false rejection rate of 30.8%, and a diagnostic accuracy of 89.8%.
Open Fund Workshop:
Dec. 6, 2016(15:30–17:00)
No.6, 2nd Floor, Yunan Auditorium (云安会堂2楼6号),
Chair: Dongyan ZHAO
15:30-16:00 |
16:00-16:30 | Research and Implementation of Chinese-Tibetan Statistical Machine Translation Based on Bilingual Correspondence Recursive Autoencoders (基于双语对应递归自编码器的汉藏统计机器翻译研究与实现), Jinsong Su
16:30-17:00 |
Poster/Demo Presentations and Banquet:
Dec. 5, 2016(17:30–21:10)
Yunan International Conference Center(云安国际会议厅,2号厅),
17:30-19:00 | Wuying Liu and Lin Wang ABSTRACT: Limited machine translation (LMT) is an unliterate automatic translation based on a bilingual dictionary and a sentence bank. This paper addresses the Japanese-Chinese LMT problem, proposes two syntactic hypotheses about Japanese texts, and designs a fast-syntax-matching-based Japanese-Chinese (FSMJC) LMT algorithm. The fast syntax matching function, a modified version of the Levenshtein function, predicts an approximate similarity of syntactic patterns between two Japanese sentences by a straightforward calculation over their formal occurrences. The experimental results show that the FSMJC LMT algorithm obtains desirable effects with greatly reduced time costs, and prove that our two syntactic hypotheses are effective for Japanese texts.
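To make the Levenshtein-based idea concrete, the sketch below computes an edit distance over sequences of formal tokens and turns it into a syntactic similarity. The token classes (POS placeholders with particles kept) are our illustrative assumptions; the paper's actual modification of the Levenshtein function may differ.

    # Sketch: Levenshtein-style similarity over syntactic token patterns.
    def edit_distance(a, b):
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
        return d[m][n]

    def syntax_similarity(pattern_a, pattern_b):
        dist = edit_distance(pattern_a, pattern_b)
        return 1.0 - dist / max(len(pattern_a), len(pattern_b), 1)

    # Formal occurrences of two Japanese sentences: particles kept,
    # content words mapped to placeholders (toy example).
    print(syntax_similarity(["N", "は", "N", "を", "V"],
                            ["N", "は", "N", "に", "V"]))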
17:30-19:00 | Junjie Li, Haitong Yang and Chengqing Zong ABSTRACT: Social media texts pose a great challenge to sentiment classification. Existing classification methods focus on exploiting sophisticated features or incorporating user interactions, such as following and retweeting. Nevertheless, these methods ignore user attributes such as age, gender and location, which our analysis shows to be very important priors in determining sentiment polarity. In this paper, we propose two algorithms to make full use of user attributes: 1) incorporating them as simple features, and 2) designing a graph-based method to model the relationship between tweets posted by users with similar attributes. Extensive experiments on seven movie datasets from Sina Weibo show the superior performance of our methods in handling these short and informal texts.
17:30-19:00 | Guoping Huang, Jiajun Zhang, Yu Zhou and Chengqing Zong ABSTRACT: Post-editing is the most popular approach to improving the accuracy and speed of human translators with machine translation (MT) technology. In the post-editing scenario, human translators generate translations by correcting MT outputs. To avoid repeating the same MT errors, we propose an efficient framework that updates the MT system in real time by learning from user feedback. The framework includes: (1) an anchor-based word alignment model, specially designed to obtain correct alignments for unknown words and for new translations of known words, which extracts the latest translation knowledge from user feedback; and (2) an online translation model based on random forests (RFs), which updates translation knowledge in real time for later predictions and adapts well to temporal noise and context changes. Extensive experiments demonstrate that our framework significantly improves translation quality as the number of feedback sentences increases, and that the resulting translation quality is comparable to that of an off-line baseline system trained with all the data.
17:30-19:00 | Haoran Li, Jiajun Zhang, Yu Zhou and Chengqing Zong ABSTRACT: Multilingual multi-document summarization is the task of generating a summary in the target language from a collection of documents in multiple source languages. A straightforward approach is to automatically translate the non-target-language documents into the target language and then apply monolingual summarization methods, but the summaries generated this way are often poorly readable due to the low quality of machine translation. To solve this problem, we propose a novel graph model based on a guided edge weighting method in which both the informativeness and the readability of summaries are fully taken into consideration. Our model attempts to choose from the target-language documents the sentences that contain important information shared across languages, while also retaining salient sentences that are not covered by documents in the other languages. Experimental results on our manually labeled dataset show that our method significantly outperforms the baseline methods.
17:30-19:00 | Meijia Wang, Peng Zhang, Dawei Song and Jun Wang ABSTRACT: In Information Retrieval (IR), evaluation metrics play an important role. Recently, some risk measures have been proposed to evaluate the downside performance or the performance variance of an assumedly advanced IR method in comparison with a baseline method. In this paper, we propose a novel risk metric by applying Value at Risk theory (VaR, which has been widely used in financial investment) to IR risk evaluation. The proposed metric (VaR IR) is implemented on top of typical IR effectiveness metrics (e.g., AP), used to evaluate the participating systems submitted to the Session Tracks, and compared with other risk metrics. The empirical evaluation shows that VaR IR is complementary to, and can be integrated with, the effectiveness metrics to provide a more comprehensive evaluation method.
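As a hedged illustration of a VaR-style risk metric over per-query effectiveness: the alpha-quantile of the per-query AP change (system minus baseline) summarizes the downside tail. The function name, the alpha level and the toy numbers are our own assumptions, not the paper's exact definition.

    # Sketch: Value-at-Risk of per-query AP differences (NumPy assumed).
    import numpy as np

    def var_ir(system_ap, baseline_ap, alpha=0.05):
        """alpha-quantile of the per-query AP change: the downside tail."""
        delta = np.asarray(system_ap) - np.asarray(baseline_ap)
        return np.quantile(delta, alpha)

    system = [0.41, 0.55, 0.12, 0.60, 0.33]   # per-query AP (toy)
    base = [0.40, 0.50, 0.30, 0.45, 0.35]     # baseline per-query AP (toy)
    print(var_ir(system, base))               # negative value flags risk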
17:30-19:00 | Jun Ma and Yujie Zhang ABSTRACT: The conventional "pivot-based" approach to acquiring paraphrases from a bilingual corpus has limitations, as only paraphrases within two steps are considered. We propose a graph-based model for acquiring paraphrases from a phrase translation table. This paper describes how to construct the graph model from the phrase translation table, a random walk algorithm over N steps, and a confidence metric for ranking the obtained results. Furthermore, we extend the model to integrate more language pairs, for instance exploiting an English-Japanese phrase translation table to find more potential Chinese paraphrases. We performed experiments on the NTCIR Chinese-English and English-Japanese bilingual corpora and compared with the conventional method. The experimental results show that the proposed model acquires more paraphrases, and performs even better after the English-Japanese phrase translation table is added to the graph model.
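A hedged sketch of the random-walk idea: nodes are phrases, edges come from the translation table, and phrases on the Chinese side reached after an even number of steps are paraphrase candidates, with visit frequency as a rough confidence proxy. The toy graph, step count and trial count are illustrative assumptions.

    # Sketch: N-step random walk over a bipartite phrase translation graph.
    import random

    graph = {  # translation-table links as an undirected graph (toy)
        "解决问题": ["solve the problem", "address the issue"],
        "处理问题": ["address the issue", "handle the problem"],
        "solve the problem": ["解决问题"],
        "address the issue": ["解决问题", "处理问题"],
        "handle the problem": ["处理问题"],
    }

    def random_walk(start, steps=4, trials=2000):
        counts = {}
        for _ in range(trials):
            node = start
            for _ in range(steps):
                node = random.choice(graph[node])
            counts[node] = counts.get(node, 0) + 1
        return counts  # visit frequency as a confidence proxy

    print(random_walk("解决问题"))  # "处理问题" should surface as a paraphrase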
17:30-19:00 | Xue-feng Xi and Guodong Zhou ABSTRACT: Coreference resolution is a major task in natural language processing. Although the mention-pair model is one of the most influential learning-based coreference models, it is hard to improve its performance any further because of its inherent defects. From the perspective of discourse analysis, this paper proposes a micro-topic model based on the theme-rheme structure for coreference resolution. Compared with traditional mention recognition in text space, this model reduces the problem space and complexity. The effectiveness of the model is evaluated in preliminary experiments on the CoNLL-2012 shared task datasets and on a discourse topic corpus (DTC) tagged by us.
17:30-19:00 | Dandan Wang, Jin'an Xu, Yufeng Chen, Yujie Zhang and Xiaohui Yang ABSTRACT: Traditional example-based machine translation (EBMT) methods are usually built on syntactic analyses of both the source and target languages, and their main problems are the high complexity and cost of system construction. To address these problems, this paper proposes a Chinese-English dependency-tree-to-string EBMT method. Compared with traditional methods, our approach only needs to parse the source language, which greatly reduces the complexity and cost of building the system. To improve translation quality, we are the first in this setting to incorporate a joint model of Chinese word segmentation, POS tagging and dependency parsing, aiming to reduce error propagation across the source-side preprocessing tasks and to improve the accuracy of features extracted at different levels. On this basis, combining the characteristics of dependency structures and of Chinese-English corpora, we perform rule extraction and generalization for the dependency-tree-to-string model. Experimental results show that, compared with the baseline systems, the proposed method improves the quality of extracted example pairs, yields better generalized rules, effectively improves translation quality, and enhances overall system performance.
17:30-19:00 | Meishan Zhang, Nan Yu and Guohong Fu ABSTRACT: Discrete and neural models are the two mainstream methods for Chinese POS tagging nowadays, and both have achieved state-of-the-art performance. In this paper, we compare the two kinds of models empirically and further investigate methods of combining them. In particular, since pre-trained word embeddings are exploited under the neural setting, neural models can be regarded as semi-supervised. To make a fairer comparison between the discrete and the neural models, we incorporate word clusters into both models as well as into their combination, since it is generally accepted that word clusters can encode information similar to pre-trained word embeddings.
17:30-19:00 | Wei Li, Yunfang Wu and Xueqiang Lv ABSTRACT: Using low-dimensional vector spaces to represent words has been very effective in many NLP tasks. However, it does not work well for rare and unseen words. In this paper, we propose to leverage the knowledge in a semantic dictionary, in combination with some morphological characteristics, to build an enhanced vector space. We obtain an improvement of 2.3% over the state-of-the-art rule-based HeidelTime system in temporal expression extraction, and large gains over word2vec in other NER tasks. HowNet alone also shows promising results in computing lexical similarity.
17:30-19:00 | Jing Wu, Hongxu Hou, Zhipeng Shen, Jian Du and Jinting Li ABSTRACT: Neural machine translation (NMT) has shown very promising results for resourceful language pairs like En-Fr and En-De. The success partly relies on the availability of large-scale, high-quality parallel corpora. We investigate how to adapt NMT to very low-resource Mongolian-Chinese machine translation by introducing an attention mechanism, sub-word translation, monolingual data and an NMT correction model. We propose a sub-word model to address the out-of-vocabulary (OOV) problem in the attention-based NMT model, while monolingual data helps alleviate the low-resource problem. In addition, we explore a Chinese NMT correction model to enhance translation performance. The experiments show that the adapted Mongolian-Chinese attention-based NMT system obtains an improvement of 1.70 BLEU points over a phrase-based statistical machine translation baseline and 3.86 BLEU points over a standard NMT baseline on an open training set.
17:30-19:00 | Meng Yang, Peifeng Li and Qiaoming Zhu ABSTRACT: Most previous approaches use various kinds of plain similarity features to represent the similarity of a sentence pair, and one of their limitations is weak representation ability. This paper introduces relational structure representations (shallow syntactic trees, dependency trees) to compute sentence similarity. Experimental results show that our approach achieves higher performance than approaches using only plain features.
17:30-19:00 | Zhizhuo Yang, Hu Zhang, Qian Chen and Hongye Tan ABSTRACT: Word Sense Disambiguation (WSD) is one of the key issues in natural language processing. Currently, supervised WSD methods are effective ways to solve the ambiguity problem, but due to the lack of large-scale training data they cannot achieve satisfactory results. In this paper, we present a WSD method based on context translation. The method rests on the assumption that translations under the same context express similar meanings. It treats context words arising from translation as pseudo training data, and then derives the meaning of ambiguous words by utilizing the knowledge from both the training and the pseudo training data. Experimental results show that the proposed method improves traditional WSD accuracy by 3.17% and outperforms the best participating system in the SemEval-2007 Task #5 evaluation.
17:30-19:00 | Dongxu Zhang, Tianyi Luo and Dong Wang ABSTRACT: Bayesian models and neural models have demonstrated their respective advantages in topic modeling. Motivated by the dark knowledge transfer approach proposed by G. Hinton et al., we present a novel method that combines the advantages of the two model families. In particular, we present a transfer learning method that uses LDA to supervise the training of a deep neural network (DNN), so that the DNN can approximate LDA inference with less computation. Our experimental results show that, through transfer learning, a simple DNN can approximate the topic distribution produced by LDA quite well, and delivers performance on document classification competitive with LDA at much faster speed.
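A hedged sketch of the transfer step: a small DNN is trained to match the topic distributions LDA infers for documents (soft targets), so at test time a single forward pass approximates LDA inference. PyTorch is assumed; the sizes, the KL loss choice and the random toy data are our own placeholders.

    # Sketch: train a DNN against LDA topic distributions as soft targets.
    import torch
    import torch.nn as nn

    vocab, n_topics = 5000, 50
    dnn = nn.Sequential(nn.Linear(vocab, 256), nn.ReLU(),
                        nn.Linear(256, n_topics), nn.LogSoftmax(dim=-1))
    loss_fn = nn.KLDivLoss(reduction="batchmean")
    opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)

    bow = torch.rand(32, vocab)                  # bag-of-words input (toy)
    lda_theta = torch.softmax(torch.rand(32, n_topics), dim=-1)  # LDA output

    loss = loss_fn(dnn(bow), lda_theta)          # match LDA's distributions
    loss.backward()
    opt.step()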
17:30-19:00 | Weihua Wang, Feilong Bao and Guanglai Gao ABSTRACT: In this paper, we first create a manually annotated Cyrillic Mongolian named entity corpus. The annotation types comprise person names, location names, organization names and other proper names. We then use Conditional Random Fields as the classifier and design several categories of features for Mongolian, including orthographic, morphological, gazetteer, syllable and word cluster features. Experimental results show that all the proposed features improve the overall system performance, with stem features improving it the most. Finally, combining all the features, our model obtains its optimal performance.
17:30-19:00 | Chao Lv, Yansong Feng and Dongyan Zhao ABSTRACT: In this paper, we propose a machine learning approach to the purchase prediction task launched by the Alibaba Group. Specifically, we treat the task as a binary classification problem and explore five kinds of features to learn a model of the influence of historical behaviors: user quality, item quality, category quality, user-item interaction and user-category interaction. Due to the nature of the mobile platform, time and spatial factors receive special consideration. Our approach ranked 26th among 7186 teams in this task.
17:30-19:00 | Yaocheng Gui, Qian Liu, Man Zhu and Zhiqiang Gao ABSTRACT: Distant supervision is an efficient approach for various tasks, such as relation extraction. Most of the recent literature on distantly supervised relation extraction generates labeled data by heuristically aligning knowledge bases with text corpora, and then trains supervised relation classification models by statistical learning. However, extracting long-tail relations from the automatically labeled data remains a challenging problem even with big data. Inspired by explanation-based learning (EBL), this paper proposes an EBL-based approach to tackle this problem. The proposed approach can learn relation extraction rules effectively from unlabeled data. Experiments on the New York Times corpus demonstrate that our approach outperforms the baseline approach, especially on long-tail data.
17:30-19:00 | Bo Xu, Hongfei Lin, Mingzhen Zhao, Zhihao Yang, Jian Wang and Shaowu Zhang ABSTRACT: In recent years, adverse drug reactions have drawn more and more attention from the public; they can cause great damage to public health and massive economic losses to society. As a result, it is a great challenge to detect potential adverse drug reactions before and after drugs are put on the market. With the development of the Internet, health-related social networks have accumulated large amounts of users' comments on drugs, which can contribute to detecting adverse drug reactions. To this end, we propose a novel framework to detect potential adverse drug reactions based on health-related social networks. In our framework, we first extract mentions of diseases and adverse drug reactions from users' comments using conditional random fields with different levels of features, and then filter out drug indications and known adverse drug reactions using external biomedical resources to obtain the potential adverse drug reactions. On this basis, we propose a modified Skip-gram model to discover proteins associated with the potential adverse drug reactions, which helps biomedical experts determine the authenticity of the potential adverse reactions. Extensive experiments based on DailyStrength show that our framework is effective for detecting potential adverse drug reactions from users' comments.
17:30-19:00 | Te Luo, Yujie Zhang, Jinan Xu and Yufeng Chen ABSTRACT: Chinese dependency parsing lacks a large manually annotated dependency treebank. Some unsupervised methods that use large-scale unannotated data have been proposed, but they inevitably introduce too much noise from automatic annotation. To solve this problem, this paper proposes an approach that iteratively integrates unsupervised features when training a Chinese dependency parsing model. Considering that more errors occur when parsing longer sentences, we divide the raw data by sentence length and train the model iteratively: the model trained on shorter sentences is used in the next iteration to analyze longer sentences. We adopt a character-based dependency model for joint word segmentation, POS tagging and dependency parsing in Chinese. The advantage of the joint model is that each task can be promoted by the others during processing, by exploiting their available internal results. Higher accuracy of the three tasks on shorter sentences brings about higher accuracy of the whole model. We verified the proposed approach on the Penn Chinese Treebank and two raw corpora. The experimental results show that the F1-scores of the three tasks improved at each iteration, and the F1-score of dependency parsing increased by 0.33% compared with the conventional method.
17:30-19:00 | Gongbo Tang, Gaoqi Rao, Dong Yu and Endong Xun ABSTRACT: Distributed representations have recently become the most popular way to capture semantic and syntactic features, and they are widely used in various natural language processing tasks. Function words express grammatical or structural relationships with other words in a sentence. However, previous work has either treated function words the same as content words or simply neglected them, without any experimental analysis of their role. In this paper, we explore the effect of function words on word embeddings through a word analogy reasoning task and a paraphrase identification task. The results show that neglecting function words affects syntactic and semantic tasks differently, increasing or decreasing accuracy; moreover, the model used to train the word embeddings also matters.
17:30-19:00 | Yinfeng Zou, Chunping Ouyang, Yongbin Liu and Xiaohua Yang ABSTRACT: HowNet is a popular platform for Chinese text similarity calculation. Our study finds that the HowNet architecture, its organization of vocabulary and its concept descriptions still have some shortcomings for word similarity measurement. Hence, on the basis of analyzing the generality and individuality of words in HowNet, a similarity algorithm based on the generality and individuality of words is proposed. The experimental data come from the NLPCC-ICCPOL 2016 Chinese word similarity evaluation task dataset. Experimental results show that the algorithm is feasible and stable, and better than some of the other classic algorithms; moreover, the size of the experimental dataset has little influence on the results. Across all experiments, the Pearson correlation coefficient and the Spearman coefficient stably reach 0.460 and 0.440.
17:30-19:00 | Jian Peng, Xiaohua Yang, Chunping Ouyang and Yongbin Liu ABSTRACT: Feature selection algorithms play an important role in text categorization. Considering the drawbacks of traditional and recently improved information gain (IG) approaches, an improved IG feature selection method based on relative document frequency distribution is proposed. It reduces the impact of unbalanced datasets and low-frequency features by combining the frequency distribution of features within a category with the relative document frequency distribution of features across categories. The experimental results on the NLPCC-ICCPOL 2016 stance detection in Chinese microblogs task show that the improved method performs better in feature selection than the traditional IG approach and another improved method.
17:30-19:00 | Bin Hao, Min Zhang, Weizhi Ma, Jiashen Sun, Yiqun Liu, Shaoping Ma, Xuan Zhu and Hengliang Luo ABSTRACT: Nowadays users like to share their opinions on products, services or policies in social media, which is important for manufacturers and governments collecting feedback from the crowds. In microblogs, however, information is highly unbalanced: many posts are published and spread by ghost-writers/spammers, sellers, official accounts, etc., and information provided by the true crowds is frequently overwhelmed. Previous studies mostly concern how to find one specific type of user, but do not investigate how to filter out multiple types of specific users so as to keep only the true crowds, which is the main topic of this work. In this paper, we first categorize four different types of users, namely ghost-writers, sellers, official accounts and end-users (the former three are called advertisers in a broad sense in this paper), and study their characteristics. Then we propose a topic-specific-divergence-based model to filter out advertisers so that end-users can be kept. Meta-information and content are investigated in a comparative analysis. Encouraging experimental results on a real dataset clearly verify that the proposed approach significantly outperforms state-of-the-art methods.
17:30-19:00 | Dongxu Zhang and Dong Wang ABSTRACT: Convolutional neural networks (CNNs) have delivered competitive performance on relation classification, without tedious feature engineering. A particular shortcoming of CNNs, however, is that they are less powerful in modeling long-span relations. This paper presents a model based on recurrent neural networks (RNNs) and compares the capabilities of CNNs and RNNs on the relation classification task. We conducted a thorough comparative study on two datasets: the popular SemEval-2010 Task 8 dataset, and the KBP37 dataset we designed based on MIML-RE, with the goal of learning and testing complex relations. The experimental results strongly indicate that even with a simple RNN structure, the model delivers much better performance than a CNN, particularly for long-span relations.
17:30-19:00 | Xiao-Bo Jin, Guang-Gang Geng, Kaizhu Huang and Zhi-Wei Yan ABSTRACT: Entity search is a new application meeting either precise or vague requirements from search engine users. The Baidu Cup 2016 Challenge provided just such a chance to tackle the entity search problem. We achieved first place in average MAP score over the 4 tasks: movie, tvShow, celebrity and restaurant. In this paper, we propose a series of similarity features based on both word frequency features and word semantic features, and describe our ranking architecture and experimental details.
17:30-19:00 | Kun Li, Yumei Chai, Hongling Zhao, Xiaofei Nan and Yueshu Zhao ABSTRACT: De-identification of electronic health records is a prerequisite to distributing medical records for further clinical data processing or mining. In this paper, we introduce a framework based on recurrent neural networks to solve the de-identification problem, and compare state-of-the-art methods with our framework. The framework is integrated, comprising record skeleton generation, chunk representation and protected-information labeling. We evaluate our framework on three datasets: two English datasets from the i2b2 de-identification challenge and a Chinese dataset we created. To the best of our knowledge, we are the first to apply an RNN model to the Chinese de-identification problem. The experimental results indicate that our framework not only achieves high performance but also generalizes strongly.
17:30-19:00 | Teng Wang, Xueqiang Lv, Xun Ma, Pengyan Sun, Zhian Dong and Jianshe Zhou ABSTRACT: Given a query, the goal is to search a given set for entities that conform to the facts described in the query. To this end, this paper proposes a matching method based on classification and semantic expansion. The algorithm first classifies the query string into one of three categories and extracts keywords for each category. The keywords are then expanded into a set of matching words based on the word2vec word vector model. Finally, we compute a score for every entity with a weighted matching method and return results ranked by score. In experiments the method achieves an accuracy of 63.2%; it has good applicability and, to a certain extent, reduces the retrieval failure rate caused by colloquial and diverse queries.
17:30-19:00 | Yunfang Wu and Wei Li ABSTRACT: Word similarity computation is a fundamental task in natural language processing. We organized a semantic campaign on Chinese word similarity measurement at NLPCC-ICCPOL 2016. The task provides a dataset of Chinese word similarity, including 500 word pairs with their similarity scores. In total, 21 teams submitted 24 systems to this campaign. In this paper, we describe the data preparation, introduce the task setup, give an in-depth analysis of the evaluation results, and briefly introduce the participating systems.
17:30-19:00 | Qingying Sun, Zhongqing Wang, Qiaoming Zhu and Guodong Zhou ABSTRACT: In this paper, we describe our participation in the fourth shared task (NLPCC-ICCPOL 2016 Shared Task 4) on stance detection in Chinese microblogs (subtask A). Going beyond ordinary features, we explore four kinds of linguistic features in our stance classifier, covering lexical, morphological, semantic and syntactic aspects of Chinese microblogs, and achieve a good performance that ranks third among sixteen systems.
17:30-19:00 | Ke Sun, Tingting Li, Shiqi Zhao, Yajuan Lv, Yansong Feng, Xiaojun Wan and Dongyan Zhao ABSTRACT: Baidu Cup 2016 challenges participants to tackle the problem of entity search in the scenario of Duer. In this paper, we present an overview of this challenge, including an overview of the participants, the definition of the task, how we prepared the challenge data, and the final challenge results.
17:30-19:00 | Yabin Leng, Weiwei Liu, Sheng Wang and Xiaojie Wang ABSTRACT: This paper describes our system for Chinese word segmentation of microblog text, one of the NLPCC-ICCPOL 2016 shared tasks. A CRF (Conditional Random Field) model is employed to cast word segmentation as a sequence labeling problem, and 7 sets of features are selected to train the model. With weighted measures, the system achieves an F_b of 0.8041 on the closed track, 0.82417 on the semi-open track, and 0.8306 on the open track.
17:30-19:00 | Shutian Ma, Xiaoyong Zhang and Chengzhi Zhang ABSTRACT: Many Chinese word similarity measures have been introduced, since word similarity is a fundamental issue in various natural language processing tasks. Previous work focused mainly on using existing semantic knowledge bases or large-scale corpora. However, knowledge bases and corpora have limitations in coverage and data freshness, so ensemble learning is used to improve performance by combining similarities. This paper describes a Chinese word similarity measure using ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods build on TYCCL and HowNet, while two corpus-based methods compute similarities via retrieval on web search engines and via deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation suggests that the TYCCL-based method behaves best on the test dataset; however, with appropriately tuned parameters, ensemble learning can outperform all the other algorithms. Besides, deep learning on news corpora is better than the other corpus-based methods.
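A hedged sketch of the combination step: the individual similarity scores become features for support vector regression that predicts the gold similarity rating. scikit-learn is assumed, and all scores and ratings below are toy values, not the task data.

    # Sketch: combine per-method similarities with support vector regression.
    from sklearn.svm import SVR

    # Each row: [tyccl, hownet, web, embedding] similarity for one word pair.
    X_train = [[0.9, 0.8, 0.7, 0.85],
               [0.2, 0.3, 0.4, 0.25],
               [0.6, 0.5, 0.6, 0.55]]
    y_train = [9.1, 2.4, 6.0]                # gold similarity ratings (toy)

    model = SVR(kernel="rbf").fit(X_train, y_train)
    print(model.predict([[0.7, 0.6, 0.5, 0.65]]))  # combined similarity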
17:30-19:00 | Xiaojun Wan, Jianmin Zhang, Jin-ge Yao and Tianming Wang ABSTRACT: Live webcast scripts are valuable resources for describing the process of sports games. This shared task aims to automatically generate sports news articles from live webcast scripts. The task can be considered a special case of single document summarization. In this overview paper, we will introduce the task, the evaluation dataset, the participating teams and the evaluation results. The dataset has been released publicly.
17:30-19:00 | Maofu Liu, Qiaosong Qi, Huijun Hu and Han Ren ABSTRACT: With the dramatic increase of live webcast scripts about sports, there is an urgent demand to write and publish a sports news article immediately after a sports game. So far, however, sports news articles are usually written by human experts or journalists, and manual writing of sports news is time-consuming and inefficient. This paper describes our system for the sports news generation from live webcast scripts task. Our system extracts the important events occurring over time from the live webcast scripts according to rules, and generates a brief summary of the football matches from the scripts. Based on the characteristics of live webcast scripts, we adopt an approach of sentence extraction and template generation. The evaluation results show that our system is feasible for sports news generation from live webcast scripts.
17:30-19:00 | Linjie Wang, Yu Zhang and Ting Liu ABSTRACT: With the growing scale of knowledge bases, it is important to be able to answer questions over a knowledge base. Much work has been done for English, but research on answering questions over a Chinese knowledge base is relatively scarce. In this paper, we introduce a method to extract answers from a Chinese knowledge base for Chinese questions. Our method uses a classifier, trained on question-relation pairs, to judge whether the relation in a triple is what the question asks about. Since identifying the right relation is difficult, we find the focus of the question and leverage a lexical paraphrase resource in question preprocessing; using lexical paraphrases also alleviates the out-of-vocabulary (OOV) problem. To place the right answer at the top of the candidates, we present a ranking method over the candidate answers. The final evaluation shows that our method achieves a good result.
17:30-19:00 | Nan Yu, Da Pan, Meishan Zhang and Guohong Fu ABSTRACT: In this paper, we present a stance detection system for NLPCC-ICCPOL 2016 Shared Task 4. Our system determines whether the author of a Weibo text is in favor of the given target, against it, or neither. In contrast to traditional target/aspect sentiment analysis, the given target may not appear in the Weibo text. We model the task as a classification problem, exploiting LSTMs as the basic part of the classifier; the average F-score of our system is 56.56%.
17:30-19:00 | Xipeng Qiu, Peng Qian and Zhan Shi ABSTRACT: In this paper, we give an overview of the shared task at the 5th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2016): Chinese word segmentation for micro-blog texts. Different from the popular newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. Besides, we also use a new psychometric-inspired evaluation metric for Chinese word segmentation, which aims to balance the very skewed word distribution at different levels of difficulty.
17:30-19:00 | Ruifeng Xu, Yu Zhou, Dongyin Wu, Lin Gui, Jiachen Du and Yun Xue ABSTRACT: We present, for the first time, a shared task on stance detection in Chinese microblogs: automatically determining from text whether the author is in favor of the given target, against the given target, or whether neither inference is likely. The text may or may not contain the target of interest, and the opinion expressed may or may not be towards the target of interest. We designed two tasks. Task A is a mandatory supervised task detecting stance towards five targets of interest with given labeled data. Task B is an optional unsupervised task given only some unlabeled data as examples. Sixteen teams participated in the shared task, and five of them submitted results for Task B. The highest F-scores obtained were 0.7106 for Task A and 0.4687 for Task B.
17:30-19:00 | Fengyu Yang, Liang Gan, Aiping Li, Dongchuan Huang, Xiaohui Chou and Hongmei Liu ABSTRACT: This paper presents a system that learns to answer single-relation questions on a broad range of topics from a knowledge base using a three-layer learning architecture. The system first learns a topic phrase detection model, based on a phrase-entity dictionary, to detect which phrase is the topic phrase of the question. The second layer learns several answer ranking models, covering both convolutional neural network (CNN) and information retrieval (IR) models. The last layer re-ranks the scores output by the second layer and returns the highest-scored answer. Trained on pairs of questions and structured representations of their answers, our system yields competitive results on the NLPCC 2016 KBQA shared task.
17:30-19:00 | Fangying Wu ABSTRACT: Document-based question answering (DBQA) is a sub-task of open-domain question answering, targeted at selecting the answer sentence(s) from the given documents for a question. In this paper, we propose a hybrid approach to selecting answer sentences that combines existing models via a rank SVM model. Specifically, we capture the inter-relationship between the question and answer sentences from three aspects: surface string similarity, deep semantic similarity, and relevance based on information retrieval models. Our experimental results show that an improved retrieval model outperforms the other methods, including the deep learning models; applying a rank SVM model to combine all these features, we achieve 0.8120 in mean reciprocal rank (MRR) and 0.8111 in mean average precision (MAP) on the open test.
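A hedged sketch of the rank SVM combination: feature vectors (string similarity, semantic similarity, retrieval score) of correct and incorrect answer sentences are turned into pairwise differences that train a linear classifier, whose weights then score candidates. scikit-learn is assumed, and all feature values are toy numbers, not the paper's data.

    # Sketch: rank SVM via the pairwise transform over candidate features.
    import numpy as np
    from sklearn.svm import LinearSVC

    # Feature rows: [string_sim, semantic_sim, retrieval_score] per candidate.
    pos = np.array([[0.8, 0.7, 1.9], [0.6, 0.8, 1.5]])   # answer sentences
    neg = np.array([[0.3, 0.2, 0.4], [0.4, 0.1, 0.7]])   # non-answers

    # Pairwise transform: a positive minus a negative should score > 0.
    X = np.vstack([p - n for p in pos for n in neg] +
                  [n - p for p in pos for n in neg])
    y = [1] * (len(pos) * len(neg)) + [0] * (len(pos) * len(neg))

    ranker = LinearSVC().fit(X, y)
    scores = pos @ ranker.coef_.ravel()       # higher = better candidate
    print(scores)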
17:30-19:00 | Benyou Wang, Jiabin Niu, Liqun Ma, Yuhua Zhang, Lipeng Zhang, Jingfei Li, Peng Zhang and Dawei Song ABSTRACT: Document-based question answering systems, which need to semantically match short text pairs, have gradually become an important topic in natural language processing and information retrieval. Question answering over English corpora has developed rapidly with the use of deep learning, whereas effective systems customized for Chinese deserve more attention. We therefore build a question answering system customized for Chinese for the NLPCC QA task. In our approach, the ordered sequential information of the text and the deep semantic matching of Chinese text pairs are captured by count-based traditional methods and an embedding-based neural network. The ensemble strategy achieves a good performance, much stronger than the provided baselines.
17:30-19:00 | Nan Duan