Best Papers:
Oct. 13, 2015 (09:30–10:20),
Treasure Palace Hotel - Palace Grand Ballroom (碧丽宫),
Chair: Juanzi LI

09:30-09:55
Yujie Cao, Minlie Huang and Xiaoyan Zhu
ABSTRACT: Clustering sentiment phrases in product reviews makes it possible to obtain the most important information about a product directly from thousands of reviews. A sentiment phrase has two main components, the aspect word and the opinion word, and the two parts need to be clustered simultaneously. Although several methods have been proposed to cluster words or phrases, limited work has been done on clustering two-dimensional sentiment phrases. In this paper, we apply a two-sided hidden Markov random field (HMRF) model to this task. We use a constrained co-clustering approach with some prior knowledge, in a semi-supervised setting. Experimental results on sentiment phrases extracted from about 0.7 million mobile phone reviews show that this method is promising for the task and that it outperforms the baselines by a clear margin.

09:55-10:20
Baolin Peng, Kaisheng Yao, Jing Li and Kam-Fai Wong
ABSTRACT: Recurrent Neural Networks (RNNs) have become increasingly popular for the task of language understanding. In this task, a semantic tagger is deployed to associate a semantic label with each word in an input sequence. The success of RNNs may be attributed to their ability to memorise long-term dependencies that relate the current semantic label prediction to observations many time steps away. However, the memory capacity of simple RNNs is limited because of the vanishing and exploding gradient problem. We propose to use an external memory to improve the memorisation capability of RNNs. Experiments on the ATIS dataset demonstrate that the proposed model achieves state-of-the-art results. Detailed analysis may provide insights for future research.

Fundamentals:
Oct. 12, 2015 (13:30–15:10),
Treasure Palace Hotel - Four Seasons (二楼四季),
Chair: Guodong ZHOU

13:30-13:50
Chenxi Zhu, Xipeng Qiu and Xuanjing Huang
ABSTRACT: Long-distance dependency relations are one of the main challenges for state-of-the-art transition-based dependency parsing algorithms. In this paper, we propose a method to improve the performance of transition-based parsing with long-distance collocations. With these long-distance collocations, our method provides an approximate global view of the entire sentence, somewhat similar to top-down parsing. To further improve the accuracy of decisions, we extend the set of parsing actions with two more fine-grained actions based on the types of arcs. Experimental results show that our method improves parsing performance effectively, especially for long sentences.

13:50-14:10
Lvexing Zheng, Houfeng Wang and Xueqiang Lv
ABSTRACT: Lexical semantic information plays an important role in supervised dependency parsing. In this paper, we add lexical semantic features to the feature set of a parser, obtaining improvements on the Penn Chinese Treebank. We extract semantic categories of words from HowNet and use them as the semantic information of words. Moreover, we investigate a method to compute semantic similarity between Chinese compound words, obtaining semantic information for words that are not recorded in HowNet. Our experiments show that unlabeled attachment scores can increase by 1.29%.

14:10-14:30
Rui Lin, Muyun Yang, Shujie Liu, Sheng Li and Tiejun Zhao
ABSTRACT: This paper introduces a maximum entropy method for Discourse Coherence Modeling (DCM). Different from the state-of-the-art supervised entity-grid model and the unsupervised cohesion-driven model, the proposed model takes only lexical features as input, which increases training and decoding speed significantly. We conduct an evaluation on two publicly available benchmark data sets via sentence ordering tasks, and the results confirm the effectiveness of our maximum entropy based approach to DCM.

14:30-14:50
Chengjie SUN, Xiaoqiang Jin, Lei Lin, Yuming Zhao and Xiaolong Wang
ABSTRACT: In this paper, convolutional neural networks are employed for English article error correction. Instead of employing features that rely on human ingenuity and prior natural language processing knowledge, the words surrounding the article are taken as features. Our approach can be trained both on an error-annotated corpus and on a corpus without error annotations. Experiments are conducted on the CoNLL-2013 data set. Our approach achieves 38.10% in F1 and outperforms the best system (33.40%) that participated in the task. Experimental results demonstrate the effectiveness of our proposed approach.

14:50-15:10
Chang Su, Shuman Huang and Yijiang Chen
ABSTRACT: Previous work on metaphor interpretation has mostly focused on single-word verbal metaphors and ignored the influence of contextual information, leading to limitations such as ignoring the polysemy of metaphors. In this paper, we propose aspect-based semantic relatedness and present a novel metaphor interpretation method based on semantic relatedness for context-dependent nominal metaphors. First, we obtain the possible comprehension aspects according to the properties of the source domain. Then, combined with contextual information, we calculate the degree of relatedness between the target and source domains from different aspects. Finally, we select the aspect that maximizes the relatedness between the target and source domains as the comprehension aspect, and the metaphor explanation is formed with the corresponding property of the source domain. The results show that our method has higher accuracy. In particular, when the corpus contains insufficient information on the target domain, our method still performs well.

Sentiment Analysis:
Oct. 12, 2015 (13:30–15:10),
Treasure Palace Hotel - No.5 Meeting Room (二楼五号),
Chair: Ruifeng XU

13:30-13:50
Jiaying Song, Xu Huang and Guohong Fu
ABSTRACT: Feature selection and representation have been a key issue for polarity classification. This paper presents a lexical sentiment membership based feature representation for Chinese polarity classification under the framework of fuzzy set theory. To this end, we first use TF-IDF weighted words to construct the positive and negative polarity memberships for each feature word. Then, we compute the log-ratio of the two memberships. Finally, we take the membership log-ratios as features and build a polarity classifier based on support vector machines. We evaluate our approach over different datasets, including a corpus of reviews on automobile products, the NLPCC2014 data for sentiment classification evaluation and the IMDB film comments. The experimental results show that the proposed sentiment membership feature representation outperforms Boolean features, frequency-based features and word embedding based features.
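
The membership construction described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the authors' code; the smoothing constant and the exact membership definition are assumptions.

import math
from collections import defaultdict

def membership_log_ratios(docs, labels, tfidf, alpha=1.0):
    """Sketch: fuzzy sentiment membership per word from TF-IDF mass.

    docs:   list of token lists
    labels: parallel list, +1 for positive documents, -1 for negative
    tfidf:  dict mapping word -> TF-IDF weight (assumed precomputed)
    Returns dict word -> log(pos_membership / neg_membership).
    """
    pos_mass, neg_mass = defaultdict(float), defaultdict(float)
    for tokens, y in zip(docs, labels):
        for w in tokens:
            weight = tfidf.get(w, 0.0)
            if y > 0:
                pos_mass[w] += weight
            else:
                neg_mass[w] += weight
    ratios = {}
    for w in set(pos_mass) | set(neg_mass):
        # alpha smooths words seen in only one polarity class
        ratios[w] = math.log((pos_mass[w] + alpha) / (neg_mass[w] + alpha))
    return ratios

The log-ratios would then be used as feature values for an SVM, one dimension per feature word.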

13:50-14:10
Guoyong Cai and Binbin Xia
ABSTRACT: Recently, user generated multimedia content (e.g. text, image, speech and video) on social media is increasingly used to share experiences and emotions; for example, a tweet usually contains both text and images. Compared to analyzing the sentiment of texts and images separately, the combination of text and image may reveal tweet sentiment more adequately. Motivated by this rationale, we propose a method based on convolutional neural networks (CNN) for multimedia sentiment analysis of tweets consisting of text and images. Two individual CNN architectures are used for learning textual features and visual features, which are then combined as input to another CNN architecture that exploits the internal relation between text and image. Experimental results on two real-world datasets demonstrate that the proposed method achieves effective performance on multimedia sentiment analysis by capturing the combined information of texts and images.

14:10-14:30
Shaowu Zhang, Huali Liu, Liang Yang and Hongfei LIN
ABSTRACT: Cross-domain sentiment analysis addresses problems where the source domain and the target domain are different. Traditional sentiment classification approaches usually perform poorly on cross-domain problems, so this paper proposes a cross-domain sentiment classification method based on the extraction of key sentiment sentences. Firstly, based on the observation that not every part of a document is equally informative for inferring the sentiment orientation of the whole document, the concept of the key sentiment sentence is defined. Secondly, taking advantage of three properties, sentiment purity, keyword property and position property, we construct heuristic rules and combine them with machine learning to extract key sentiment sentences. The data is then divided into key and detail views, and integrating the two views effectively improves performance. Finally, experimental results show the superiority of our proposed method.

14:30-14:50
Cuijuan Liu, Zhen Liu, Yanjie Chai, Hao Fang and Liangping Liu
ABSTRACT: With the rapid development of the Internet, sentiment analysis of microblogs is becoming one of the important subjects in the study of big data. Existing research focuses on emotional tendency but lacks a detailed description of the various kinds of emotions, and so cannot intuitively reflect the emotional changes of social groups. An emotion analysis method based on the combination of dependency parsing and manual tagging is proposed, and facial expression animation is used to present the results of the emotion analysis. The emotions of microblog crowds in different areas toward different social events are visualized. The experimental results show that the model can closely and effectively simulate crowd emotion, and the research results provide a new way to analyze network public opinion based on big data.

14:50-15:10
Junhui Shen, Peiyan Zhu, Rui Fan, Wei Tan and Xueyan Zhan
ABSTRACT: With Western culture and science widely accepted in China, Traditional Chinese Medicine (TCM) has become a controversial issue, so it is important to study the public's sentiment and opinions on TCM. The rapid development of online social networks, such as Twitter, makes it convenient and efficient to sample hundreds of millions of people for the aforementioned sentiment study. To the best of our knowledge, the present work is the first attempt to apply sentiment analysis to TCM on Sina Weibo (a Twitter-like microblogging service in China). In our work, we first collected tweets on TCM topics from Sina Weibo and automatically labelled the tweets as supporting or opposing TCM based on user tags. Then, a Support Vector Machine classifier was built to predict the sentiment of TCM tweets without tags. Finally, we present a method to adjust the classifier results. Our method attains an F-measure of 97%.

Shared Task:
Oct. 12, 2015 (13:30–15:10),
Treasure Palace Hotel - V18 Meeting Room (四楼V18),
Chair: Xiaojun WAN

13:30-13:50
Jiayuan Chao, Zhenghua Li and Min Zhang
ABSTRACT: This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous data sources to boost the performance of statistical models. This work considers three sets of heterogeneous data, i.e., Weibo (WB, 10K sentences), Penn Chinese Treebank 7.0 (CTB7, 50K), and People's Daily (PD, 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine WB, CTB7, and PD, boosting the F1 score from 93.76% (baseline model trained on only WB) to 95.58% (+1.82%). For POS tagging, we adopt an ensemble approach combining coupled sequence labeling and the guide-feature based method, since the three datasets have three different annotation standards. First, we convert PD into the annotation style of CTB7 based on coupled sequence labeling, denoted by PD_CTB. Then, we merge CTB7 and PD_CTB to train a POS tagger, denoted by Tag_{CTB7+PD_CTB}, which is further used to produce guide features on WB. Finally, the tagging F1 score is improved from 87.93% to 88.99% (+1.06%).

13:50-14:10
Jinwei Yuan, Yan Yang, Zhen Jia, Hongfeng Yin, Junfu Huang and Jie Zhu
ABSTRACT: For the task of Entity Recognition and Linking in Chinese Search Queries at NLP&CC 2015, this paper proposes solutions to entity recognition, entity linking and entity disambiguation. A dictionary, an online knowledge base and the SWJTU Chinese word segmenter are used in entity recognition. A synonym thesaurus, Wikipedia redirects and the combination of an improved PED (Pinyin Edit Distance) algorithm with LCS (Longest Common Subsequence) are applied in entity linking. Suffix supplement and link value computation based on online encyclopedias are adopted in entity disambiguation. The experimental results indicate that the proposed solutions are effective for short queries with insufficient context.

14:10-14:30
Kerui Min, Chenggang Ma, Tianmei Zhao and Haiyan Li
ABSTRACT: Chinese word segmentation and POS tagging are arguably the most fundamental tasks in Chinese natural language processing. In this paper, we present an ensemble approach for segmentation and POS tagging, combining both discriminative and generative methods to get the best of both worlds. Our approach achieved F1-scores of 96.65% and 91.55% for segmentation and tagging respectively in NLPCC 2015 Shared Task 1, obtaining 1st place in both tasks.

14:30-14:50
Zhonglin Ye, Zhen Jia, Yan Yang, Junfu Huang and Hongfeng Yin
ABSTRACT: For the open domain question answering evaluation task at the fourth CCF Natural Language Processing and Chinese Computing Conference (NLPCC 2015), a solution for automatic question answering that can answer natural language questions is proposed. Firstly, an SPE (Subject Predicate Extraction) algorithm is presented to find answers in the knowledge base, and then a WKE (Web Knowledge Extraction) algorithm is used to extract answers from search engine results. The experimental data provided in the evaluation task includes the knowledge base and questions in natural language. The evaluation results show an MRR of 0.5670, an accuracy of 0.5700 and an average F1 of 0.5240, indicating that the proposed method is feasible for open domain question answering.

Chinese Language Computing I:
Oct. 12, 2015 (15:40–17:00),
Treasure Palace Hotel - Four Seasons (二楼四季),
Chair: Endong XUN

15:40-16:00
Yunfang Wu, Fuqiang Wan, Yifeng Xu and Xueqiang Lv
ABSTRACT: This paper proposes a novel method for sentence-level Chinese discourse tree building. We construct a Chinese discourse annotated corpus in the framework of Rhetorical Structure Theory, and we propose a ranking-like SVM (SVM-R) model to automatically build the tree structure, which can capture the relative association strength among three consecutive text spans rather than only two adjacent spans as most previous approaches do. The experimental results show that our proposed SVM-R method significantly outperforms the state of the art in discourse parsing accuracy. We also demonstrate that the features useful for discourse tree building are consistent with the characteristics of the Chinese language.

16:00-16:20
Fenfen Shang, Yanhui Gu, Weiguang Qu, Bin Li and Junsheng Zhou
ABSTRACT: Semantically understanding words is an essential issue in text understanding, since it is one of its basic building blocks. However, there are plenty of unknown words, which makes it difficult for users to understand the content of texts. We focus on guessing the senses of Chinese unknown words based on the "Semantic Knowledge-base of Modern Chinese". Firstly, we introduce different levels of semantic dictionaries. Based on the new dictionaries, we introduce three models for sense prediction. We then integrate the models to predict the senses of unknown words and obtain better prediction performance. We also perform semantic prediction and annotation of the unknown words in the People's Daily corpus published in 2000 using each model. Finally, we obtain corpus resources with sense annotations for unknown words.

16:20-16:40
Huating Xu, Yujie Zhang, Xiaohui Yang, Hua Shan, Jinan Xu and Yufeng Chen
ABSTRACT: Chinese word segmentation systems trained on annotated newspaper corpora often degrade markedly in performance when faced with a new domain. Since there is no large-scale annotated corpus for the target domain, statistics-based methods do not work well. In this paper, we attack domain adaptation of Chinese word segmentation by combining active learning with n-gram statistical features. Our idea is to select a small amount of data for annotation such that the gap between the news domain and the target domain is overcome. The word segmentation model is then retrained on the corpus augmented with the newly annotated data and the n-gram statistical features from the raw corpus. We use a CRF model for training and a raw corpus of one million sentences of patent descriptions to verify the proposed approach. For test data, 300 sentences are randomly selected and manually annotated. The experimental results show that the performance of the Chinese word segmentation system based on our approach improves on all evaluation metrics.

16:40-17:00
Yang Wei, Jinmao Wei and Hengpeng Xu
ABSTRACT: To tackle the sparse data problem of the bag-of-words model for document representation, the Context Vector Model (CVM) has been proposed to enrich a document with the relatedness of all the words in a corpus to the document. The essence of CVM is the combination of word vectors, so the representation method used for words is essential to CVM. A computational study is performed in this paper to compare the effects of the newly proposed word representation methods embedded in CVM. The experimental results demonstrate that some of the newly proposed word representation methods significantly improve the performance of CVM, because they better estimate the relatedness between words.

Machine Translation:
Oct. 12, 2015 (15:40–17:00),
Treasure Palace Hotel - No.5 Meeting Room (二楼五号),
Chair: Deyi XIONG

15:40-16:00
Qiang Li, Mu Li, Dongdong Zhang and Jingbo Zhu
ABSTRACT: Due to data sparsity and the limited size of bilingual data, many high-quality phrase pairs cannot be generated. This paper generates example-based phrase pairs by decomposing, substituting and recombining the phrase pairs produced by the typical phrase extraction method in phrase-based statistical machine translation. On Chinese-to-English newswire and oral translation tasks, the experimental results demonstrate that our methods achieve significant improvements, yielding a gain of about 1 BLEU point on some test sets.

16:00-16:20
Haiqing Tang and Deyi Xiong
ABSTRACT: Existing phrase-based statistical machine translation (SMT) uses rather limited semantic knowledge, so the translation quality for a verb and its long-distance object is low. The authors propose a selectional-preference based translation model for SMT, which induces the semantic constraints that a verb imposes on its object in order to select the proper argument head word for a predicate over long distances. We first train on the corpus to obtain conditional-probability based selectional preferences for verbs, then integrate the selectional preferences into a phrase-based translation system and evaluate on a Chinese-to-English translation task with large-scale training data. Experimental results show that integrating selectional preferences into SMT can effectively capture long-distance semantic dependencies and improve translation quality.

16:20-16:40
Qinglin Li, Shujie Liu, Rui Lin, Mu Li and Ming Zhou
ABSTRACT: Nowadays a knowledge base (KB) is viewed as one of the important infrastructures for many web search applications and NLP tasks. However, in practice the availability of KB data varies from language to language, which greatly limits the potential usage of knowledge bases. In this paper, we propose a novel method to construct or enrich a knowledge base by entity translation with the help of another KB compiled in a different language. In our work, we concentrate on two key tasks: 1) collecting translation candidates with as good coverage as possible from various sources such as the web or lexicons; 2) building an effective disambiguation algorithm based on a collective inference approach over the knowledge graph to find the correct translation for entities in the source knowledge base. We conduct experiments on the movie domain of our in-house knowledge base from English to Chinese, and the results show the proposed method can achieve very high translation precision compared with classical translation methods, and significantly increase the volume of the Chinese knowledge base in this domain.

16:40-17:05*
Turdi Tohti, Winira Musajan and Askar Hamdulla
ABSTRACT: This paper puts forward a new idea and related algorithms for Uyghur segmentation. In this algorithm, word-based bi-gram statistics and contextual information are derived automatically from a large-scale raw corpus, and, according to Uyghur word association rules, a linear combination of mutual information, the difference of t-test and dual adjacent entropy is taken as a new measurement to estimate the association strength between two adjacent Uyghur words. Weakly associated inter-word positions are found and taken as segmentation points, yielding word strings that are complete in both semantics and structure. Experimental results on a large-scale corpus show that the proposed algorithm achieves 88.21% segmentation accuracy.
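
As a rough illustration of this kind of association measure, the sketch below scores adjacent word pairs with a weighted combination of pointwise mutual information and adjacent entropy, then cuts at weak points. The weights, the entropy definition and the threshold are assumptions for the sketch, not the authors' settings (their measure also includes the difference of t-test).

import math
from collections import Counter, defaultdict

def association_scores(corpus, a=1.0, b=1.0):
    """corpus: list of tokenized sentences. Returns (w1, w2) -> score."""
    uni, bi = Counter(), Counter()
    right = defaultdict(Counter)  # words seen right of w1, for adjacent entropy
    for sent in corpus:
        uni.update(sent)
        for w1, w2 in zip(sent, sent[1:]):
            bi[(w1, w2)] += 1
            right[w1][w2] += 1
    n = sum(uni.values())
    scores = {}
    for (w1, w2), c in bi.items():
        pmi = math.log(c * n / (uni[w1] * uni[w2]))
        total = sum(right[w1].values())
        ent = -sum((k / total) * math.log(k / total) for k in right[w1].values())
        scores[(w1, w2)] = a * pmi + b * ent
    return scores

def segment(sent, scores, threshold=1.0):
    """Cut the sentence wherever adjacent association falls below threshold."""
    chunks, cur = [], [sent[0]]
    for w1, w2 in zip(sent, sent[1:]):
        if scores.get((w1, w2), 0.0) < threshold:
            chunks.append(cur)
            cur = []
        cur.append(w2)
    chunks.append(cur)
    return chunks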

Evaluation/Open Fund Workshop (Chinese Session):
Oct. 12, 2015 (15:40–17:00),
Treasure Palace Hotel - V18 Meeting Room (四楼V18),
Chair: Zhi TANG

15:40-16:00
NLPCC Task Debriefing
Xiaojun WAN

16:00-16:20
Secure Search over Massive Private Text Data in Cloud Storage Environments (云存储环境下海量隐私文本数据的安全产寻方法)
Wei SONG

16:20-16:40
Research on Key Problems in Formula-Semantics-Based Mathematical Search (基于公式语义的数学搜索关键问题研究)
Wei SU

16:40-17:00
Dynamic Glyph Generation and OpenType Font Research for Xiangxi Folk Hmong Characters (湘西民间苗文的字形动态生成方法及OpenType字库研究)
Liping MO

Poster/Demo Presentations and Banquet:
Oct. 12, 2015 (17:30–21:10),
Treasure Palace Hotel - Palace Grand Ballroom (碧丽宫),

17:30-19:00
Lin Zhao, Ning Li, Qi Liang and Xin Peng
ABSTRACT: The basic idea of re-flowable document understanding and automatic typesetting is to generate logical documents by judging the hierarchical relationship of physical units and logical tags based on the identification of logical paragraph tags in the re-flowable document. In order to overcome the shortcomings of conventional logical structure reconstruction methods, a novel logical structure reconstruction method for re-flowable documents based on directed graphs is proposed in this paper. The method extracts the logical structure from a template document and then uses a single-source shortest path algorithm on the directed graph to filter out redundant logical tags, thus solving the problem of logical structure reconstruction of a document. Experimental results show that the algorithm can effectively improve the accuracy of logical structure recognition.

17:30-19:00
Tengju Ye, Zhipeng Xie and Ang Li
ABSTRACT: Canonical Correlation Analysis (CCA) is a standard statistical technique for finding linear projections of two arbitrary vectors that are maximally correlated. In complex situations, the linearity of CCA is not applicable. In this paper, we propose a novel local method for CCA to handle non-linear situations. We aim to find a series of local linear projections instead of a single global one. We evaluate the performance of our method and CCA on two real-world datasets. Our experiments show that the local method outperforms the original CCA in several realistic cross-modal multimedia retrieval tasks.

17:30-19:00
Chu Wang, Shi Feng, Daling Wang and Yifei Zhang
ABSTRACT: Most existing sentiment analysis methods focus on single-label classification, which means only an exclusive sentiment orientation (negative, positive or neutral) or an emotion state (joy, hate, love, sorrow, anxiety, surprise, anger, or expectation) is considered for the given text. However, multiple emotions with different intensities may coexist in one document, one paragraph or even one sentence. In this paper, we propose a fuzzy-rough set based approach to detect multi-labeled emotions and calculate their corresponding intensities in social media text. Using the proposed fuzzy-rough set method, we can simultaneously model multiple emotions and their intensities with sentiment words for a sentence, a paragraph, or a document. Experiments on a well-known blog emotion corpus show that our proposed multi-labeled emotion intensity analysis algorithm outperforms baseline methods by a large margin.

17:30-19:00
Yingbin Liu, Yannan Sun and Endong Xun
ABSTRACT: Chinese calligraphy alignment establishes a correspondence between two Chinese characters by measuring their similarity under certain rules and applying a transformation accordingly. This paper presents an innovative method to align two glyph contours in three steps. First, the 2D Bézier curve control points of each character's glyph contours are expanded into 3D space; second, a Gaussian Mixture Model (GMM) is constructed from this 3D point set; finally, we establish the alignment by minimizing the Euclidean (L2) distance between the two GMMs and then apply the transformation accordingly. Expansion into 3D space helps exploit inherent constraints of Chinese calligraphy beyond 2D coordinates, and the advantage of using a Gaussian mixture model is that both the overall shape and the local writing features are maintained during the alignment process. Experiments were conducted to verify the feasibility and effectiveness of this method, and the results show that it performs well for both single strokes and whole characters.
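
The L2 distance between two Gaussian mixtures has a closed form, because the integral of a product of two Gaussians is itself a Gaussian density evaluated at one of the means. A minimal sketch of that computation follows; equal-weight isotropic components are an assumption made for brevity, and the paper's exact model may differ.

import numpy as np
from scipy.stats import multivariate_normal

def gauss_overlap(m1, s1, m2, s2):
    # Integral of N(x; m1, s1*I) * N(x; m2, s2*I) dx = N(m1; m2, (s1+s2)*I)
    d = len(m1)
    return multivariate_normal.pdf(m1, mean=m2, cov=(s1 + s2) * np.eye(d))

def gmm_l2_distance(mus_f, mus_g, s=1.0):
    """Squared L2 distance between two equal-weight isotropic GMMs.

    mus_f, mus_g: (n, 3) arrays of component means (the 3D control points).
    Uses the expansion int (f-g)^2 = int f^2 - 2 int f*g + int g^2,
    where each term is a sum of pairwise Gaussian overlaps.
    """
    def cross(A, B):
        wa, wb = 1.0 / len(A), 1.0 / len(B)
        return sum(wa * wb * gauss_overlap(a, s, b, s) for a in A for b in B)
    return cross(mus_f, mus_f) - 2 * cross(mus_f, mus_g) + cross(mus_g, mus_g)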

17:30-19:00
Yuan Liao, Xiaoqing Lu, Zhi TANG, Yongtao Wang and Jianling Sun
ABSTRACT: Symmetry is a significant structural property of Chinese characters. However, limited texture features and the lack of efficient quantitative description methods hinder us from fully understanding and tapping into the symmetry of Chinese characters. This study proposes a symmetry detection method that combines different types of character features, such as the scale-invariant feature transform (SIFT) and contour information. A directed graph is constructed from the basic symmetric elements of a character to describe the enhancement relationships among the elements. Furthermore, the detection of the most significant axes of symmetry in a character is transformed into the problem of finding star subgraphs with locally maximum weight. Experiments show that the proposed method outperforms existing methods on the dataset we established.

17:30-19:00
Yixiu Wang, Yunfang Wu and Xueqiang Lv
ABSTRACT: We present a multi-sentence question segmentation strategy for community question answering services to alleviate the complexity of long questions. We develop a complete scheme for complex-question segmentation, including a question detector to extract question sentences, a question compression process to remove duplicate information, and a graph model to segment multi-sentence questions. In the graph model, we train an SVM classifier to compute the initial weights and we calculate the authority of each vertex to guide the propagation. The experimental results show that our method achieves a good balance between completeness and redundancy of information, and significantly outperforms state-of-the-art methods.

17:30-19:00
Changge Chen, Hai Zhao and Yang Yang
ABSTRACT: This paper focuses on improving a specific opinion spam detection task, deceptive spam. In addition to traditional word forms and other shallow syntactic features, we introduce two types of deep linguistic features. The first type is derived from a shallow discourse parser trained on the Penn Discourse Treebank (PDTB), which can capture inter-sentence information. The second type is based on the relationship between sentiment analysis and spam detection. Experimental results on the benchmark dataset demonstrate that both of the proposed deep features improve performance over the baseline.

17:30-19:00
Maolin Li and Yongmei Tan
ABSTRACT: Entity linking is the process of linking name mentions in text to their referent entities in a knowledge base. This paper tackles the task with an approach based on topic-sensitive random walk with restart. Firstly, the context information of mentions is used to expand the mentions and to search the Wikipedia knowledge base for candidate entities; secondly, a graph is constructed from the intermediate results of the previous step; finally, the topic-sensitive random walk with restart model is used to rank the candidate entities, and the top one is chosen as the linked entity. Experimental results show that this approach achieves an F-score of 0.623 on the KBP2014 data set, higher than that of every other system mentioned in this paper, improving the performance of the entity linking system.

17:30-19:00
Xi Xu, Mao Ye, Zhi TANG, Jianbo Xu and Liangcai Gao
ABSTRACT: With the arrival of digital newspapers, user-oriented special topic generation has become extremely important for satisfying users' requirements both functionally and emotionally. We propose a practical automatic special topic generation system for digital newspapers based on users' interests. Firstly, it extracts a subject heading vector for the topic of interest by filtering out function words, localizing Latent Dirichlet Allocation (LDA) and training the LDA model. Secondly, it removes semantically repetitive vector components by constructing a synonym word map. Lastly, it organizes and refines the special topic according to the similarity between candidate news articles and the topic, and the density of topic-related terms. The experimental results show that the system combines simple operation with high accuracy, and that it is stable enough to be applied to user-oriented special topic generation in practical applications.

17:30-19:00
Mingyang Li, Yao Shi, Zhigang Wang and Yongbin Liu
ABSTRACT: Cross-lingual knowledge bases are very important for global knowledge sharing. However, there are few Chinese-English knowledge bases due to the following reasons: 1) the scarcity of Chinese knowledge in existing cross-lingual knowledge bases; 2) the limited number of cross-lingual links; 3) incorrect relationships in the semantic taxonomy. In this paper, a large-scale cross-lingual knowledge base (named XLORE) is built to address the above problems. In particular, XLORE integrates four online wikis, English Wikipedia, Chinese Wikipedia, Baidu Baike and Hudong Baike, to balance the knowledge volume in different languages, employs a link-discovery method to augment the cross-lingual links, and introduces a pruning approach to refine the taxonomy. In total, XLORE harvests 663,740 classes, 56,449 properties, and 10,856,042 instances, of which 507,042 entities are cross-lingually linked. Finally, we provide an online cross-lingual knowledge base system supporting two ways to access XLORE, namely a search engine and a SPARQL endpoint.

17:30-19:00
Wei Chen and Bo Xu
ABSTRACT: Hierarchical phrase-based translation models have advanced statistical machine translation (SMT). Because such models can improve the leveraging of syntactic information, two types of methods (leveraging source parsing and leveraging shallow parsing) have been applied to introduce syntactic constraints into translation models. In this paper, we propose a bilingually-constrained recursive neural network (BC-RNN) model to combine the merits of these two types of methods. First we perform supervised learning on a manually parsed corpus using the standard recursive neural network (RNN) model. Then we employ unsupervised bilingually-constrained tuning to improve the accuracy of the standard RNN model. Leveraging the BC-RNN model, we introduce both source parsing and shallow parsing information into a hierarchical phrase-based translation model. The evaluation demonstrates that our proposed method outperforms other state-of-the-art statistical machine translation methods on the National Institute of Standards and Technology 2008 (NIST 2008) Chinese-English machine translation test data.

17:30-19:00
Qing Xia, Xin Yan, Zhengtao Yu and Shengxiang Gao
ABSTRACT: Named entity equivalents play a significant role in cross-language information processing. However, limited by corpus resources, few in-depth studies have been made on the extraction of bilingual Chinese-Khmer named entity equivalents. In view of this, this paper proposes a Wikipedia-based approach that utilizes the internal web links in Wikipedia and computes feature similarity to extract bilingual Chinese-Khmer named entity equivalents. The experimental results show good performance when the entity equivalents are acquired through the internal web links in Wikipedia, with an F value of up to 90.67%. The results are also quite favorable when the bilingual Chinese-Khmer named entity equivalents are acquired through the computation of feature similarity, demonstrating that the method proposed in this paper is effective.

17:30-19:00
Yijiang Chen, Tingting Zhu, Chang Su and Xiaodong Shi
ABSTRACT: In this paper, we transform the issue of Chinese-English tense conversion into the issue of tagging a Chinese tense tree. We then propose a Markov Tree Tagging Model to tag the nodes of the untagged tense tree with English tenses. Experimental results show that the method is much better than linear CRF tagging for this problem.

17:30-19:00
Liping Du, Xiaoge Li, Gen Yu, Chunli Liu and Rui Liu
ABSTRACT: Chinese word segmentation is fundamental to Chinese natural language processing and information extraction. With the rapid development of Web 2.0 technology, recognition of new internet words is the main problem and bottleneck for Chinese segmentation. We present an unsupervised method for identifying new internet words in large-scale web corpora, which combines an improved point-wise mutual information measure, the PMI^k algorithm, with some basic rules. This method can recognize new internet words of length 2 to n (where n can be set as needed). In experiments on a 257 MB Baidu Tieba corpus, the precision of our system reached 97.39% when the parameter of the PMI^k algorithm was set to 10, an increase of 28.79% over the plain PMI method, showing that our system is effective and efficient at detecting new words in large-scale web corpora. Compiling the discovered new words into a user dictionary and loading it into ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), in an experiment on a 10 KB Baidu Tieba corpus, precision, recall and F-measure improved by 7.93%, 3.73% and 5.91% respectively compared to ICTCLAS alone, showing that new word discovery can significantly improve segmentation performance on web corpora. Keywords: new word recognition; unknown word; PMI; improved PMI algorithm; Chinese word segmentation.
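
For reference, the PMI^k family raises the joint probability to the k-th power, which rewards frequent, cohesive pairs that plain PMI under-ranks. A minimal sketch follows; reading the abstract's parameter value 10 as the exponent k is an assumption.

import math

def pmi_k(p_xy, p_x, p_y, k=1):
    """PMI^k(x, y) = log( p(x,y)^k / (p(x) * p(y)) ).

    With k = 1 this is plain PMI; larger k penalizes rare,
    coincidental co-occurrences relative to frequent ones.
    """
    return math.log(p_xy ** k / (p_x * p_y))

# Plain PMI (k = 1) ranks the rare coincidental pair above the
# frequent cohesive one; with k = 3 the ordering flips.
for k in (1, 3):
    frequent = pmi_k(1e-3, 2e-3, 2e-3, k)   # seen often together
    rare     = pmi_k(1e-6, 1e-6, 1e-6, k)   # seen once together
    print(k, frequent > rare)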

17:30-19:00
Ping Yang, Hongxu Hou, Yupeng Jiang, Zhipeng Shen and Jian Du
ABSTRACT: Chinese to Slavic Mongolian named entity translation is of great importance for cross-lingual information processing between Chinese and Slavic Mongolian. However, applying machine translation directly does not achieve satisfactory results. To solve this problem, a novel approach is proposed to extract Chinese-Slavic Mongolian named entity pairs automatically. Only the Chinese named entities need to be identified; all candidate named entity pairs are then extracted with a sliding window method based on HMM word alignment results. Finally, all candidate named entity translation units are filtered by a maximum entropy model integrating four features, and the most probable aligned Slavic Mongolian NE is chosen for each Chinese NE. Experimental results show that this approach outperforms the HMM model and yields high-quality Chinese-Slavic Mongolian named entity pairs with relatively high precision, even when the word alignment result is only partially correct.

17:30-19:00
Amandyk Kartbayev
ABSTRACT: Word alignment plays an important role in the training of statistical machine translation systems. We present a technique to refine word alignments at the phrase level after collecting sentences from Kazakh-English parallel corpora. The estimation technique extracts phrase pairs from the word alignment and then incorporates them into the translation system for further steps. Although it is an important step in the training procedure, word alignment often runs into practical difficulties with agglutinative languages. We consider an approach that is a step towards an improved statistical translation model incorporating morphological information, with better translation performance. Our goal is to present a statistical model of this morphology-dependent procedure, which was evaluated on the Kazakh-English language pair and obtained an improved BLEU score over state-of-the-art models.
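
Phrase-pair extraction from a word alignment is commonly done with the standard consistency criterion (Koehn et al., 2003): a source span and a target span form a pair if no alignment link crosses the box they define. Below is a compact sketch of that criterion, not the paper's own refinement step; the toy Kazakh-English example is illustrative only.

def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """src, tgt: token lists; alignment: set of (i, j) links.

    Returns consistent phrase pairs up to max_len source words.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target span covered by links from src[i1..i2]
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # consistency: no link from outside the source span
            # may point inside the target span
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.append((src[i1:i2 + 1], tgt[j1:j2 + 1]))
    return pairs

print(extract_phrase_pairs(["мен", "сені", "көремін"],
                           ["i", "will", "see", "you"],
                           {(0, 0), (1, 3), (2, 1), (2, 2)}))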

17:30-19:00
Chunyue Zhang and Tiejun Zhao
ABSTRACT: Distributed word representations have been found to be highly effective for extracting a bilingual lexicon from comparable corpora via a simple linear transformation. However, polysemous words often vary in meaning at different time points in the corresponding corpora. A single word representation learned from the whole corpus cannot express the temporal change of word meaning very well. This paper proposes a simple solution that exploits temporal distributed word representations for polysemous words. The experimental results confirm that the proposed solution offers better performance on the English-to-Chinese bilingual lexicon extraction task.
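
The linear-transformation approach mentioned here, in the style of Mikolov et al.'s translation matrix, fits a matrix W mapping source embeddings to target embeddings from a seed lexicon and then translates by nearest neighbour. A minimal sketch under those assumptions:

import numpy as np

def fit_translation_matrix(X, Y):
    """Least-squares W minimizing ||X W - Y||_F^2.

    X: (n, d_src) source embeddings of seed pairs
    Y: (n, d_tgt) target embeddings of the same pairs
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translate(x, W, tgt_vocab, tgt_vecs):
    """Map a source vector and return the nearest target word (cosine)."""
    z = x @ W
    sims = (tgt_vecs @ z) / (
        np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(z) + 1e-12)
    return tgt_vocab[int(np.argmax(sims))]

A time-aware variant, as the abstract suggests, would learn embeddings per time slice and fit W on slice-matched seed pairs.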

17:30-19:00
Jing Li, Zhongyu Wei, Hao Wei, Kangfei Zhao, Junwen Chen and Kam-Fai Wong
ABSTRACT: Microblogging websites have emerged as centers of information production and diffusion, where people can get useful information from other users' microblog posts. In the era of big data, we are overwhelmed by the sheer number of microblog posts. To make good use of these informative data, an effective search tool specialized for microblog posts is required. However, microblog search is not trivial for the following reasons: 1) microblog posts are noisy and time-sensitive, rendering general information retrieval models ineffective; 2) conventional IR models are not designed to consider microblog-specific features. In this paper, we propose to use learning to rank models for microblog search. We combine content-based, microblog-specific and temporal features in learning to rank models, which are found to model microblog posts effectively. To study the performance of learning to rank models, we evaluate our models on the tweet data sets provided by the TREC 2011 and TREC 2012 microblog tracks, comparing against three state-of-the-art information retrieval baselines: the vector space model, the language model and the BM25 model. Extensive experimental studies demonstrate the effectiveness of learning to rank models and the usefulness of integrating microblog-specific and temporal information for microblog search.

17:30-19:00
Yanyan Zhao, Bing Qin and Ting Liu
ABSTRACT: Current online public opinion analysis systems can discover many hot events and present the public emotion distribution for each event, which is useful for governments and companies. However, the public emotion distributions are only a shallow analysis of hot events; more and more people want to know the hidden causation behind the emotion distributions. Thus, this paper presents a deep Event-Emotion analysis system for microblogs that reveals what causes the different emotions around a hot event. We use several related sub-events to describe a hot event from different perspectives; these sub-events, combined with their different emotion distributions, can be used to explain the overall emotion distribution of the hot event. Experiments on 15 hot events show that this idea is a reasonable way to exploit emotion causation and can help people better understand the evolution of a hot event. Furthermore, the deep Event-Emotion analysis system also tracks the volume trends and emotion trends of a hot event, and presents deep analysis based on user profiles.

17:30-19:00
Linlin Shi, Likun Qiu and Shiyong Kang
ABSTRACT: Phrase structure trees and dependency trees are two forms of treebank and can be converted into each other. In this paper, we transform dependency trees into phrase structure trees and detect annotation errors automatically based on manual rules. The method has been used in processing the Peking University Multi-view Chinese Treebank (PMT). Although PMT had been manually checked twice before being processed by our method, 1529 errors were detected among the 50275 sentences, with a precision of 100%. The errors mainly belong to three types: word segmentation errors, mismatches between POS and syntactic role, and syntactic role errors. This method can further improve treebank quality and can be applied to other dependency treebanks.

17:30-19:00
Yichu Zhou, Shujian Huang, Xin-Yu DAI and Jiajun CHEN
ABSTRACT: Coordinate structures are linguistic structures consisting of two or more conjuncts, which usually compose into a larger constituent as a whole. However, the boundary of each conjunct is difficult to identify, which makes it difficult to parse the whole coordinate structure and larger structures. In labeled data such as the Penn Chinese Treebank (CTB), coordinate structures are not labeled explicitly, which makes the problem more complicated. In this paper, we treat resolving coordinate structures as an independent sub-problem of parsing. We first define coordinate structures explicitly and design rules to extract them from labeled CTB data. Then a specifically designed grammar is proposed for automatic parsing of coordinate structures. We propose two groups of new features to better model coordinate structures in a shift-reduce parsing framework. Our approach achieves a 15% improvement in F1 score on resolving coordinate structures.

17:30-19:00
Zhengxian Gong and Guodong Zhou
ABSTRACT: Document-level machine translation (MT) has been drawing more and more attention due to its potential for resolving sentence-level ambiguities and inconsistencies with the benefit of wide-range context. However, the lack of simple yet effective evaluation metrics largely impedes the development of such document-level MT systems. This paper proposes to improve traditional MT evaluation metrics with simplified lexical chains, modeling document-level phenomena from the perspective of text cohesion. Experiments show the effectiveness of this method in evaluating document-level translation quality and its potential for integration with traditional MT evaluation metrics to achieve higher correlation with human judgments.

17:30-19:00
Yanhui Gu, Weiguang Qu, Yonggen Wang, Suoliang Jiang, Junsheng Zhou and Yunfei Long
ABSTRACT: Effective and efficient retrieval of similar spatial textual objects plays an important role in many location-based applications, such as Jiepang and Foursquare. Most such services focus on how to integrate spatial and textual information to efficiently retrieve the top-k results, yet few of them address the effectiveness issue. In this paper, we propose a semantics-aware strategy which can effectively and efficiently retrieve the top-k similar spatial textual objects within a general framework. Extensive experimental evaluation demonstrates that our proposal outperforms the state-of-the-art approach.
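
A common baseline for this kind of query scores each object by a weighted blend of spatial proximity and textual similarity and keeps the k best. The sketch below is that generic baseline only; the blending weight and similarity functions are assumptions, not the paper's semantics-aware strategy.

import heapq
import math

def top_k(query_loc, query_terms, objects, k=5, alpha=0.5, max_dist=10.0):
    """objects: list of (id, (x, y), set_of_terms). Returns the best k objects.

    score = alpha * spatial proximity + (1 - alpha) * Jaccard similarity
    """
    def score(loc, terms):
        d = math.dist(query_loc, loc)
        spatial = max(0.0, 1.0 - d / max_dist)   # 1 at the query, 0 far away
        union = query_terms | terms
        textual = len(query_terms & terms) / len(union) if union else 0.0
        return alpha * spatial + (1 - alpha) * textual
    return heapq.nlargest(k, objects, key=lambda o: score(o[1], o[2]))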

17:30-19:00
Xiangdong Su, Guanglai Gao, Yupeng Jiang, Jing Wu and Feilong Bao
ABSTRACT: Inflection suffixes are an important morphological characteristic of Mongolian words, since the suffixes express abundant syntactic and semantic meanings. In order to provide an informative introduction, this paper carries out a case study. Through three Mongolian NLP tasks, we examine: (1) views of inflection suffixes in NLP tasks, (2) ways of processing inflection suffixes, (3) the effect of inflection suffixes on system performance, and (4) some suffix-related conclusions.

17:30-19:00
Maofu Liu, Limin Wang and Liqiang Nie
ABSTRACT: The past several years have witnessed the rapid development of social media services, and UGC (User Generated Content) has increased dramatically, such as tweets on Twitter and posts on Sina Weibo. In this paper, we describe our system for the NLPCC 2015 Weibo-oriented Chinese news summarization task. Our model is based on multi-feature combination to automatically generate a summary for a given news article. In our system, we mainly utilize four kinds of features to compute the significance score of a sentence: term frequency, sentence position, sentence length and the similarity between the sentence and the news article title. The summary sentences are then chosen from the news article according to the significance score of each sentence. The evaluation results on the Weibo news document sets show that our system is effective for Weibo-oriented Chinese news summarization and outperforms all the other systems.
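
A linear combination of the four features named above is easy to sketch; the weights and normalizations here are illustrative assumptions, not the tuned values of the submitted system.

from collections import Counter

def significance(sentence, position, n_sents, title, doc_tf,
                 w=(0.4, 0.3, 0.1, 0.2)):
    """Score one tokenized sentence by the four features in the abstract."""
    tf = sum(doc_tf[t] for t in sentence) / (len(sentence) or 1)  # term frequency
    pos = 1.0 - position / n_sents                                 # earlier is better
    length = min(len(sentence) / 20.0, 1.0)                        # capped length
    overlap = len(set(sentence) & set(title)) / (len(set(title)) or 1)  # title sim.
    return w[0] * tf + w[1] * pos + w[2] * length + w[3] * overlap

def summarize(sentences, title, k=3):
    doc_tf = Counter(t for s in sentences for t in s)
    n = max(doc_tf.values())
    doc_tf = {t: c / n for t, c in doc_tf.items()}  # normalize to [0, 1]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: significance(sentences[i], i, len(sentences),
                                               title, doc_tf),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]  # keep original order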

17:30-19:00
Jun Li, Jinxian Pan, Chen Ye, Yong Huang, Zhichun WANG and Danlu Wen
ABSTRACT: This paper presents our approach for the NLPCC 2015 shared task on Entity Recognition and Linking in Chinese Search Queries. The proposed approach takes a query as input and generates ranked mention-entity links as results. It combines several different metrics to evaluate the probability of each entity link, including entity relatedness in the given knowledge graph and document similarity between the query and the virtual document of the entity in the knowledge graph. In the evaluation, our approach achieves 33.2% precision and 65.2% recall, ranking 6th among the 14 teams by average F1-measure.

17:30-19:00
Zhengting Yu, Xin-Yu DAI, Si Shen, Shujian Huang and Jiajun CHEN
ABSTRACT: This paper describes the model we designed for the Chinese word segmentation task of NLPCC 2015. We first apply a word-based perceptron algorithm to build the base segmenter. Then, we use a bootstrap aggregating (bagging) model, which improves the segmentation results consistently on the three tracks of the closed, semi-open and open tests. Considering the characteristics of Weibo text, we also perform rule-based adaptation before decoding. Finally, our model achieves F-scores of 95.12% on the closed track, 95.3% on the semi-open track and 96.09% on the open track.

17:30-19:00
Gongbo TANG, Yuting Guo, Dong Yu and Endong XUN
ABSTRACT: In this paper, we construct an entity recognition and linking system using Chinese Wikipedia and a knowledge base. We utilize refined filter rules in the entity recognition module, and then generate candidate entities from a search engine and from attributes in Wikipedia article pages. In the entity linking module, we propose a hybrid entity re-ranking method combining three features: textual and semantic match degree, the similarity between the candidate entity and the entity mention, and entity frequency. Finally, we obtain the linking results from each entity's final score. In the task of entity recognition and linking in search queries at NLPCC 2015, this method achieved an average F1 of 61.1% on the 3849-query test dataset, ranking second among fourteen teams.

17:30-19:00
Xin Zhou
ABSTRACT: A trie is an ordered tree data structure used to store a dynamic set or associative array whose keys are usually strings. It makes the search and update of words more efficient and is widely used in the construction of English dictionaries for the storage of English vocabulary. In big data applications, efficiency determines the availability and usability of a system. In this paper, I introduce the p-trie, a novel trie structure that can be used for polysemantic data not limited to English strings. I apply the p-trie to the storage of Japanese vocabulary and evaluate its performance through experiments.
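
For readers unfamiliar with the base structure, a standard trie looks like the sketch below; the p-trie described in the abstract extends this idea to polysemantic keys, and its details are not reproduced here.

class TrieNode:
    def __init__(self):
        self.children = {}   # one edge per character
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
t.insert("辞書")          # works for Japanese keys just as well
print(t.search("辞書"))   # True
print(t.search("辞"))     # False: a stored prefix is not a word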

Chinese Language Computing II:
Oct. 13, 2015 (14:00–15:20),
Treasure Palace Hotel - Four Seasons (二楼四季),
Chair: Yansong FENG

14:00-14:20
Xudong Chen, Zhouhui Lian, Jianguo Xiao and Yingmin Tang
ABSTRACT: How to automatically and accurately extract strokes from Chinese characters is one of the most important and challenging tasks in document analysis, computer graphics and pattern recognition. However, no benchmark is currently available to evaluate the performance of stroke extraction methods. In this paper, we present a benchmark, which includes a manually constructed database and evaluation tools, to address this problem. Specifically, the database contains a number of images of Chinese characters rendered in four commonly used font styles, along with the corresponding stroke images manually segmented from the character images. The performance of a given stroke extraction method can be evaluated by calculating the dissimilarity between the automatic segmentation results and the ground truth using two specially designed metrics. Moreover, we also propose a new method based on Delaunay triangulation to effectively extract strokes from Chinese characters. Experimental results comparing three algorithms demonstrate that the benchmark works well for the evaluation of stroke extraction approaches and that the proposed method performs considerably well for stroke extraction from Chinese characters.

14:20-14:40
Hao BAI
ABSTRACT: Extracting characters from digital ink text is an essential step toward more reliable recognition of text and is also a prerequisite for structured editing. The casualness and diversity of handwriting input result in unsatisfactory accuracy of the extracted characters, while reprocessing the initially extracted characters based on context yields considerable improvement. This paper therefore proposes an approach to adaptively extracting characters from digital ink text in Chinese based on extraction errors. The approach first classifies the errors in the primary extraction and then applies different operations according to the error type. Experimental data show that the approach is effective.

14:40-15:00
Bojia Liu, Jinan Xu, Yufeng Chen and Yujie Zhang
ABSTRACT: The main methods of machine transliteration are the phoneme-based method and the grapheme-based method, and both have limitations. The former needs multiple steps to transfer pronunciation between languages, which introduces errors; the latter ignores the importance of phonemes, which causes a loss of information. At the same time, a major difficulty for machine transliteration between different language systems lies in the performance of segmentation and the accuracy of alignment of transliteration units in the parallel corpus. Therefore, this paper proposes a new method for transliteration unit alignment that integrates the two main transliteration methods. Experimental results show that this method outperforms other methods in machine transliteration performance.

15:00-15:25*
Liping Mo and Kaiqing Zhou
ABSTRACT: To effectively solve the glyph generation and glyph description problems, a dynamic glyph generation method for Xiangxi folk Hmong characters is proposed. In this method, the glyph generation process is described as a combination arithmetic expression: Hmong character components act as the operands, and the positional relationship between components determines the operator. Glyphs of different structures can be dynamically generated by combining two or three components. Furthermore, if the combination arithmetic expression is converted to an ideographic description sequence (IDS), the proposed method can be implemented with the help of the IDS interpretation mechanism of the operating system. Test results show that Xiangxi Hmong character glyphs generated by a mapping script based on the proposed method can meet practical requirements.
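
For context, an ideographic description sequence prefixes components with Unicode ideographic description characters (U+2FF0–U+2FFB) that name the layout operator. The toy sketch below builds such prefix expressions for two common layouts; the component characters are illustrative CJK examples, not Hmong glyph data.

# Ideographic description characters (layout operators, prefix notation)
LEFT_RIGHT = "\u2FF0"   # ⿰ : left component + right component
ABOVE_BELOW = "\u2FF1"  # ⿱ : upper component + lower component

def ids(operator, *components):
    """Build an IDS string: operator followed by its two or three operands."""
    return operator + "".join(components)

# ⿰氵工 describes 江 (water radical on the left, 工 on the right)
print(ids(LEFT_RIGHT, "氵", "工"))
# ⿱爫子 describes a top-bottom composition
print(ids(ABOVE_BELOW, "爫", "子"))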

Event and Topic Models:
Oct. 13, 2015 (14:00–15:20),
Treasure Palace Hotel - No.5 Meeting Room (二楼五号),
Chair: Zhichun WANG

14:00-14:20
Shaohua Zhu, Peifeng Li and Qiaoming Zhu
ABSTRACT: Previous Chinese argument extraction approaches mainly focus on feature engineering and trigger expansion, which cannot exploit the inner relations between trigger mentions in the same document. To address this issue, this paper puts forward a novel trigger inference mechanism based on a Markov Logic Network. It uses head morphemes, the probabilities of a trigger mention filling true and pseudo events estimated from the training set, and the relationships between trigger mentions to infer those trigger mentions that lack effective context information or have low confidence. Experimental results on the ACE 2005 Chinese corpus show that our approach outperforms the baseline, with improvements of 3.65% and 2.51% in trigger identification and event type classification respectively.

14:20-14:40
Jiayue Teng, Peifeng Li and Qiaoming Zhu
ABSTRACT: Most pairwise resolution models for event co-reference focus on classification or clustering approaches, which ignore the relations between events in a document. This paper proposes a global optimization model for event co-reference resolution to resolve the inconsistent event chains produced by classifier-based approaches. The model regards co-reference resolution as an Integer Linear Programming problem and introduces various kinds of constraints, such as symmetry, transitivity, triggers, argument roles and event distances, to further improve performance. The experimental results show that our model outperforms the local classifier by 4.20% in F1-measure.
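
A generic ILP formulation of this kind of global co-reference step (not necessarily the authors' exact constraint set) can be written as follows, where p_{ij} is the pairwise classifier's co-reference probability for events i and j:

\max_{x}\ \sum_{i<j} \bigl[\, p_{ij}\, x_{ij} + (1 - p_{ij})(1 - x_{ij}) \,\bigr]
\quad \text{s.t.}\quad x_{ij} \in \{0, 1\},\qquad
x_{ij} + x_{jk} - 1 \le x_{ik} \ \text{(and its two permutations)}\ \ \forall\, i < j < k

Symmetry is implicit when x is indexed over unordered pairs; trigger, argument-role and event-distance compatibilities would enter as additional linear constraints on the x_{ij}.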

14:40-15:00
Jiabing Fu and Shoubin Dong
ABSTRACT: Current studies of story chains merely focus on the topical similarity and the importance of documents, while almost ignoring logical coherence and explainability. Facing the algorithmic complexity brought about by the exponential growth of news data sets, this paper constructs a story chain from a word coverage perspective. We take advantage of story comments to locate the turning point of each event, use topical similarity and sparsity differences together with an RPCA approach to model the documents logically, and finally apply random walks and graph traversal to quantify and construct an explainable and logically coherent story chain. A double-blind experiment shows that our method outperforms other algorithms.
|
15:00-15:20 |
Rui Cai, Miaohong Chen and Houfeng WANG
ABSTRACT: Topic models aim to analyze collections of documents and have been widely used in machine learning and natural language processing. Recently, researchers have proposed topic models for multilingual parallel or comparable documents. The symmetric correspondence Latent Dirichlet Allocation (SymCorrLDA) is one such model. Despite its advantages over some other existing multilingual topic models, it is a classic Bayesian parametric model and thus inherits the shortcomings of that family; for example, the number of topics must be specified in advance. Motivated by this, we extend the model and propose a Bayesian nonparametric version (NPSymCorrLDA). Experiments on Chinese-English datasets extracted from Wikipedia show significant improvement over SymCorrLDA.
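NPSymCorrLDA itself is not publicly available; as a stand-in for its nonparametric ingredient, the sketch below runs gensim's Hierarchical Dirichlet Process model, which likewise infers the number of topics from data, on a toy monolingual corpus:

```python
# Nonparametric topic modeling sketch with gensim (pip install gensim).
# HDP needs no num_topics argument: the topic count is inferred, which is
# the property the paper's NPSymCorrLDA brings to multilingual modeling.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

texts = [
    ["topic", "model", "document", "corpus"],
    ["chinese", "english", "parallel", "corpus"],
    ["dirichlet", "process", "topic", "inference"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

hdp = HdpModel(corpus, id2word=dictionary)
for line in hdp.print_topics(num_topics=3, num_words=4):
    print(line)
```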
|
NLP for Search:
Oct. 13, 2015(14:00–15:20),
Treasure Palace Hotel-V18 Meeting Room(四楼V18),
Chair: Zhengtao YU
|
14:00-14:20 |
Zhongda Xie and Yunqing Xia
ABSTRACT: Term dependency models are generally better than bag-of-words models, because complete concepts are often represented by multiple terms. However, without semantic knowledge, such models may introduce many false dependencies among terms, especially when the document collection is small and homogeneous (e.g. newswire or medical documents). The main contribution of this work is to incorporate semantic knowledge into term dependency models, so that more accurate dependency relations are assigned to the terms in a query. Experiments are conducted on the CLEF 2013 eHealth Lab medical information retrieval dataset, with the popular MRF (Markov Random Field) model [1], which has proven better than traditional independence models in general-domain search, as the baseline term dependency model. Results show that in medical document retrieval the full-dependency MRF model is worse than the independence model, but that it can be significantly improved by incorporating semantic knowledge.
Keywords: semantic knowledge, term dependency model, Markov Random Field, medical document retrieval
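A minimal sketch of the core idea, with a three-entry concept table standing in for a real medical ontology: a dependency between two query terms is kept only when semantic knowledge licenses it, instead of assuming full dependence:

```python
# Semantic filtering of MRF-style term dependencies. The concept table is an
# invented stand-in for a real semantic resource such as UMLS.
from itertools import combinations

CONCEPTS = {"heart": "cardio", "attack": "cardio", "aspirin": "drug"}

def dependency_pairs(query_terms):
    """Full-dependency pairs, filtered by shared semantic category."""
    return [(a, b) for a, b in combinations(query_terms, 2)
            if a in CONCEPTS and CONCEPTS.get(a) == CONCEPTS.get(b)]

print(dependency_pairs(["heart", "attack", "aspirin"]))
# [('heart', 'attack')] -- 'aspirin' no longer forms false dependencies
```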
|
14:20-14:40 |
Wei Song, Zhiyong Peng and Yihui Cui
ABSTRACT: Nowadays, more and more Internet users use cloud storage services to store their personal data, especially as mobile devices with limited storage capacity become widespread. With cloud storage services, users can access their personal data at any time and anywhere without storing it locally. However, the cloud storage provider is not completely trusted, so the first concern in using cloud storage is data security. A straightforward way to address the security problem is to encrypt the data before uploading it to the cloud server. Encryption keeps the data secret from the cloud server, but the server then can no longer manipulate the data, which greatly undermines the advantage of cloud storage. For example, if a user encrypts his personal data before uploading it, then whenever he wants to access some of it he has to download and decrypt all of it, incurring huge communication and computation overheads. Several related works enable search over encrypted data, but all of them support only encrypted keyword search. In this paper, we propose a new full-text retrieval algorithm over encrypted data for the cloud storage scenario, in which all the words in a document are extracted and built into a privacy-preserving full-text retrieval index. Based on this index, the cloud server can execute full-text retrieval over large-scale encrypted document collections. Numerical analysis and experimental results further validate the high efficiency and scalability of the proposed algorithm.
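The paper's index construction is not spelled out in the abstract; the sketch below shows one standard way such a privacy-preserving full-text index can work, mapping every word to a keyed pseudorandom token so the server matches tokens without seeing plaintext:

```python
# Minimal privacy-preserving full-text index sketch: every word of every
# document becomes an HMAC token under a client-held key. A real system must
# also hide posting-list sizes, encrypt the documents themselves, etc.
import hashlib
import hmac

KEY = b"client-secret-key"          # never leaves the client

def token(word):
    return hmac.new(KEY, word.encode("utf-8"), hashlib.sha256).hexdigest()

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):   # full text, not just keywords
            index.setdefault(token(word), []).append(doc_id)
    return index                                  # safe to upload to the cloud

docs = {1: "cloud storage security", 2: "full text retrieval in the cloud"}
index = build_index(docs)
print(index.get(token("cloud")))    # server-side lookup by token: [1, 2]
```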
|
14:40-15:00 |
Jingfei Li, Dawei Song and Peng Zhang
ABSTRACT: Session search aims to improve ranking effectiveness by incorporating user interaction information, including short-term interactions within one session and global interactions from other sessions (or other users). While various session search models have been developed and a large number of interaction features have been used, there has been no systematic investigation of how different features influence session search. In this paper, we classify typical interaction features into four categories (current query, current session, query change, and collective intelligence) and investigate their impact on session search performance through a systematic empirical study under the widely used learning-to-rank framework. One of our key findings, different from what has been reported in the literature, is that features based on the current query and collective intelligence have a more positive influence than features based on query change and the current session. This provides insights for the development of future session search techniques.
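A sketch of the kind of ablation such a study implies, not the authors' setup: features are grouped into the four categories and each group's contribution is measured by retraining a simple pointwise ranker without it, on randomly generated placeholder data:

```python
# Per-category feature ablation under a pointwise learning-to-rank stand-in.
# Column assignments, data and the ranker are placeholders for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

CATEGORIES = {                       # column indices per feature category
    "current_query": [0, 1], "current_session": [2, 3],
    "query_change": [4, 5], "collective_intelligence": [6, 7],
}
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.random(200)   # y: graded relevance

def fit_score(cols):
    model = GradientBoostingRegressor().fit(X[:, cols], y)
    return model.score(X[:, cols], y)

all_cols = sorted(c for cols in CATEGORIES.values() for c in cols)
full = fit_score(all_cols)
for name, cols in CATEGORIES.items():
    rest = [c for c in all_cols if c not in cols]
    print(f"drop {name}: delta = {full - fit_score(rest):+.3f}")
```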
|
15:00-15:20 |
Xiao-yun Li and Ying Yu
ABSTRACT: During a search task, a user's search intention may be expressed inaccurately; even with a clear information need, the search query often cannot precisely describe it, and the user cannot possibly browse all the returned results. A selective, valuable result list is therefore quite important for a search system. In fact, many reliable and highly relevant personal documents exist on a user's personal computer. From these desktop documents it is relatively easy to gauge the user's current knowledge of the present search subject, which helps predict the user's need. We propose an approach that exploits desktop context to refine the returned result list. First, to build a comprehensive long-term user model, the operational history and a series of time-related signals are analyzed to estimate the attention degree a user has paid to each document, while keywords and user tags are used to understand its content. Second, the working scenario, which directly suggests what the user is currently working on, is treated as the most valuable information for a short-term user model. Experimental results show that desktop context can effectively help refine search results, and that only an effective combination of the long-term and short-term user models offers more relevant items that satisfy the user.
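A rough sketch of the two user models under stated assumptions (an exponential recency decay for the long-term attention degree, a fixed interpolation weight for combining the models); all data and weights are invented:

```python
# Long-term model: recency-decayed operation counts per desktop document.
# Short-term model: a working-scenario score. Both rerank the result list.
import math
import time

HALF_LIFE = 7 * 24 * 3600            # one week, in seconds (assumed)

def attention(ops, now=None):
    """ops: list of (timestamp, weight) operations on one document."""
    now = now or time.time()
    return sum(w * math.exp(-math.log(2) * (now - t) / HALF_LIFE)
               for t, w in ops)

def rerank(results, long_term, short_term, alpha=0.6):
    """results: {doc: engine_score}; user-model scores per document."""
    blended = {d: s + alpha * long_term.get(d, 0.0)
                     + (1 - alpha) * short_term.get(d, 0.0)
               for d, s in results.items()}
    return sorted(blended, key=blended.get, reverse=True)

now = time.time()
lt = {"paper.pdf": attention([(now - 3600, 1.0), (now - 40 * 86400, 1.0)], now)}
print(rerank({"paper.pdf": 0.5, "misc.txt": 0.55}, lt, {"paper.pdf": 0.8}))
```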
|
NLP for Social Media:
Oct. 13, 2015(15:40–17:00),
Treasure Palace Hotel-Four Seasons(二楼四季),
Chair: Bin WANG
|
15:40-16:00 |
Qiao Zhang, Shuiyuan Zhang, Jian Dong, Jinhua Xiong and Xueqi Cheng
ABSTRACT: The rumor detection problem on social networks has attracted considerable attention in recent years. Most previous works detect rumors using shallow features of messages, such as content and blogger features, but such features cannot distinguish rumor messages from normal ones in many cases. In this paper we therefore propose an automatic rumor detection method that combines newly proposed implicit features with shallow features of the messages. The implicit features include popularity orientation, internal and external consistency, sentiment polarity and opinion of comments, social influence, opinion retweet influence, and match degree of messages. Experiments illustrate that our rumor detection method obtains significant improvement over state-of-the-art approaches, and that the proposed implicit features are effective for rumor detection on social networks.
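A skeletal version of the pipeline, with placeholder extractors where the paper's shallow and implicit features would go, and an off-the-shelf classifier:

```python
# Rumor detection as feature concatenation + classification. The extractors
# and toy messages are invented; the paper's implicit features (popularity
# orientation, consistency, comment sentiment, ...) would replace them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def shallow_features(msg):      # content + blogger features (placeholder)
    return [len(msg["text"]), msg["followers"]]

def implicit_features(msg):     # e.g. comment sentiment, retweet influence
    return [msg["comment_sentiment"], msg["retweet_influence"]]

msgs = [
    {"text": "breaking!!!", "followers": 10, "comment_sentiment": -0.8,
     "retweet_influence": 0.9, "rumor": 1},
    {"text": "city marathon this sunday", "followers": 5000,
     "comment_sentiment": 0.4, "retweet_influence": 0.1, "rumor": 0},
] * 20                                             # duplicated toy data

X = np.array([shallow_features(m) + implicit_features(m) for m in msgs])
y = np.array([m["rumor"] for m in msgs])
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:2]))       # should recover the two labels: [1 0]
```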
|
16:00-16:20 |
Feng Liu, Bingquan Liu, Chengjie SUN, Ming Liu and Xiaolong Wang
ABSTRACT: The link prediction problem in social networks is to estimate the value of a link that represents the relationship between social members. Researchers have proposed several methods for link prediction, using a wide range of features, but most of these models are learned from only one kind of data. In this paper, considering both link network structure and user comments, each of which reflects link value, we propose multimodal learning based approaches to predict link values. Experiments on datasets from typical social networks show that our model learns the joint representation of these data sources properly, and that the MDBN method outperforms other state-of-the-art link prediction methods.
|
16:20-16:40 |
Bo Jiang, Ying Sha and Lihong Wang
ABSTRACT: Mention is an important interactive behavior used to explicitly direct specific information to target users in social networks. Understanding user mention behavior can provide important insights into human social behavior and improve the design of social network platforms. However, most previous works focus on mentioning as a vehicle for information diffusion; few consider the problem of mention behavior prediction. In this paper, we propose an intuitive approach to predicting user mention behavior with a link prediction method. Specifically, we first formulate user mention prediction as a classification task, and then extract new features, including semantic interest match, social tie, mention momentum and interaction strength, to improve prediction performance. Extensive experiments on a Twitter dataset clearly show that our approach achieves a 15% increase in precision over the best baseline method.
|
16:40-17:00 |
Ya Su, Jie Liu and Yalou Huang
ABSTRACT: The authors design recognition features for online medical text that take the characteristics of the medical field into account, and carry out entity recognition experiments on a self-built dataset covering five common diseases: gastritis, lung cancer, asthma, hypertension and diabetes. In the experiments, a Conditional Random Field model is used for training and testing, and the target entities comprise five kinds: diseases, symptoms, drugs, treatment methods and examinations. The effectiveness of the proposed features is verified experimentally, achieving an overall precision of 81.26% and a recall of 60.18%. A further analysis of the recognition features is also given.
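A minimal sketch of this setup with the sklearn-crfsuite package; the feature function is a placeholder for the paper's medical-domain features, and the two training sentences are invented:

```python
# CRF-based medical entity recognition sketch (pip install sklearn-crfsuite).
import sklearn_crfsuite

def features(sent, i):
    """Placeholder feature function; the paper's domain features go here."""
    w = sent[i][0]
    return {"word": w, "is_first": i == 0,
            "prev": sent[i - 1][0] if i else "<s>"}

train = [
    [("胃炎", "B-DISEASE"), ("引起", "O"), ("腹痛", "B-SYMPTOM")],
    [("阿司匹林", "B-DRUG"), ("治疗", "O"), ("高血压", "B-DISEASE")],
]
X = [[features(s, i) for i in range(len(s))] for s in train]
y = [[tag for _, tag in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```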
|
Knowledge Acquisition and Applications:
Oct. 13, 2015(15:40–17:00),
Treasure Palace Hotel-No.5 Meeting Room(二楼五号),
Chair: Guangyou ZHOU
|
15:40-16:00 |
Mengdi Zhang, Tao Huang, Yixin Cao and Lei Hou
ABSTRACT: Frequently Asked Questions (FAQ) answering in restricted domains has attracted increasing attention in various areas. FAQ answering is the task of automatically responding to users' typical questions within a specific domain. Most research uses NLP parsers to analyze user intention and employs ontologies to enrich domain knowledge. However, syntactic analysis performs poorly on short and informal FAQ questions, and external ontology knowledge bases for specific domains are usually unavailable and expensive to construct manually. We propose a semi-automatic domain-restricted FAQ answering framework, SDFA, that relies on no external resources. SDFA detects the targets of questions to assist both fast domain knowledge learning and answer retrieval. The framework has been successfully applied in a real project in the banking domain. Extensive experiments on two large datasets demonstrate the effectiveness and efficiency of the approach.
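A toy sketch of the target-then-retrieve idea: a trivial keyword matcher stands in for SDFA's target detection, and TF-IDF retrieval is restricted to FAQ entries sharing the detected target. All entries and targets are invented:

```python
# Target-filtered FAQ retrieval sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

FAQ = [
    {"q": "how do I reset my card PIN", "target": "card"},
    {"q": "what is the card annual fee", "target": "card"},
    {"q": "how to open a savings account", "target": "account"},
]
TARGETS = ["card", "account"]

def detect_target(q):
    """Keyword stand-in for the paper's target detection."""
    return next((t for t in TARGETS if t in q), None)

def answer(query):
    cands = [e for e in FAQ if e["target"] == detect_target(query)] or FAQ
    vec = TfidfVectorizer().fit([e["q"] for e in cands] + [query])
    sims = cosine_similarity(vec.transform([query]),
                             vec.transform([e["q"] for e in cands]))[0]
    return cands[sims.argmax()]["q"]

print(answer("reset PIN on my card"))   # matched within the 'card' slice only
```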
|
16:00-16:20 |
Weiming Lu, Zhenyu Zhang, Renjie Lou, Hao Dai, Shansong Yang and Baogang Wei
ABSTRACT: Web table understanding has recently attracted a number of studies. However, most work focuses on tables in English, because it usually needs the help of knowledge bases, and existing knowledge bases such as DBpedia, YAGO, Freebase and Probase mainly contain knowledge in English.
|
16:20-16:40 |
Weiming Lu, Renjie Lou, Hao Dai, Zhenyu Zhang, Shansong Yang and Baogang Wei
ABSTRACT: Taxonomy is an important component of knowledge bases, and constructing a Chinese taxonomy is an urgent, meaningful but challenging task. In this paper, we propose a taxonomy induction approach that builds a taxonomy from a Chinese encyclopedia using combinatorial optimization. First, subclass-of relations are derived by validating the relation between pairs of categories. Then, integer programming optimizations are applied to find instance-of relations from encyclopedia articles, taking the constraints among categories into account. The experimental results show that our approach can construct a practicable taxonomy from Chinese encyclopedias.
|
16:40-17:00 |
Weiwei Wang, Zhigang Wang, Liangming Pan, Yang Liu and Jiangtao Zhang
ABSTRACT: In this paper, we focus on extracting RDF triples from tables in Chinese encyclopedias. First, we construct a Chinese knowledge base through taxonomy mining and class attribute mining. Then, with the help of this knowledge base, we extract triples from tables through column scoring, table classification and RDF extraction. In our experiments, we applied our approach to 6,618,544 articles from Hudong Baike containing 764,292 tables, and extracted about 1,053,407 unique new RDF triples with an estimated accuracy of 90.2%, outperforming other similar works.
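A much-simplified sketch of the column-scoring and extraction steps, with a two-name set standing in for the mined knowledge base; the paper's actual scoring and table classification are richer:

```python
# Score each column by how many of its cells are known entities, pick the
# subject column, and emit one triple per remaining cell. The entity set and
# table are invented placeholders.
KNOWN_ENTITIES = {"刘德华", "张学友"}           # stand-in knowledge base

def subject_column(rows):
    def score(col):
        return sum(r[col] in KNOWN_ENTITIES for r in rows) / len(rows)
    return max(range(len(rows[0])), key=score)

def extract_triples(header, rows):
    subj = subject_column(rows)
    return [(r[subj], header[col], r[col])
            for r in rows for col in range(len(r)) if col != subj]

header = ["姓名", "出生年份", "职业"]
rows = [["刘德华", "1961", "演员"], ["张学友", "1961", "歌手"]]
for t in extract_triples(header, rows):
    print(t)        # ('刘德华', '出生年份', '1961') ...
```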
|
NLP Applications:
Oct. 13, 2015(15:40–17:00),
Treasure Palace Hotel-V18 Meeting Room(四楼V18),
Chair: Wenbin JIANG
|
15:40-16:00 |
Weizhi Ma, Min Zhang, Yiqun Liu, Shaoping Ma and Lingfeng Chen
ABSTRACT: Tags are used in many social media platforms, such as Delicious, Flickr, LinkedIn and Weibo. Previous work has made considerable efforts to exploit tags without identifying their different types. In this study, we argue that the tags in a user profile convey three different types of information: the basics (age, status, locality, etc.), interests, and the specialty of a person. Based on this novel user tag taxonomy, we propose a tag classification approach for Weibo that paints a clearer image of user profiles, making use of three categories of features: general statistics features (including user links with followers and followings), content features and syntactic features. Furthermore, unlike many previous tag studies that concentrate on user specialties, such as expert finding, we find that valuable information can be discovered from the basics and interests tags. We report interesting findings in two scenarios, profiling users from different generations and profiling areas with mass appeal, based on large-scale tag clustering and mining over 6 million distinct tags from 13 million Weibo users.
|
16:00-16:20 |
Caixia Yuan, Xiaojie Wang and Ziming Zhong
ABSTRACT: This paper presents a purely data-driven approach for generating natural language (NL) expressions from their corresponding semantic representations. Our aim is to exploit a parsing paradigm for the natural language generation (NLG) task: we first encode semantic representations with a situated probabilistic context-free grammar (PCFG), then decode and yield natural sentences at the leaves of the optimal parse tree. We deployed our system in two different domains, response generation for a Chinese spoken dialogue system and instruction generation for a virtual environment in English, obtaining results comparable to state-of-the-art systems in terms of both BLEU scores and human evaluation.
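The sketch below shows only the generative direction of a PCFG, sampling expansions top down from an invented toy grammar that maps a dialogue-act frame to words; the paper's system instead searches for the optimal parse of a semantic representation:

```python
# Top-down sampling from a toy PCFG. Grammar, symbols and probabilities are
# invented for illustration.
import random

PCFG = {   # lhs -> list of (rhs, probability)
    "S":      [(["INFORM"], 1.0)],
    "INFORM": [(["the", "flight", "departs", "at", "TIME"], 0.7),
               (["departure", "time", "is", "TIME"], 0.3)],
    "TIME":   [(["nine", "am"], 0.5), (["noon"], 0.5)],
}

def sample(symbol):
    if symbol not in PCFG:                  # terminal word
        return [symbol]
    rules, probs = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=probs)[0]
    return [w for s in rhs for w in sample(s)]

random.seed(1)
print(" ".join(sample("S")))
```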
|
16:20-16:40 |
Zhongping Liang, Caixia Yuan, Bing Leng and Xiaojie WANG
ABSTRACT: This paper focuses on recognizing person relations indicated by predicates in large-scale free text. To determine whether a sentence contains a potential relation between persons, we cast the problem as a classification task and improve a Dynamic Convolutional Neural Network (DCNN) for it, using frame convolution to exploit more features efficiently. Experimental results on Chinese person relation recognition show that the proposed model is superior to the original DCNN and several strong baseline models. We also explore employing large-scale unlabeled data to achieve further improvements.
|
16:40-17:00 |
Wang Baoxin, Zheng Dequan, Wang Xiaoxue, Zhao Shanshan and Zhao Tiejun
ABSTRACT: This paper proposes a method to compute textual entailment strength, aimed at the phenomenon of a long text entailing a short one, taking multiple-choice questions from everyday life and study as the research object since they have clear candidate answers. In the absence of large-scale question-answer data, two methods are used to answer college entrance examination geography multiple-choice questions over the Wikipedia Chinese Corpus: one based on sentence similarity and the other based on the proposed textual entailment measure. The proposed method reaches an accuracy of 36.93%, 2.44% higher than the approach based on word-embedding sentence similarity and 7.66% higher than the approach based on Vector Space Model sentence similarity, confirming the effectiveness of the textual entailment based method.
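One plausible reading of "entailment strength" is directional lexical coverage, sketched below with exact-match similarity standing in for word embeddings; this is an illustration, not the paper's formula:

```python
# Entailment strength as how well the long text covers the short text.
def entailment_strength(long_text, short_text, sim=None):
    sim = sim or (lambda a, b: 1.0 if a == b else 0.0)   # embedding stand-in
    long_words = long_text.split()
    short_words = short_text.split()
    return sum(max(sim(w, lw) for lw in long_words)
               for w in short_words) / len(short_words)

# Toy geography question: pick the option most entailed by the passage.
passage = "季风 气候 夏季 高温 多雨 冬季 温和 少雨"
options = {"A": "夏季 高温 多雨", "B": "全年 炎热 干燥"}
best = max(options, key=lambda k: entailment_strength(passage, options[k]))
print(best)   # 'A': the passage entails option A more strongly
```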
|
15:40-16:00 |
Ma Chunping and Chen Wenliang
ABSTRACT: Recommender systems have emerged as a powerful e-business tool to narrow the gap between customers and providers. Traditional recommender algorithms predict a user's preference for an item from the user's historical behavior. Thanks to the rapid development of the Internet, more and more users share their experience and wisdom through online reviews, which have become a hot research topic for recommender systems. This paper proposes an approach that mines user opinions to build a personalized model for each user and item. Experimental results on a real dataset show that the proposed approach improves the accuracy of rating prediction.
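A minimal sketch of the general direction, shifting a classic bias-based rating predictor by an opinion score mined from the user's review of the item; the lexicon and interpolation weight are invented:

```python
# Opinion-adjusted rating prediction sketch.
OPINION = {"great": 1.0, "good": 0.5, "poor": -1.0}   # toy opinion lexicon

def opinion_score(review):
    hits = [OPINION[w] for w in review.lower().split() if w in OPINION]
    return sum(hits) / len(hits) if hits else 0.0

def predict(global_mean, user_bias, item_bias, review=None, weight=0.5):
    """Classic bias model shifted by the mined opinion of the review text."""
    base = global_mean + user_bias + item_bias
    return base + weight * opinion_score(review or "")

# user tends to rate low (-0.3), item is popular (+0.4), review is positive
print(predict(3.5, -0.3, 0.4, "great battery and good screen"))  # 3.975
```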
|