Sunday, December 7, 2014 |
14:30–21:00 |
Kylin Villa Hotel(深圳麒麟山庄), 1st Floor |
Registration |
17:00–20:00 |
Kylin Villa Hotel-WUTONG HALL |
Dinner (Buffet) |
19:30–21:30 |
Kylin Villa Hotel-FENGHUANG HALL |
TCCI Business Meeting (TCCI Members Only) |
Monday, December 8, 2014 |
08:00–08:50 |
Kylin Villa Hotel(深圳麒麟山庄), 1st Floor |
Registration |
08:50–09:10 |
Kylin Villa Hotel-KYLIN HALL |
Opening Ceremony |
Session Chair: CHEN Qingcai |
09:10–10:10 |
Invited Talk 1, by Prof. Hwee Tou Ng: Automated Grammatical Error Correction: The State of the Art |
Kylin Villa Hotel-KYLIN HALL |
Session Chair: NIE Jianyun |
Abstract: There is recent increased interest in automated grammatical error correction, which automatically detects and corrects grammatical errors in essays written by learners of English. In particular, four shared tasks within the natural language processing community were organized in the past four years (HOO 2011, HOO 2012, CoNLL 2013, and CoNLL 2014). In this talk, Prof. Ng will give an overview of the two recent CoNLL shared tasks where he was the lead organizer, and present the findings of the shared tasks. Prof. Ng will review recent approaches that achieve state-of-the-art performance in grammatical error correction, and suggest possible future research directions.
Keynote Speaker: Prof. Hwee Tou Ng
Dr. Hwee Tou Ng is Provost's Chair Professor of Computer Science at the National University of Singapore (NUS) and a Senior Faculty Member at the NUS Graduate School for Integrative Sciences and Engineering. He received a PhD in Computer Science from the University of Texas at Austin, USA. His research focuses on natural language processing and information retrieval. He is a Fellow of the Association for Computational Linguistics (ACL).
He has published papers in premier journals and conferences, including Computational Linguistics, Journal of Artificial Intelligence Research (JAIR), ACM Transactions on Information Systems (TOIS), ACL, EMNLP, SIGIR, AAAI, and IJCAI. His papers received the Best Paper Award at EMNLP 2011 and SIGIR 1997. He is an action editor of the Transactions of the Association for Computational Linguistics (TACL), an editorial board member of Natural Language Engineering, and a steering committee member of ACL SIGNLL.
He has also served as the Editor-in-Chief of ACM Transactions on Asian Language Information Processing (TALIP) (May 2007 − May 2013) and an editorial board member of Computational Linguistics (2004 − 2006) and Journal of Artificial Intelligence Research (JAIR) (Sep 2008 − Aug 2011). He was an elected member of the ACL executive committee (2008 − 2010) and a former secretary of ACL SIGNLL. He was program co-chair of EMNLP 2008, ACL 2005, and CoNLL 2004 conferences, and has served as area chair of ACL, EACL, EMNLP, SIGIR, and AAAI conferences and as session chair and program committee member of many past conferences including ACL, EMNLP, SIGIR, AAAI, and IJCAI.
|
10:10–10:30 |
Address (TBD) |
Group Photo |
10:30–10:50 |
Kylin Villa Hotel-KYLIN HALL |
Coffee/Tea Break |
10:50–11:50 |
Invited Talk 2, by Prof. Eduard Hovy: Progress in Computational Semantics: Structured Distributions and their Applications |
Kylin Villa Hotel-KYLIN HALL |
Session Chair: LIN Dekang |
Abstract: Over the past few years, many NLP researchers have turned toward Distributional Semantics models and Deep Learning to overcome problems that traditional propositional models of semantics simply cannot handle.
Distributional Semantics (DS) assumes that [much of] the semantics of a word can be captured by a distribution of words associated with it and represented as a vector. But DS models also have shortcomings, notably in compositionality: it is unclear how (and whether) one can (or should) compose two vectors to arrive at a third, in such a way that the product also reflects [much of] the semantics of the compound.
Deep learning methods such as Recursive Neural Nets are one way to perform composition that seems to work. But certain phenomena are still beyond the reach of deep learning. Dr. Hovy postulates that decomposing the vector into a tensor, by separately representing the distributions of words associated with specific relations off the target word, provides additional representational power. The result, which we call the Structured Distributional Semantic Model (SDSM), is the basis of a series of experiments on automatically discovering the (latent) relations, building and compressing the tensor space, measuring the similarity between two structured distributed representations, and applying the model to perform word similarity detection, coreference resolution, anomaly detection, and other tasks.
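To make the composition problem concrete, here is a minimal sketch (illustrative only, not from the talk): toy distributional vectors and the common additive composition rule, which is exactly the step SDSM refines by keeping per-relation distributions in a tensor instead of a single vector.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two distributional vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy distributional vectors; in practice these would be learned from
# corpus co-occurrence counts.
red = np.array([0.9, 0.1, 0.3])
car = np.array([0.2, 0.8, 0.5])
truck = np.array([0.1, 0.9, 0.4])

# Additive composition: one common (and debatable) way to build a phrase
# vector from its parts. Whether the sum still reflects the semantics of
# "red car" is precisely the open question raised above.
red_car = red + car
print(cosine(red_car, truck))
```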
Keynote Speaker: Prof. Eduard Hovy
Eduard Hovy is a member of the Language Technology Institute in the School of Computer Science at Carnegie Mellon University. He holds adjunct professorships at universities in China and Canada, and is co-Director of Research for the DHS Center for Command, Control, and Interoperability Data Analytics, a distributed cooperation of 17 universities. Dr. Hovy completed a Ph.D. in Computer Science (Artificial Intelligence) at Yale University in 1987, and was awarded an honorary doctorate from the National Distance Education University (UNED) in Madrid in 2013. From 1989 to 2012 he directed the Human Language Technology Group at the Information Sciences Institute of the University of Southern California.
Dr. Hovy’s research addresses several areas in Natural Language Processing, including machine reading of text, question answering, information extraction, automated text summarization, the semi-automated construction of large lexicons and ontologies, and machine translation. His contributions include the co-development of the ROUGE text summarization evaluation method, the BLANC coreference evaluation method, the Omega ontology, the Webclopedia QA Typology, the FEMTI machine translation evaluation classification, and a model of Structured Distributional Semantics.
Dr. Hovy is the author or co-editor of six books and over 300 technical articles and is a popular invited speaker.
In 2001 Dr. Hovy served as President of the Association for Computational Linguistics (ACL) and in 2001–03 as President of the International Association of Machine Translation (IAMT). Dr. Hovy regularly co-teaches courses and serves on Advisory Boards for institutes and funding organizations in Germany, Italy, Netherlands, and the USA.
|
11:50–13:30 |
Kylin Villa Hotel-WUTONG HALL |
Lunch (Buffet) |
13:30–15:10 |
Kylin Villa Hotel-FENGHUANG HALL |
Kylin Villa Hotel-MUMIAN HALL |
Kylin Villa Hotel-ZIJING HALL |
Fundamentals |
Machine Translation |
Shared Task |
15:10–15:30 |
Address (TBD) |
Coffee/Tea Break |
15:30–17:10 |
Kylin Villa Hotel-FENGHUANG HALL |
Kylin Villa Hotel-MUMIAN HALL |
Kylin Villa Hotel-ZIJING HALL |
Machine Learning for NLP |
CIT Applications 1 |
Evaluation Workshop (Chinese Session) |
17:30–21:00 |
Kylin Villa Hotel-KYLIN HALL |
Poster/Demo Presentations and Banquet, Innovation Demo |
Tuesday, December 9, 2014 |
08:30–09:30 |
Invited Talk 3, by Prof. LIU Bing: Sentiment Analysis and Lifelong Machine Learning |
Kylin Villa Hotel-KYLIN HALL |
Session Chair: ZHOU Guodong |
Abstract: Sentiment analysis (SA) or opinion mining is the computational study of people’s opinions, sentiments, attitudes, and emotions. Due to almost unlimited applications and numerous research challenges, SA has been a very active research area in natural language processing (NLP) and data mining. SA is regarded as a semantic analysis problem, but one that is also highly targeted and bounded, because an SA system does not need to fully “understand” a sentence or document; it only needs to comprehend some aspects of its meaning, e.g., positive/negative opinions and their targets.
This targeted nature allows us to perform deeper language analyses to gain better insights into NLP than in the general setting, whose complexity is simply too overwhelming. Thus, although general NL understanding is still far off, we may be able to solve the SA problem satisfactorily. In this talk, I will first introduce SA and its challenges, and then discuss in detail a recent study that aims to solve an SA problem but also contributes significantly to machine learning in the areas of lifelong learning and topic modeling.
Keynote Speaker: Prof. LIU Bing
Bing Liu is a professor of Computer Science at the University of Illinois at Chicago (UIC). He received his PhD in Artificial Intelligence from the University of Edinburgh. Before joining UIC, he was a faculty member at the National University of Singapore. His current research interests include sentiment analysis and opinion mining, data mining, machine learning, and natural language processing (NLP). He has published extensively in top conferences and journals. He is also the author of two books: “Sentiment Analysis and Opinion Mining” (Morgan and Claypool) and “Web Data Mining:
Exploring Hyperlinks, Contents and Usage Data” (Springer). In addition to his research contributions, his work has also made important social impacts.
Some of his work has been widely reported in the press, including a front-page article in The New York Times. In professional service, Liu has served as program chair of many leading ACM, IEEE, and SIAM data mining conferences (KDD, ICDM, CIKM, WSDM, SDM, and PAKDD), as associate editor of several leading data mining journals (e.g., TKDE, TWEB, DMKD), and as area/track chair or senior technical committee member of numerous NLP, data mining, and Web technology conferences. He currently serves as the Chair of ACM SIGKDD and is an IEEE Fellow.
|
09:30–10:20 |
Best Papers |
Kylin Villa Hotel-KYLIN HALL |
Session Chair: ZONG Chengqing |
10:20–10:40 |
Address (TBD) |
Coffee/Tea Break |
10:40–12:10 |
Panel: The Opportunity and Challenge for AI in the Big Data Era |
Kylin Villa Hotel-KYLIN HALL |
Moderator: ZHOU Ming |
Panel Info:
We are living in an exciting era in which big data, cloud computing, the mobile internet and social networks join forces to drive innovations of all kinds that change the way people live and work. Motivated by this trend, people have renewed their desire for artificial intelligence (AI). AI is the ability that makes a computer system listen, speak, read, write, search, make decisions, answer questions and solve problems. Almost all current great innovations, including search engines, "brain" projects, speech assistants and spoken-language translators, are supported by big data and the internet, and powered by machine learning and deep learning in their theory and methodology. AI systems rely on the powerful capabilities of natural language understanding and generation, which lie at the core of the theme of the NLPCC conference. Therefore, NLPCC 2014 organizes this panel session on The Opportunity and Challenge for AI in the Big Data Era during the regular program. The session will cover hot topics around the noteworthy progress of AI and NLP, deep learning, knowledge graphs and their applications to NLP tasks. The panelists will present their viewpoints from various perspectives on the opportunities and challenges we will face, and the audience will be invited to have face-to-face discussions with the panelists. We hope that this panel session will inspire everybody at the conference to generate thoughtful ideas on new innovations and help drive the state of the art in NLP and Chinese computing.
The panelists include:
Xuanjing Huang, Fudan University
Dekang Lin, Google
Bing Liu, University of Illinois at Chicago
Jianyun Nie, University of Montreal
|
12:10–13:30 |
Kylin Villa Hotel-WUTONG HALL |
Lunch (Buffet) |
13:30–15:10 |
Kylin Villa Hotel-FENGHUANG HALL |
Kylin Villa Hotel-MUMIAN HALL |
Kylin Villa Hotel-ZIJING HALL |
Sentiment Analysis |
Information Extraction |
CIT Applications 2 |
15:10–15:30 |
Address (TBD) |
Coffee/Tea Break |
15:30–17:10 |
Kylin Villa Hotel-FENGHUANG HALL |
Kylin Villa Hotel-MUMIAN HALL |
Kylin Villa Hotel-ZIJING HALL |
NLP for Social Media |
IR & QA |
Open Fund & Expo (Chinese Session) |
14:00–17:00 |
Building A, 5th Floor (A栋5楼), Shenzhen Graduate School, HIT |
NLPCC Campus Open Day |
Best Paper:
Dec. 9, 2014 (09:30–10:20),
Kylin Villa Hotel-KYLIN HALL,
Chair: ZONG Chengqing
|
09:30-09:55 |
Guangyou Zhou, Tingting He and Jun Zhao
ABSTRACT: Cross-lingual sentiment classification aims to automatically predict the sentiment polarity (e.g., positive or negative) of data in a label-scarce target language by exploiting labeled data from a label-rich language. The fundamental challenge of cross-lingual learning stems from the lack of overlap between the feature space of the source language data and that of the target language data. To address this challenge, previous work in the literature mainly relies on machine translation engines or bilingual lexicons to directly adapt labeled data from the source language to the target language. However, machine translation may change the sentiment polarity of the original data. In this paper, we propose a new model which uses stacked autoencoders to learn language-independent distributed representations for the source and target languages in an unsupervised fashion. Sentiment classifiers trained on the source language can then be adapted to predict the sentiment polarity of the target language with these language-independent distributed representations. We conduct extensive experiments on English-Chinese sentiment classification tasks over multiple data sets. Our experimental results demonstrate the efficacy of the proposed cross-lingual approach.
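As a rough illustration of the autoencoder idea (not the authors' implementation): a single hidden layer stands in for the stacked architecture, scikit-learn's MLPRegressor is used as a makeshift autoencoder by regressing the input onto itself, and the toy bag-of-words input stands in for a merged source/target vocabulary.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy bag-of-words matrix; in the paper's setting the rows would mix
# source-language documents and their translations so that the learned
# code layer becomes language-independent.
rng = np.random.default_rng(0)
X = (rng.random((200, 50)) < 0.1).astype(float)

# An autoencoder emulated by regressing the input onto itself.
ae = MLPRegressor(hidden_layer_sizes=(16,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# The hidden activations are the distributed representation on which a
# sentiment classifier would then be trained.
H = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])
print(H.shape)  # (200, 16)
```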
|
09:55-10:20 |
Kun XU, Sheng Zhang, Yansong Feng and Dongyan Zhao
ABSTRACT: Understanding natural language questions and converting them into structured queries has been considered a crucial way to help users access large-scale structured knowledge bases. However, the task usually involves two main challenges: recognizing users’ query intention and mapping the involved semantic items against a given knowledge base (KB). In this paper, we propose an efficient pipeline framework to model a user’s query intention as a phrase-level dependency DAG, which is then instantiated against a specific KB to construct the final structured query. Our model benefits from the efficiency of linear structured prediction models and the separation of KB-independent and KB-related modeling. We evaluate our model on two datasets, and the experimental results show that our method outperforms the state-of-the-art methods on the Free917 dataset and, with limited training data from Free917, can smoothly adapt to the more challenging WebQuestions dataset without extra training effort while maintaining promising performance.
|
Fundamentals:
Dec. 8, 2014 (13:30–15:10),
Kylin Villa Hotel-FENGHUANG HALL,
Chair: KONG Fang
|
13:30-13:50 |
Haitong Yang and Chengqing Zong
ABSTRACT: The predicate and its semantic roles compose a unified entity that conveys the semantics of a given sentence. The standard pipeline of current approaches to semantic role labeling (SRL) is that, for a given predicate in a sentence, we extract features for each candidate argument and then perform role classification with a classifier. However, this process totally ignores the integrality of the predicate and its semantic roles. To address this problem, we present a global generative model in which a novel concept called Predicate-Arguments-Coalition (PAC) is proposed to encode the relations among individual arguments. Owing to PAC, our model can effectively mine the inherent properties of predicates and obtain a globally consistent solution for SRL. We conduct experiments on a standard benchmark, the Chinese PropBank. Experimental results on a single syntactic tree show that our model outperforms the state-of-the-art methods.
|
13:50-14:10 |
Fang Kong and Guodong Zhou
ABSTRACT: Chinese comma disambiguation plays a key role in many natural language processing (NLP) tasks. This paper proposes a joint approach to Chinese comma disambiguation that combines K-best parse trees to reduce the dependence on syntactic parsing. Experimental results on a Chinese comma corpus show that the proposed approach significantly outperforms the baseline system. To the best of our knowledge, this is the first work to improve the performance of Chinese comma disambiguation with K-best parse trees. Moreover, we release a Chinese comma corpus which adds a layer of annotation to the manually parsed sentences in the CTB (Chinese Treebank) 6.0 corpus.
|
14:10-14:30 |
Tingsong Jiang, Lei Sha and Zhifang Sui
ABSTRACT: An event schema, which comprises a set of related events and participants, is of great importance with the development of information extraction (IE), and inducing event schemas is a prerequisite for IE and natural language generation. Event schemas and slots are usually designed manually for traditional IE tasks; methods for inducing event schemas automatically have been proposed recently. One of the fundamental assumptions in event schema induction is that related events tend to appear together in natural-language discourse to describe a scenario; however, previous work only focused on co-occurrence within one document. We find that the co-occurrence of semantically typed relational tuples across multiple documents is helpful for constructing event schemas. We exploit this cross-document co-occurrence by locating the key tuple and counting relational tuples, and build a co-occurrence graph which takes account of co-occurrence information over multiple documents. Experiments show that co-occurrence information over multiple documents helps to combine similar elements of event schemas as well as to alleviate incoherence problems.
|
14:30-14:50 |
Bowei Zou, Guodong Zhou and Qiaoming Zhu
ABSTRACT: Negation and speculation are common in natural language text. Many applications, such as biomedical text mining and clinical information extraction, seek to distinguish positive/factual objects from negative/speculative ones (i.e., to determine what is negated or speculated) in biomedical texts. This paper proposes a novel task, called negation and speculation target identification, to identify the target of a negative or speculative expression. For this purpose, a new layer of target information is incorporated over the BioScope corpus, and a machine learning algorithm is proposed to automatically identify this new information. Evaluation justifies the effectiveness of our proposed approach on negation and speculation target identification in biomedical texts.
|
14:50-15:10 |
Yancui Li, Jing Sun and Guodong Zhou
ABSTRACT: Chinese discourse structure analysis is helpful for many NLP applications, and Chinese discourse connective annotation and recognition is the basic task of discourse structure analysis. This paper introduces annotation methods for Chinese discourse connectives and annotates a discourse connective corpus. Syntactic, lexical and position features are then extracted from both automatically parsed and gold-standard syntax trees to recognize and classify connectives. Experimental results show that connective recognition achieves an F1-measure of 69.2%, and connective classification an accuracy of 89.1%.
|
Machine Translation:
Dec. 8, 2014 (13:30–15:10),
Kylin Villa Hotel-MUMIAN HALL,
Chair: ZHAO Tiejun
|
13:30-13:50 |
Jinhua Du, Miaomiao Wang and Meng Zhang
ABSTRACT: This paper presents a simple but effective sentence-length-informed method to select informative sentences for active learning (AL) based SMT. A length factor is introduced to penalize short sentences so as to balance the “exploration” and “exploitation” problem. The penalty is dynamically updated at each iteration of sentence selection as the ratio of the current candidate sentence length to the overall average sentence length of the monolingual corpus. Experimental results on the NIST Chinese-English pair and the WMT French-English pair show that the proposed sentence-length-penalty-based method performs best compared with a typical selection method and a random selection strategy.
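The abstract pins down the penalty as the ratio of the candidate's length to the pool's average length, recomputed each iteration; the exact way that ratio scales the informativeness score in the sketch below is our assumption.

```python
def selection_score(informativeness, sent_len, avg_len):
    # Penalize short sentences: scale the base score by the length ratio,
    # capped at 1 so long sentences are not over-rewarded (assumed form).
    return informativeness * min(1.0, sent_len / avg_len)

# One AL iteration: avg_len is recomputed over the remaining monolingual
# pool, so the penalty is updated dynamically as the pool shrinks.
pool = [("short sentence", 0.9),
        ("a much longer and presumably more informative sentence", 0.7)]
avg_len = sum(len(s.split()) for s, _ in pool) / len(pool)
ranked = sorted(pool, key=lambda p: -selection_score(p[1], len(p[0].split()), avg_len))
print([s for s, _ in ranked])
```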
|
13:50-14:10 |
Chaochao Wang, Deyi Xiong and Min Zhang
ABSTRACT: The computation of the semantic similarity of translation equivalences is one of the core problems in semantics-based statistical machine translation. This paper proposes a translation similarity model based on bilingual compositional semantics, in an attempt to integrate a bilingual semantic similarity feature into the decoding process to improve translation quality. In the proposed model, we obtain monolingual compositional vectors for phrases on the source and target sides respectively using a distributional approach. These monolingual vectors are then projected onto the same semantic space and thereby transformed into bilingual compositional vectors. Based on this semantic space, we calculate the translation similarity between source phrases and their corresponding target phrases. The computed similarities are integrated into the decoder as a new feature. Experiments on the Chinese-to-English NIST 06 and NIST 08 test sets show that the proposed model significantly outperforms the baseline by 0.56 and 0.42 BLEU points respectively.
|
14:10-14:30 |
Chen Su, Yujie Zhang, Zhen Guo and Jin'an Xu
ABSTRACT: The performance of statistical machine translation (SMT) suffers from the insufficiency of parallel corpora. To address this problem, this paper proposes a paraphrase-based SMT framework with three components: (1) acquiring paraphrase knowledge based on a third language; (2) expressing multiple paraphrases of the input sentence in a lattice and modifying the decoder to process it; (3) integrating paraphrase knowledge as features into the log-linear model. In this way, not only can more expressions in the source language be covered, but more expressions in the target language can also be generated as candidate translations. To verify our method, we conduct experiments on three training data sets of different sizes and evaluate the improvement in SMT performance contributed by paraphrasing. The experimental results show that translation performance improves significantly (+1.4 BLEU) when the parallel corpus is small (10K), and a good improvement (+0.32 BLEU) is also achieved when the parallel corpus is large (1M).
|
14:30-14:50 |
Sitong Yang, Heng Yu and Qun Liu
ABSTRACT: Post-editing has been successfully applied to correct the output of MT systems to generate better translations, but as a downstream task its positive feedback to MT has not been well studied. In this paper, we present a novel rule refinement method which uses Simulated Post-Editing (SiPE) to capture the errors made by MT systems and generates refined translation rules. Our method is system-independent and does not entail any additional resources. Experimental results on large-scale data show a significant improvement over both phrase-based and syntax-based baselines.
|
14:50-15:10 |
Jiangming Liu, Jinan Xu and Yujie Zhang
ABSTRACT: The hierarchical phrase-based (HPB) model has two main problems. First, without any semantic guidance, large numbers of redundant rules are extracted. Second, it cannot efficiently capture long-distance reordering. This paper proposes a novel approach that exploits case frames in the HPB model in both rule extraction and decoding. Case frames are developed from case grammar theory; they capture sentence structure and assign components different case information. Our case-frame-constrained system retains long-distance reordering and phrases within case chunks of the dependency tree, while the number of HPB rules decreases under the case frame constraints. Experiments carried out on Japanese-Chinese test sets show that our approach yields improvements over the HPB model (+1.48 BLEU on average).
|
Shared Task:
Dec. 8, 2014 (13:30–15:10),
Kylin Villa Hotel-ZIJING HALL,
Chair: WAN Xiaojun
|
13:30-13:50 |
Chengxin Li, Qin Jin and Huimin Wu
ABSTRACT: Sentiment analysis has been a hot research topic in recent years. Emotion classification is a more detailed form of sentiment analysis which cares about more than the polarity of sentiment. In this paper, we present our system for emotion analysis of Sina Weibo texts at both the document and sentence level, which detects whether a text is sentimental and further decides which emotion classes it conveys. The emotions of focus are seven basic emotion classes: anger, disgust, fear, happiness, like, sadness and surprise. Our baseline system uses a supervised machine learning classifier (support vector machine, SVM) based on bag-of-words (BoW) features. In a contrast system, we propose a novel approach that constructs an emotion lexicon and generates a new feature representation of text named the emotion vector (eVector). Our experimental results show that both systems classify emotion significantly better than random guessing. Fusing both systems obtains an additional gain, which indicates that they capture complementary information.
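A minimal sketch of what an eVector-style feature could look like (the lexicon entries below are invented placeholders; the paper constructs its emotion lexicon automatically):

```python
from collections import Counter

EMOTIONS = ["anger", "disgust", "fear", "happiness", "like", "sadness", "surprise"]

# Hypothetical word -> emotion lexicon; the paper derives this resource
# from data rather than listing it by hand.
lexicon = {"开心": "happiness", "讨厌": "disgust", "害怕": "fear"}

def emotion_vector(tokens):
    # Represent a segmented text as counts over the seven emotion classes.
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    return [counts.get(e, 0) for e in EMOTIONS]

print(emotion_vector(["今天", "很", "开心", "开心"]))  # happiness counted twice
```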
|
13:50-14:10 |
Mingqiang Wang, Mengting Liu, Shi Feng, Daling Wang and Yifei Zhang
ABSTRACT: Microblogging services have become increasingly popular for people to exchange their feelings and opinions. Extracting and analyzing the sentiments in microblogs have drawn extensive attention from both academic researchers and commercial companies. Previous literature usually focused on classifying microblogs into positive or negative categories. However, people’s sentiments are much more complex, and multiple fine-grained emotions may coexist in just one short microblog text. In this paper, we regard emotion analysis as a multi-label learning problem and propose a novel calibrated-label-ranking-based framework for detecting multiple fine-grained emotions in Chinese microblogs. We combine learning-based and lexicon-based methods to build unified emotion classifiers, which alleviates the sparsity of the training microblog dataset. Experimental results using the NLPCC 2014 evaluation dataset show that our proposed algorithm achieved the best performance and significantly outperforms other participants’ methods.
|
14:10-14:30 |
Baoli LI
ABSTRACT: As the number of classes is quite large in a hierarchical text categorization problem, it usually costs much to obtain a training dataset of reasonable size and sample distribution. In this paper, several strategies are proposed and compared for generating new training samples from the class hierarchy in a hierarchical text classification problem. These solutions try to make full use of the class hierarchy (including class names, their descriptions if any, and the relationships between them) and derive new pseudo training samples based on the connotations and extensions of classes. Experiments on the dataset of the first large-scale Chinese news categorization task at NLPCC 2014 show that the localized expanding strategy based on class extensions performs better. Our official system achieved MacroF1 scores of 0.8413 and 0.7139 at level 1 and level 2 respectively, which ranked our system second among the 10 participating systems.
|
14:30-14:50 |
Large Scale Chinese News Categorization
Peng Wang
|
Machine Learning for NLP:
Dec. 8, 2014 (15:30–17:10),
Kylin Villa Hotel-FENGHUANG HALL,
Chair: MA Jun
|
15:30-15:50 |
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen and Yan He
ABSTRACT: Semantic matching is widely used in many natural language processing tasks. In this paper, we focus on semantic matching between short texts and design a model to generate deep features which describe the semantic relevance between short “text objects”. Furthermore, we design a method to combine shallow features of short texts (i.e., LSI, VSM and some other handcrafted features) with deep features of short texts (i.e., word embedding matching of short texts). Finally, a ranking model (RankSVM) is used to make the final judgment. To evaluate our method, we apply it to the task of matching posts and responses. Experimental results show that our method achieves state-of-the-art performance by using both shallow and deep features.
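A hedged sketch of combining one shallow and one deep similarity feature per post-response pair (`embed` is a hypothetical word-to-vector lookup standing in for trained word embeddings; the paper additionally uses LSI and further handcrafted features, and feeds the combined vector to RankSVM):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(post, response, embed):
    # Shallow feature: TF-IDF cosine between the two short texts.
    tfidf = TfidfVectorizer().fit([post, response])
    shallow = cosine_similarity(tfidf.transform([post]),
                                tfidf.transform([response]))[0, 0]
    # Deep feature: cosine of averaged word embeddings (assumes at least
    # one in-vocabulary word on each side).
    v1 = np.mean([embed[w] for w in post.split() if w in embed], axis=0)
    v2 = np.mean([embed[w] for w in response.split() if w in embed], axis=0)
    deep = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return [shallow, deep]  # input row for a ranking model such as RankSVM

embed = {"hello": np.array([1.0, 0.0]), "hi": np.array([0.9, 0.1]),
         "there": np.array([0.0, 1.0])}
print(pair_features("hello there", "hi there", embed))
```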
|
15:50-16:10 |
HuiWei Zhou, Long Chen and Degen Huang
ABSTRACT: Sentiment classification systems rely on high-quality emotional resources. However, these resources are imbalanced across languages. How to leverage the rich labeled data of one language (the source language) for sentiment classification in a resource-poor language (the target language), namely cross-lingual sentiment classification (CLSC), has become a focus topic. This paper utilizes rich English resources for Chinese sentiment classification. To eliminate the language gap between English and Chinese, we propose a combination CLSC approach based on denoising autoencoders. First, two classifiers based on denoising autoencoders are learned in the English and Chinese views respectively, using an English corpus and an English-to-Chinese corpus. Second, we classify the Chinese test data and the Chinese-to-English test data with the two classifiers trained in the two views. Last, the final sentiment classification results are obtained by combining the two results from the two views. Experiments are carried out on the NLP&CC 2013 CLSC dataset, which includes book, DVD and music categories. The results show that our approach achieves an accuracy of 80.02%, outperforming the current state-of-the-art systems.
|
16:10-16:30 |
Zhuoren Jiang, Yan Chen, Liangcai Gao, Zhi Tang and Xiaozhong Liu
ABSTRACT: An innovative Supervised Dynamic Topic Model (S-DTM) is developed to overcome the limitations of traditional topic models. S-DTM models time-varying language dynamics and is combined with supervised learning technology; by adding label restrictions in topic variational inference, it establishes a topic-label mapping and improves the interpretability of topics. A set of experiments is conducted on a twenty-five-year-spanning Chinese journal paper corpus mainly focused on natural language processing. The experimental results show that, compared with a static supervised topic model and an unsupervised dynamic topic model, S-DTM has better semantic interpretability, reflects the topic structure of a document more accurately, and captures the dynamic evolution of the term distribution of topics more precisely.
|
16:30-16:50 |
Peng Wang, Heng Zhang, Bo Xu, Chenglin Liu and Hongwei Hao
ABSTRACT: In this paper, we propose a novel feature enrichment method for short text classification based on link analysis of a topic-keyword graph. After topic modeling, we re-rank the keyword distributions extracted by the biterm topic model (BTM) to make the topics more salient. Then a topic-keyword graph is constructed and link analysis is conducted. As a complement, K-L divergence is integrated with structural similarity to discover the most related keywords. At last, the short text is expanded by appending these related keywords for classification. Experimental results on two open datasets validate the effectiveness of the proposed method.
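A toy sketch of the final expansion step (the `related` mapping is a hypothetical stand-in for the keyword relatedness scores produced by the link analysis over the topic-keyword graph):

```python
def expand_short_text(tokens, related, k=3):
    # Collect candidate keywords related to any word of the short text,
    # keep each candidate's best score, and append the top-k of them.
    candidates = {}
    for t in tokens:
        for kw, score in related.get(t, []):
            candidates[kw] = max(candidates.get(kw, 0.0), score)
    extra = sorted(candidates, key=candidates.get, reverse=True)[:k]
    return tokens + extra

related = {"apple": [("fruit", 0.9), ("orchard", 0.6)], "pie": [("dessert", 0.7)]}
print(expand_short_text(["apple", "pie"], related))
```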
|
CIT Applications 1:
Dec. 8, 2014 (15:30–17:10),
Kylin Villa Hotel-MUMIAN HALL,
Chair: LI Ning
|
15:30-15:50 |
Wenjie Song, Junsheng Zhou, Yanhui Gu, Yujie Sun and Weiguang Qu
ABSTRACT: Cilin and the Chinese Concept Dictionary are used as dictionary resources in many NLP applications. In this paper, we study strategies for Chinese synonym extraction based on the key words of infoboxes in Baidu Baike and the HTML tags of web pages in Zdic. Meanwhile, DIPRE is applied to discover highly credible patterns and synonymous instances in encyclopedia corpora. Extensive experimental evaluation demonstrates that our proposed strategies outperform the NLP&CC 2012 evaluation results. On this basis, we build a sophisticated, manually proofread synonym dictionary for the noun part of the Grammatical Knowledge-base of Contemporary Chinese, in an effort to help perfect its semantic system.
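A toy single iteration of DIPRE-style bootstrapping over raw text (the paper's actual patterns come from Baidu Baike infobox key words and Zdic HTML tags; this sketch induces literal middle contexts instead):

```python
import re

def dipre_round(corpus, seeds):
    # Induce the middle context between each seed synonym pair, keep
    # contexts seen at least twice as patterns, then match the patterns
    # to harvest new candidate pairs.
    contexts = {}
    for a, b in seeds:
        for text in corpus:
            m = re.search(re.escape(a) + r"(.{1,4}?)" + re.escape(b), text)
            if m:
                contexts[m.group(1)] = contexts.get(m.group(1), 0) + 1
    patterns = [c for c, n in contexts.items() if n >= 2]
    new_pairs = set()
    for text in corpus:
        for p in patterns:
            for m in re.finditer(r"(\w+)" + re.escape(p) + r"(\w+)", text):
                new_pairs.add((m.group(1), m.group(2)))
    return new_pairs - set(seeds)

corpus = ["西红柿又名番茄。", "马铃薯又名土豆。", "玉米又名苞谷。"]
print(dipre_round(corpus, [("西红柿", "番茄"), ("马铃薯", "土豆")]))
```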
|
15:50-16:10 |
Enting Gao, Jiayuan Chao and Zhenghua Li
ABSTRACT: Supervised statistical machine learning methods usually make use of a single manually annotated corpus, which suffers from limited scale and genre coverage. Therefore, researchers try to exploit other existing resources with different annotation standards to boost performance on target resources. Stacked learning is the most representative method in this research line, but it has two weak points: it can only exploit limited language phenomena, and it is inefficient due to the need for two decoding passes. Therefore, this paper proposes an annotation conversion method that uses multiple resources for POS tagging, aiming to convert source-side annotations into the target-side standard and then combine the data to obtain larger training data. This paper proposes two innovative strategies. The first uses the reliability information of guide features. The second uses ambiguous labelings to improve the quality of the converted data. Results demonstrate that the first strategy is helpful for annotation conversion, while the second contributes little.
|
16:10-16:30 |
Linnan Bai, Renfen Hu and Zhiying Liu
ABSTRACT: Traditional teaching methods cannot give effective guidance for the teaching and learning of comparative sentences in the field of TCSL (Teaching Chinese as a Second Language). As a result, automatic recognition of comparative sentences is very significant. This paper proposes a novel rule-based method to identify comparative sentences; the rules encode the syntactic and semantic features of comparative sentences. Comparative markers and comparative result words are significant cues for identifying comparative sentences. On this basis, the paper derives the categories and identification rules of comparative sentences, and five models are designed to recognize each category respectively. Experiments show that the proposed method attains satisfactory results in comparative-sentence parsing and recognition, laying a good foundation for comparative relation extraction.
|
16:30-16:50 |
Chenggang Mi, Yating Yang, Lei Wang, Xiao Li and Kamali Dalielihan
ABSTRACT: For low-resource languages like Uyghur, data sparseness is always a serious problem in related information processing, especially in tasks based on parallel texts. To enrich bilingual resources, we detect Chinese and Russian loan words in Uyghur texts according to the phonetic similarity between a loan word and its corresponding donor-language word. In this paper, we propose a novel approach based on a perceptron model to discover loan words in Uyghur texts, which treats loan word detection in Uyghur as a classification procedure. The experimental results show that our method is capable of detecting Chinese and Russian loan words in Uyghur texts effectively.
|
16:50-17:10 |
Bingquan Liu, Jian Feng, Ming Liu, Feng Liu, Xiaolong Wang and Peng Li
ABSTRACT: The computation of relatedness between two fragments of text or two words is a challenging task in many fields. In this study, we propose a novel method for measuring semantic relatedness between word units and between text units using an iterative process, which we refer to as the word-text mutual guidance (WTMG) method. WTMG combines surface and contextual information when computing word or text relatedness. The iterative process can start in two different ways: calculating the relatedness between texts using the initial relatedness of the words, or computing the relatedness between words using the initial relatedness of the texts. The method obtains the final relatedness result after the iterative process reaches convergence. We compared WTMG with previous relatedness computation methods and obtained obvious improvements in terms of correlation with human judgments.
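One plausible reading of the mutual-guidance iteration, sketched as a linear update (simplified: the paper also mixes surface similarity into each step, and its exact update rule and convergence criterion are not given in the abstract):

```python
import numpy as np

def wtmg(A, iters=20):
    # A[i, j] = 1 if word j occurs in text i. Text relatedness is
    # estimated through related words, then word relatedness through
    # related texts, alternating for a fixed number of rounds.
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    W = np.eye(A.shape[1])              # initial word-word relatedness
    for _ in range(iters):
        T = A @ W @ A.T                 # texts related via related words
        T /= max(T.max(), 1e-12)
        W = A.T @ T @ A                 # words related via related texts
        W /= max(W.max(), 1e-12)
    return T, W

A = np.array([[1, 1, 0], [0, 1, 1]], dtype=float)
T, W = wtmg(A)
print(T.round(2))
```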
|
Evaluation Workshop(Chinese Session):
Dec. 8, 2014 (15:30–17:10),
Kylin Villa Hotel-ZIJING HALL,
Chair: XU Ruifeng
|
15:30-15:50 |
Emotion Analysis in Chinese Weibo Texts (Tsinghua University)
Fei Jiang
|
15:50-16:10 |
Chinese Entity Linking (Southwest University)
Southwest University
|
16:10-16:30 |
Chinese Entity Linking (Beijing University of Aeronautics and Astronautics)
Fuxiang Wu
|
16:30-16:50 |
Chinese Entity Linking (Southwest Jiaotong University)
Zhen Jia
|
Poster/Demo Presentations and Banquet:
Dec. 8, 2014 (17:30–21:10),
Kylin Villa Hotel-KYLIN HALL,
Chair: TBD
|
17:30-19:00 |
Xing Wang, Deyi Xiong, Min Zhang, Yu Hong and Jianmin Yao
ABSTRACT: Reordering models are one of the essential components of statistical machine translation. In this paper, we propose a topic-based reordering model to predict orders for neighboring blocks by capturing topic-sensitive reordering patterns. We automatically learn reordering examples from bilingual training data, which are associated with document-level and word-level topic information induced by an LDA topic model. These learned reordering examples are used as evidence to train a topic-based reordering model built on a maximum entropy (MaxEnt) classifier. We conduct large-scale experiments to validate the effectiveness of the proposed topic-based reordering model on the NIST Chinese-to-English translation task. Experimental results show that our topic-based reordering model achieves significant performance improvement over the conventional reordering model using only lexical information.
|
17:30-19:00 |
Wenxu Long, Jixun Gao, Zhengtao YU, Shengxiang Gao and Xudong Hong
ABSTRACT: On account of the characteristics of online Chinese-Vietnamese topic detection, we propose a Chinese-Vietnamese bilingual topic model based on the Recurrent Chinese Restaurant Process (RCRP) and integrated with event elements. First, the event elements, including the characters, place and time, are extracted from new dynamic bilingual news texts. Then word pairs are tagged and aligned from the bilingual news and comments. Both the event elements and the aligned words are integrated into the RCRP algorithm to construct the proposed bilingual topic detection model. Finally, we use the model to determine whether a new document should be grouped into a new category or classified into the existing categories, and thereby detect topics. Contrast experiments show that the proposed model achieves good topic detection results.
|
17:30-19:00 |
Weitai Zhang, Weiran XU, Guang Chen and Jun Guo
ABSTRACT: In this paper, we introduce a new NLP task, similar to the word expansion task or word similarity task, which aims to discover words sharing the same semantic components (feature sub-space) with seed words. We also propose a feature extraction method based on word embeddings for this problem. We train word embeddings using state-of-the-art methods such as word2vec and models supplied by the Stanford NLP Group. Prior statistical knowledge and negative sampling are utilized to help extract the feature sub-space. We evaluate our model on the WordNet synonym dictionary dataset and compare it to word2vec on synonym mining and word similarity computation tasks, showing that our method outperforms other models and methods and can significantly help improve language understanding.
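A minimal sketch of discovering words that share the seeds' semantic components (`embed` is a hypothetical word-to-vector lookup such as word2vec output; subtracting averaged negative words is a crude stand-in for the paper's negative sampling):

```python
import numpy as np

def expand_seeds(seeds, negatives, embed, vocab, k=2):
    # Average the seed vectors into a direction for the shared feature
    # sub-space, push it away from negative samples, then rank the
    # vocabulary by cosine against that direction.
    direction = np.mean([embed[w] for w in seeds], axis=0)
    if negatives:
        direction -= np.mean([embed[w] for w in negatives], axis=0)
    direction /= np.linalg.norm(direction)
    scores = {w: float(embed[w] @ direction / np.linalg.norm(embed[w]))
              for w in vocab if w not in seeds}
    return sorted(scores, key=scores.get, reverse=True)[:k]

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car", "bus"]
embed = {w: rng.random(8) for w in vocab + ["tiger"]}
print(expand_seeds(["tiger"], [], embed, vocab))
```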
|
17:30-19:00 |
Zhongqing Wang, Liyuan Lin, Shoushan Li and Guodong Zhou
ABSTRACT: Opinion summarization on conversations aims to generate a sentimental summary for a dialogue and has been shown to be much more challenging than traditional topic-based summarization and general opinion summarization, due to its specific characteristics. In this study, we propose a graph-based framework for opinion summarization on conversations. In particular, a random walk model is proposed to globally rank the utterances in a conversation. The main advantage of our approach is its ability to integrate various kinds of important information, such as utterance length, opinion, and dialogue structure, into a graph to better represent the utterances in a conversation and the relationships among them. Besides, a global ranking algorithm is proposed to optimize the graph. Empirical evaluation on the Switchboard corpus demonstrates the effectiveness of our approach.
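A minimal power-iteration sketch of the random-walk ranking (the edge weights in `S` are assumed to already fold in utterance length, opinion and dialogue structure, which is where the paper's actual modeling lies):

```python
import numpy as np

def rank_utterances(S, d=0.85, tol=1e-8):
    # PageRank-style random walk over an utterance graph with
    # non-negative edge weights S[i, j].
    n = S.shape[0]
    P = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - d) / n + d * (P.T @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new  # stationary scores: higher = more summary-worthy
        r = r_new

S = np.array([[0, 2, 1], [2, 0, 1], [1, 1, 0]], dtype=float)
print(rank_utterances(S))
```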
|
17:30-19:00 |
Chengxu Ye, Wushao Wen and Ping Yang
ABSTRACT: Microblog platforms such as Weibo now accumulate large-scale data, including Tibetan messages. Discovering the latent topics in such a huge volume of Tibetan data plays a significant role in tracing the dynamics of the Tibetan community, which helps uncover this community's public opinion for the government. Although topic models can find the latent structure of traditional document corpora, their performance on Tibetan messages is unsatisfactory because the short messages cause a severe data sparsity challenge. In this paper, we propose a novel model called TM-ToT, derived from ToT (Topic over Time), which aims at mining latent topics effectively from Tibetan messages. Firstly, we assume each topic is a mixture distribution influenced by both word co-occurrences and message timestamps; TM-ToT can therefore capture the changes of each topic over time. Subsequently, we aggregate all messages published by the same author into a lengthy pseudo-document to tackle the data sparsity problem. Finally, we present a Gibbs sampling implementation for the inference of TM-ToT. We evaluate TM-ToT on a real dataset. In our experiments, TM-ToT outperforms Twitter-LDA by a large margin in terms of perplexity, and the quality of the latent topics generated by TM-ToT is promising.
|
17:30-19:00 |
Zeyu Meng, Dong Yu and Endong Xun
ABSTRACT: Microblogs provide a convenient and instant platform for information publication and acquisition. Their short, noisy, real-time nature makes Chinese microblog entity linking a new challenge. In this paper, we investigate the linking approach and describe the implementation of a Chinese Microblog Entity Linking (CMEL) system. In particular, we first build a synonym dictionary and process special identifiers. Then we generate the candidate set by combining Wikipedia and search engine retrieval results. Finally, we adopt an improved VSM to compute textual similarity for entity disambiguation. The accuracy of the CMEL system is 84.35%, which ranks second in the NLPCC 2014 Entity Linking evaluation task.
|
17:30-19:00 |
Lin Gui, Li Yuan, Ruifeng Xu, Bin Liu, Qin Lu and Yu Zhou
ABSTRACT: Identifying the cause of an emotion is a new challenge for researchers in natural language processing. Currently, there is no existing work on emotion cause detection from Chinese microblog (Weibo) text. In this study, an emotion cause annotated corpus is first designed and developed by annotating the emotion cause expressions in Chinese Weibo text. Up to now, the corpus consists of annotations for 1,333 Chinese Weibo posts, making it the largest available resource in this research area. Based on observations on this corpus, the characteristics of emotion cause expressions are identified. Accordingly, a rule-based emotion cause detection method is developed which uses 25 manually compiled rules. Furthermore, two machine learning based cause detection methods are developed, including a classification-based method using support vector machines and a sequence labeling based method using conditional random fields. The experimental results show that the rule-based method achieves a 68.30% accuracy rate, while the method based on conditional random fields achieves 77.57% accuracy, 37.45% higher than the reference baseline method. These results show the effectiveness of our proposed emotion cause detection methods.
|
17:30-19:00 |
Jian Wang, Xianhui Liu, Junli Wang and Weidong Zhao
ABSTRACT: Time-stamped texts or text sequences, such as news reports, are ubiquitous in real life, and tracking the topic evolution of these texts has been an issue of considerable interest. Recent work has developed methods for tracking topic shifts over long time scales. However, most of this research focuses on large corpora, and it considers only the text itself; no attempt has been made to explore the temporal distribution of the corpus, which could provide meaningful and comprehensive clues for topic tracking. In this paper, we formally address this problem and put forward a novel method based on the topic model. We investigate the temporal distribution of news reports about a specific event and integrate this information with a topic model to enhance the model's performance. By focusing on a specific news event, we try to reveal more details about it, such as how many stages there are in the event and what aspect each stage focuses on.
|
17:30-19:00 |
Ning Li, Yin Liu, Qi Liang and Xue Feng
ABSTRACT: Starting from an analysis of the limitations of traditional methods of integrating re-flowable and fixed-layout documents, a reversible transformation method is proposed that records the transformation information in the target document so that the source document can be restored when necessary from the recorded information. With UOF selected as the source re-flowable document format and CEBX as the target fixed-layout format, a successful reversible transformation from UOF to CEBX was implemented using the proposed method. Experiments show that the method gives fairly good results. This paper first studies the two document structures, UOF and CEBX, then focuses on the method itself, discussing in detail its principle, key techniques, experimental results, advantages and future work.
|
17:30-19:00 |
Yu He, Da Pan and Guohong Fu
ABSTRACT: Opinion mining has been a hot topic in natural language processing over the past years. While most previous studies on opinion mining have focused on sentiment classification and opinion extraction, little research has been done on mining the underlying reasons for opinions. This paper is primarily concerned with explanatory sentence extraction in Chinese product reviews. To this end, we first reformulate explanatory opinion sentence extraction as a classification problem. Then, we employ the auto-encoding technique to learn word embeddings from product reviews. Finally, we incorporate the word embeddings into support vector machines to perform explanatory sentence classification. Experimental results on product reviews of automobiles and cellphones show that word embeddings are more effective for explanatory sentence classification than traditional feature selection methods like TF-IDF and information gain.
|
17:30-19:00 |
Simeng Wang, Liangcai Gao, Yuehan Wang, Pingli Li and Zhi Tang
ABSTRACT: Chinese forms with the same or similar layouts are difficult to identify due to the huge disturbance caused by differences in user filled-in data and the limited discriminative identifiers in preprinted data, which are usually the titles of forms. Unfortunately, existing form identification algorithms can hardly meet the requirements of similar-layout form identification. In this paper, we propose a simple but effective distance-based method to identify forms with similar layouts by measuring the user filled-in data, preprinted data and dithering data. The proposed method utilizes three kinds of weight components to mitigate the impact of the randomness of user filled-in data, the consistency of similar layouts and position dithering respectively. Experimental results achieve more than 90% identification accuracy on a series of data sets, significantly better than the results of the state-of-the-art method.
|
17:30-19:00 |
Binbin He, Lei Zou and Dongyan Zhao
ABSTRACT: In the natural language processing area, many technologies need the assistance of a knowledge base. Many data quality issues have been studied for relational data; this paper focuses on the discovery of abnormal data in RDF graphs. Although association rules have been used to find abnormal data in RDF graphs, existing solutions ignore the latent semantics of connected structures. To detect latent dependencies in RDF graphs, we first define Graph-based Conditional Functional Dependencies (GCFDs), which represent the attribute-value and semantic-structure dependencies of RDF data in a uniform manner. We then propose an efficient framework and novel pruning rules to discover GCFDs, and give a workflow for auto-repairing erroneous data. Extensive experiments on several real-life RDF repositories confirm the superiority of our solution.
|
17:30-19:00 |
Weiliang Chen and Xiao Sun
ABSTRACT: With the rapid development of computer science, artificial intelligence has become a hot research direction in the field. Emotion recognition in man-machine interaction is a key technology for artificial intelligence, and Mandarin speech emotion recognition is one important part of emotion recognition. A standing problem is that the dimensionality of speech emotion feature values is large, making models difficult to train. To solve this problem, a new speech emotion recognition model, the MFCCG-PCA model, is put forward by combining the MFCC model and the PCA model. Multiple sets of experiments show that the MFCCG-PCA model offers a considerable performance improvement over the general MFCC model in speech emotion recognition.
|
17:30-19:00 |
Boli Wang, Xiaodong Shi, Wenyao Ren and Siyao Yan
ABSTRACT: In this article, corpus linguistics methods are applied to show that there is a simplification phenomenon of Chinese characters in Taiwan. Firstly, a Taiwan Chinese corpus was built from a large number of texts from media, government websites and blogs. Secondly, using statistics from the corpus, the article shows that people in Taiwan prefer popular Chinese characters with fewer strokes, which implies a simplification phenomenon. Lastly, the article analyzes several factors influencing the simplification of Chinese characters in Taiwan, including simplified Chinese from the mainland, Chinese character encodings and Chinese IMEs.
|
17:30-19:00 |
Xiaochen Lv, Endong Xun, Weihua An and Yannan Sun
ABSTRACT: For intelligent teaching of Chinese character handwriting, this paper presents a stroke retrieval method for handwritten Chinese character images that consists of three steps. Firstly, the method extracts the skeletons from the handwritten image. Secondly, from the perspective of knowledge engineering, it eliminates skeleton distortions by using the stable grapheme topology. Thirdly, it divides the skeletons into strokes and outputs the matching relationship between them and the strokes of the template character, by building and solving a similarity model with the A* algorithm. The results of our method can be used for automatic quality assessment of handwritten Chinese character images.
|
17:30-19:00 |
Siyao Yan, Xuling Zheng, Xiaodong Shi and Fakui Zheng
ABSTRACT: This paper presents, in the study of Chinese classical poetry, a first attempt to combine natural language processing, computational poetics and computer animation to automatically generate animations for classical poems. It proposes an SVM-based method to collaboratively determine a poem's style, theme and time setting, together with a method for classifying poem scenes based on the time results. It also analyzes keyword co-occurrence relations to supplement the animated elements used in the subsequent animation generation. Experiments show that the proposed methods provide an initial solution to the automatic generation of animations for classical poetry, and also provide a theoretical and experimental foundation for subsequent research.
|
17:30-19:00 |
Qi Rao, Peiyan Wang and Guiping Zhang
ABSTRACT: The SAO-based qualitative analysis of patents is an important method for analysing the contents of patent literature, and its key point is the extraction of SAO structures. To address SAO-based relation extraction from Chinese patent literature, this paper carries out a series of experiments using support vector machines. It focuses on analysing the validity of basic lexical information, syntactic information such as shortest-path enclosed trees, and the distance features used in related work. The experiments show that simple lexical features contribute to good performance, while syntactic features do not bring remarkable improvement. Moreover, the feasibility of a new word representation, distributed word representations, is demonstrated for SAO-based relation extraction.
|
17:30-19:00 |
Renfen Hu and Yuchen Zhu
ABSTRACT: This paper proposes a text classification model for Tang poetry. Firstly, seven categories of poetry themes are defined: love and marriage, frontier war, friendship and farewell, journey and homesickness, landscape and countryside, history and nostalgia, and others. 500 Tang poems are selected as research samples and represented as vectors with the Vector Space Model. To reduce the vector dimensionality, feature selection is performed with the chi-square test. Two classifiers are built based on the Naive Bayes and Support Vector Machine algorithms, and both perform well in the classification experiments. Besides, the models verify the positive effect of poem titles, authors and types on theme classification, which could offer a scientific reference for related research on Tang poetry.
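The described pipeline is standard enough to sketch with scikit-learn (the poems and labels below are toy stand-ins for the 500 segmented samples; the paper's second classifier swaps Naive Bayes for an SVM):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

poems = ["大漠 孤烟 落日", "海内 知己 天涯", "春风 不度 玉门关", "劝君 更尽 一杯酒"]
themes = ["frontier war", "friendship and farewell",
          "frontier war", "friendship and farewell"]

# Vectorize, select features by the chi-square test, then classify.
clf = make_pipeline(CountVectorizer(), SelectKBest(chi2, k=5), MultinomialNB())
clf.fit(poems, themes)
print(clf.predict(["孤烟 大漠"]))
```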
|
17:30-19:00 |
Kan Liu and Yunying Yuan
ABSTRACT: According to the characteristics of short texts, we propose a feature extraction and clustering algorithm named the deep denoising sparse auto-encoder. The algorithm takes advantage of deep learning to transform high-dimensional, sparse vectors into new, low-dimensional, essential ones. We introduce the L1 paradigm to avoid overfitting, and add noise to improve robustness. Experimental results show that applying the extracted text features can significantly improve clustering effectiveness. It is a valid solution to the high-dimensional, sparse problems that short text vectors always confront.
|
17:30-19:00 |
Emotion Analysis in Chinese Weibo Texts (Tsinghua Univ.)
Fei Jiang
|
17:30-19:00 |
Sentiment Classification with Deep Learning (Zhengzhou Univ.)
Li Wan
|
17:30-19:00 |
Sentiment Classification with Deep Learning (Dalian University of Technology)
Long Chen
|
17:30-19:00 |
Chinese Entity Linking (Southwestern University)
Southwestern University
|
17:30-19:00 |
Chinese Entity Linking (Communication University of China)
Communication University of China
|
17:30-19:00 |
Large Scale Chinese News Categorization (Southwest Jiaotong University)
Zhen Jia
|
Sentiment Analysis:
Dec. 9, 2014 (13:30–15:10),
Kylin Villa Hotel-FENGHUANG HALL,
Chair: XIA Yunqing
|
13:30-13:50 |
Yanyan Zhao, Bing Qin and Ting Liu
ABSTRACT: Target extraction is an important task in opinion mining, in which a complete target consists of an aspect and its corresponding object. However, previous work usually treats the aspect alone as the target and ignores the important element, the object; such incomplete targets are of limited use in practical applications. This paper proposes a novel and important sentiment analysis task, aspect-object alignment, which aims to obtain the correct corresponding object for each aspect and thus solve the "object ignoring" problem. We design a two-step framework for this task. We first build an aspect-object alignment classifier that incorporates three sets of features. However, the objects assigned to aspects in a sentence often contradict each other, so in the second step we impose two kinds of constraints, intra-sentence and inter-sentence, encode them as linear formulations, and use Integer Linear Programming (ILP) as the inference procedure to obtain a final global decision. Experiments on corpora from the camera domain show the effectiveness of the framework.
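To make the ILP step concrete, here is a hedged sketch using the PuLP library; the score matrix and the one-object-per-aspect constraint are invented for illustration, since the paper's actual constraints are richer:

```python
# Illustrative ILP: pick one object per aspect, maximizing classifier
# scores subject to a toy consistency constraint.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

scores = {("screen", "cameraA"): 0.9, ("screen", "cameraB"): 0.2,
          ("lens", "cameraA"): 0.4, ("lens", "cameraB"): 0.7}
aspects = {"screen", "lens"}
objects = {"cameraA", "cameraB"}

prob = LpProblem("aspect_object_alignment", LpMaximize)
x = {(a, o): LpVariable(f"x_{a}_{o}", cat=LpBinary)
     for a in aspects for o in objects}

prob += lpSum(scores[a, o] * x[a, o] for a in aspects for o in objects)
for a in aspects:                    # each aspect gets exactly one object
    prob += lpSum(x[a, o] for o in objects) == 1

prob.solve()
alignment = {a: o for (a, o), v in x.items() if v.value() == 1}
```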
|
13:50-14:10 |
Junjie Li, Yu Zhou, Chunyang Liu and Lin Pang
ABSTRACT: We study the sentiment classification of Chinese contrast sentences, one of the commonly used language constructs in text; in a typical review, at least around 6% of the sentences are of this kind. Because of the complex contrast phenomenon, traditional bag-of-words models handle such sentences poorly. In this paper, we propose a Two-Layer Logistic Regression (TLLR) model to leverage the contrast relationship in sentiment classification: depending on the connectives, our model can treat different clauses differently. Experimental results show that the TLLR model effectively improves the performance of sentiment classification on Chinese contrast sentences.
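A speculative sketch of the two-layer idea (my reading of the abstract, not the authors' model): a clause-level logistic regression produces per-clause scores, and a second regression combines them with features keyed to the connective:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Layer 1: clause-level sentiment scores from bag-of-words features.
clause_lr = LogisticRegression()
# clause_lr.fit(clause_features, clause_labels)  # trained beforehand

def sentence_features(clause_feats, connective_id, n_connectives=10):
    """Layer-2 features: clause scores plus a one-hot connective."""
    scores = clause_lr.predict_proba(clause_feats)[:, 1]  # P(positive)
    onehot = np.zeros(n_connectives)
    onehot[connective_id] = 1.0
    # e.g. for "although A, B": [score_A, score_B, connective one-hot]
    return np.concatenate([scores, onehot])

sentence_lr = LogisticRegression()  # Layer 2: final polarity decision
# sentence_lr.fit(np.stack(sent_feats), sent_labels)
```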
|
14:10-14:30 |
Zhu Zhu, Min Dai, Shoushan Li and Guodong Zhou
ABSTRACT: The opinion target is an important element of sentiment information. Since ellipsis is a common phenomenon in Chinese expression, the ellipsis of opinion targets in Chinese sentiment texts is also frequent. This paper presents a machine learning method that recognizes opinion target ellipsis in Chinese text by treating the task as a binary classification problem. We present three kinds of features, namely position-independent sentence features, position-dependent sentence features, and contextual features, and test their effects on recognition performance separately. Experimental results in three domains demonstrate that the machine learning-based method is effective for recognizing opinion target ellipsis.
|
14:30-14:50 |
Yuan Wang, Zhaohui Li, Jie Liu, Zhicheng He, Yalou Huang and Dong Li
ABSTRACT: In recent years, the large volume of product reviews on the Internet has become an important source of information for potential customers, helping them research products or services before making purchase decisions. Sentiment analysis of product reviews has therefore become a hot issue in natural language processing and text mining. Given the good performance of unsupervised neural network language models across a wide range of natural language processing tasks, we propose a semi-supervised deep learning model for sentiment analysis that introduces supervised sentiment labels into traditional neural network language models, enhancing the expression of sentiment information as well as semantic information in word vectors. Experiments on the NLPCC2014 product review datasets demonstrate that our method outperforms traditional methods and the methods of other teams.
|
14:50-15:10 |
Siqiang Wen, Zhixing Li and Juanzi Li
ABSTRACT: Social context understanding is a fundamental problem in social analysis. Social contexts are usually short, informal, and incomplete, so methods designed for formal texts perform poorly on them. We observe, however, that some relations between important words in formal texts are helpful for understanding social contexts. We propose a method that uses these relations to extract semantic chunks, i.e., meaningful and significant phrases that capture the gist of a given text, exploiting knowledge learned from semantically parsed corpora and a knowledge base. Experimental results on Chinese and English datasets demonstrate that our approach improves performance significantly.
|
Information Extraction:
Dec. 9, 2014(13:30–15:10),
Kylin Villa Hotel-MUMIAN HALL,
Chair: YU Zhengtao
|
13:30-13:50 |
Hao Wang, Zhenyu Qi, Hongwei Hao and Bo Xu
ABSTRACT: Entity relation extraction, an important task in information extraction, extracts the relation between two entities from input text. Previous research usually converts the problem into a sequence labeling problem and solves it with statistical models such as conditional random fields. This kind of method needs a large, high-quality training dataset, and thus has two main drawbacks: 1) for some target relations, training instances are easy to obtain but of poor quality; 2) for other relations, it is hard to get enough training data automatically. In this paper, we propose a hybrid method to overcome these shortcomings. For the first drawback, we design an improved candidate sentence selection method that finds high-quality training instances, which we then use to train our extraction model. For the second, we craft heuristic rules to extract entity relations. In our experiments, the candidate sentence selection method improves the average F1 value by 78.53%, and some detailed suggestions are given. We submitted 364,944 triples with a precision of 46.3% to the Sougou Chinese entity relation extraction competition, ranking 4th on the platform.
|
13:50-14:10 |
Xianqi Zou, Yaming Sun, Chengjie SUN, Bingquan Liu and Lei Lin
ABSTRACT: Entity linking, which links mentions in text to the corresponding entities in a knowledge base, has received increasing attention. Most entity linking work targets long texts such as BBS posts or blogs; microblogs, as a new kind of social platform, raise many additional problems. In this paper, we divide the entity linking task into two parts. The first is entity candidate generation and feature extraction: we use Wikipedia article information to generate enough entity candidates while eliminating ambiguous candidates as far as possible, achieving higher coverage with fewer candidates. For features, we adopt belief propagation over the topic distribution to obtain a global feature; experiments show that this achieves better performance than features based on common links, and that combining global and local features improves performance markedly. The second part is entity candidate ranking. Traditional learning-to-rank methods have been widely used for entity linking, but entity linking does not care about the ranking order of non-target entities, so we utilize a boosting algorithm in a non-ranking formulation to predict the target entity, which yields 77.48% accuracy.
|
14:10-14:30 |
Bingfeng Luo, Huanquan Lu, Yigang Diao, Yansong Feng and Dongyan Zhao
ABSTRACT: Automatically constructed knowledge bases often suffer from quality issues such as missing attributes for existing entities. Manually finding and filling missing attributes is time-consuming and expensive, since knowledge bases are growing at an unforeseen speed. We therefore propose an automatic approach that suggests missing attributes for entities via hierarchical clustering, based on the intuition that similar entities tend to share a similar group of attributes. We evaluate our method on a randomly sampled set of 20,000 entities from DBpedia. The experimental results show that our method achieves high precision and outperforms existing methods.
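A hedged sketch of that intuition (illustrative only; the entities, attributes, and suggestion threshold below are invented): cluster entities by their attribute sets, then propose attributes that are common in an entity's cluster but absent from the entity itself.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

entities = ["Berlin", "Paris", "Einstein"]
attributes = ["population", "mayor", "birthDate"]
# Binary entity-attribute matrix (rows: entities, cols: attributes).
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

def suggest(i, min_share=0.5):
    """Attributes held by >= min_share of the entity's cluster but not by it."""
    peers = X[labels == labels[i]]
    common = peers.mean(axis=0) >= min_share
    return [attributes[j] for j in range(len(attributes))
            if common[j] and X[i, j] == 0]

print(suggest(1))  # e.g. suggests "mayor" for Paris
```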
|
14:30-14:50 |
Jiagen Shu; Haotian Hui; Longhua Qian; Qiaoming Zhu
ABSTRACT: Entity linking is of great significance to research and applications in natural language processing, such as information fusion, knowledge acquisition, and knowledge graphs. In view of the lack of a Chinese entity linking benchmark corpus, this paper combines automatic construction with manual annotation to build a Chinese entity linking corpus, together with a related Chinese knowledge base, derived from the ACE2005 Chinese corpus and Chinese Wikipedia. Contrary to traditional English entity linking corpora, this corpus is based on entities rather than individual entity mentions. We believe the corpus provides a benchmark platform for the Chinese entity linking research community.
|
14:50-15:10 |
Xuewei Li, Xueqiang Lv and Kehui Liu
ABSTRACT: Recognizing Chinese location entities is an important part of event extraction. In this paper we propose a novel method that identifies Chinese location entities with a divide-and-conquer strategy. First, we use CRF role labeling to identify the basic place name. Second, we build an indicator lexicon in a semi-automatic way. Finally, we propose an attachment connection algorithm that joins the basic place name with its indicator, completing the identification of the location entity. In brief, our method decomposes a location entity into a basic place name and an indicator, which differs from traditional methods. Experimental results show that the proposed method performs well, reaching an F-value of 84.79%.
|
CIT Applications 2:
Dec. 9, 2014(13:30–15:10),
Kylin Villa Hotel-ZIJING HALL,
Chair: BAI Shuanhu
|
13:30-13:50 |
Jia Tian, Chang Su and Yijiang Chen
ABSTRACT: Motivated by findings on the embodiment of language comprehension and by difficulties in existing models of metaphor processing, this paper presents an adjective-based embodied cognitive net, which organizes comprehension knowledge from a novel point of view. Unlike the traditional approach that takes concepts as the core of knowledge comprehension, this paper views emotions as the core and as the driving force through which human beings come to know the world, and claims that the adjective, rather than the concept, is the carrier of emotion. By nature, when encountering something new, what first comes to a person's mind are the original descriptions (usually adjectives), and only then the concepts. The paper therefore constructs a net of adjectives, from concrete to abstract, according to embodiment; nouns are included as attachments that map adjectives to concepts. In particular, each adjective is given an embodied emotion so that emotion inference and metaphorical emotion analysis can be addressed in future work.
|
13:50-14:10 |
Jingwei Qu, Xiaoqing Lu, Lu Liu and Zhi Tang
ABSTRACT: Density analysis plays an important role in font design and recognition. This paper presents a density analysis method for Chinese characters. A number of density metrics describe the density of a character from both local and global perspectives, including the center-to-center distance between connected components, the gap between connected components, the perimeter-to-area ratio, the area ratio of connected components, and the area ratio of holes. Experimental results demonstrate that the proposed method is effective in measuring the density of Chinese characters.
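A toy sketch of connected-component-based density measures on a binary glyph image (assuming SciPy; the component count and mean center-to-center distance follow the paper's metrics, while the ink ratio is a simpler global stand-in of my own):

```python
import numpy as np
from scipy import ndimage

def density_metrics(glyph):
    """Illustrative density measures for a binary glyph array (1 = ink)."""
    labeled, n = ndimage.label(glyph)               # connected components
    centers = np.array(
        ndimage.center_of_mass(glyph, labeled, range(1, n + 1)))
    # Mean pairwise center-to-center distance between components.
    if n > 1:
        diffs = centers[:, None, :] - centers[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(-1))
        mean_cc = dists[np.triu_indices(n, k=1)].mean()
    else:
        mean_cc = 0.0
    ink_ratio = glyph.sum() / glyph.size            # global ink coverage
    return {"components": n, "mean_center_dist": mean_cc,
            "ink_ratio": ink_ratio}
```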
|
14:10-14:30 |
Yunqing Xia, Huan Zhao, Kaiyu Liu and Hualing Zhu
ABSTRACT: Healthcare data mining and business intelligence have attracted huge industry interest in recent years, but engineers encounter a bottleneck when applying data mining tools to textual healthcare records: many medical terms in the records differ from their standard forms, which we refer to as informal medical terms. Our study indicates that in Chinese healthcare records, a majority of the informal terms are abbreviations or typos. In this work, a multi-field indexing approach is proposed that accomplishes the term normalization task with an information retrieval algorithm over four levels of indices: word, character, pinyin, and pinyin initial. Experimental results show that the proposed approach is advantageous over state-of-the-art approaches.
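The four index fields could be built roughly as follows (assuming the pypinyin library; the field layout is my guess at the abstract's four levels, not the authors' schema):

```python
from pypinyin import lazy_pinyin, Style

def index_fields(term):
    """Build the four index fields for one medical term (illustrative)."""
    return {
        "word": term,                       # whole term
        "char": list(term),                 # individual characters
        "pinyin": lazy_pinyin(term),        # full pinyin syllables
        "initial": lazy_pinyin(term, style=Style.FIRST_LETTER),
    }

# index_fields("糖尿病")
# -> {'word': '糖尿病', 'char': ['糖', '尿', '病'],
#     'pinyin': ['tang', 'niao', 'bing'], 'initial': ['t', 'n', 'b']}
```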
|
14:30-14:50 |
Guangyi Li and Houfeng WANG
ABSTRACT: Keywords of scientific articles are important for grasping the trend of academic development. For keyword extraction from Chinese scientific articles, we adopt a framework that selects keyword candidates by Document Frequency Accessor Variety (DF-AV) and runs the TextRank algorithm on a phrase network. To improve the domain adaptation of keyword extraction, we introduce known keywords of a given domain as domain knowledge into this framework. Experimental results show that domain knowledge generally improves keyword extraction performance.
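The TextRank step might look like the following sketch (networkx's PageRank over a phrase co-occurrence graph; the window size is a placeholder and the DF-AV candidate selection is omitted):

```python
import networkx as nx

def textrank_phrases(candidate_seq, window=3, top_k=5):
    """Rank candidate phrases by PageRank over a co-occurrence graph.
    candidate_seq: candidate phrases in document order (after DF-AV filtering).
    """
    g = nx.Graph()
    for i, a in enumerate(candidate_seq):
        for b in candidate_seq[i + 1:i + window]:
            if a != b:
                g.add_edge(a, b)        # co-occurrence within the window
    scores = nx.pagerank(g)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# textrank_phrases(["神经网络", "关键词抽取", "神经网络", "图模型"])
```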
|
14:50-15:10 |
Xiaodong Zhang and Houfeng WANG
ABSTRACT: Question clustering plays an important role in QA systems. Because of data sparseness and the lexical gap between questions, there is often not enough information to guarantee good clustering results; moreover, previous work pays little attention to algorithmic complexity, making it infeasible on large-scale datasets. In this paper, we propose a novel similarity measure that employs word relatedness as additional information when calculating the similarity between questions. Based on this measure and the k-means algorithm, we propose a semantic k-means algorithm and an extended version of it. Experimental results show that the proposed methods are comparable in performance to state-of-the-art methods while requiring less time.
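One plausible form of such a similarity (my guess at the construction, not the paper's exact formula) replaces the plain cosine with a relatedness-weighted bilinear form:

```python
import numpy as np

def semantic_similarity(x, y, M):
    """Relatedness-weighted similarity between two bag-of-words vectors.
    M[i, j] is the relatedness of word i and word j (M = I gives cosine)."""
    num = x @ M @ y
    den = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return num / den if den else 0.0

# Toy vocabulary {car, automobile}: the two questions share no words,
# but a high car/automobile relatedness still makes them similar.
M = np.array([[1.0, 0.9],
              [0.9, 1.0]])
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(semantic_similarity(x, y, M))   # ~0.9 instead of 0.0
```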
|
NLP for Social Media:
Dec. 9, 2014(15:30–17:10),
Kylin Villa Hotel- FENGHUANG HALL,
Chair: LI Juanzi
|
15:30-15:50 |
Ying Chen and Bei Pei
ABSTRACT: In this paper, we propose a weakly-supervised occupation detection approach that automatically detects occupation information for microblogging users. The approach makes use of two types of user information, tweets and personal descriptions, through a rule-based detector and an MCS-based detector (MCS: multiple classifier system). First, the rule-based detector uses the personal descriptions of some users to create pseudo-training data. Second, based on the pseudo-training data, the MCS-based detector uses tweets for further occupation detection. However, the pseudo-training data is severely skewed and noisy, which poses a big challenge for the MCS-based detector; we therefore propose a class-based random sampling method and a cascaded ensemble learning method to overcome these data problems. Experiments show that the weakly-supervised approach achieves good performance. Moreover, although our study is conducted on Chinese, the approach is language-independent.
|
15:50-16:10 |
Shi Feng, Le Zhang, Daling Wang and Yifei Zhang
ABSTRACT: Nowadays, people like to extend their real-life social relations into online virtual social networks. With the blooming of Web 2.0 technology, huge numbers of users gather in microblogging services such as Twitter and Weibo to express opinions, record their personal lives, and communicate with each other. How to recommend potentially good friends to a target user has become a critical problem for both commercial companies and research communities, and the key issue is designing an appropriate user similarity measure. In this paper, we propose a novel microblog user similarity model for online friend recommendation that linearly combines multiple similarity measurements over microblogs, giving a more comprehensive view of user relationships in the microblogging space. Extensive experiments on a real-world dataset validate that our model outperforms baseline algorithms by a large margin.
|
16:10-16:30 |
Xueqin Sui, Zhumin Chen, Kai Wu, Pengjie Ren, Jun Ma and Fengyu Zhou
ABSTRACT: People exist along two dimensions in the real world: time and space. Detecting users' locations automatically is significant for many location-based applications such as dietary recommendation and tourism planning. With the rapid development of social media such as Sina Weibo and Twitter, more and more people publish messages, at any time, that contain real-time location information, which makes it possible to detect users' locations automatically from social media. In this paper, we propose a method to detect a user's city-level locations based only on his or her published posts. Our approach combines two components: a Chinese location library, used to match location names mentioned in a post, and a model of word distributions over locations, used to mine the location information implied by the non-location words in a post. Furthermore, for a user's detected location sequence, we use the transfer speed between adjacent locations to smooth the sequence in context. Experiments on a real dataset from Sina Weibo demonstrate that our approach significantly outperforms baseline methods in terms of Precision, Recall, and F1.
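A toy version of the second component: score each city by the log-probability of the post's words under that city's word distribution (the library lookup and speed smoothing are omitted, and all data here is invented):

```python
import math

# Invented per-city word distributions P(word | city).
city_word_dist = {
    "Beijing": {"胡同": 0.02, "地铁": 0.01},
    "Shenzhen": {"科技园": 0.02, "地铁": 0.015},
}

def detect_city(words, smoothing=1e-6):
    """Pick the city maximizing the summed log P(word | city)."""
    def score(city):
        dist = city_word_dist[city]
        return sum(math.log(dist.get(w, smoothing)) for w in words)
    return max(city_word_dist, key=score)

print(detect_city(["科技园", "地铁"]))  # -> "Shenzhen"
```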
|
16:30-16:50 |
Kan Liu; Yunying Yuan; Ping Liu
ABSTRACT: Bot-users, accounts controlled by third-party software platforms that publish, forward, and review automatically, help their operators diffuse false information and broadcast spam advertisements. Their high degree of automation, strong disguise, and targeted ability to release and spread harmful information have seriously affected normal online public opinion. In this paper, we focus on Weibo users: we establish a four-dimensional characteristic index covering behavior pattern, content, relations, and platform; use information entropy, content repetition rate, and other measures to construct a feature vector; and design an identification model based on the Random Forest algorithm. Finally, a Sina Weibo dataset is used to verify the efficiency and effectiveness of the model, which reaches an accuracy of 96.7%. The results indicate that the model is good at distinguishing bot-users from ordinary ones.
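As a hedged sketch of the classification step (the three features and the toy data are invented; the paper's actual index spans four dimensions of features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def posting_entropy(hour_counts):
    """Shannon entropy of a user's posts-per-hour histogram;
    bots often post with unnaturally regular (low-entropy) timing."""
    p = np.asarray(hour_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# Each row: [posting entropy, content repetition rate, followers/following].
X = np.array([[0.5, 0.9, 0.01],    # bot-like
              [4.2, 0.1, 1.20]])   # human-like
y = np.array([1, 0])               # 1 = bot, 0 = ordinary user

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
# clf.predict([[0.7, 0.8, 0.02]])  # -> array([1])
```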
|
IR & QA:
Dec. 9, 2014(15:30–17:10),
Kylin Villa Hotel-MUMIAN HALL,
Chair: FENG Yansong
|
15:30-15:50 |
Jiaxin Mao, Yiqun Liu, Min Zhang and Shaoping MA
ABSTRACT: Click-through information has been regarded as one of the most important signals for implicit relevance feedback in Web search engines. Because users vary widely in personal characteristics such as search expertise, domain knowledge, and carefulness, different users' clicks should not be treated as equally important. Unlike most existing work, which estimates the credibility of user clicks from click-through or querying behavior, we propose to enrich the credibility estimation framework with mouse movement and eye-tracking information. In the proposed framework, the credibility of user clicks is evaluated with a number of metrics in which a user, in the context of a certain search session, is treated as a relevant-document classifier. Using an experimental search engine system that collects click-through, mouse movement, and eye movement data simultaneously, we find that credible user behaviors can be separated from non-credible ones with a number of interaction behavior features. Further experimental results indicate that relevance prediction performance can be improved with the proposed estimation framework.
|
15:50-16:10 |
Yunqing Xia, Qiuge Zhang, Huiyuan Wang and Huan Zhao
ABSTRACT: Previous work has justified the assumption that document ranking can be improved by further considering coarse-grained relations at various linguistic levels (e.g., lexical, syntactic, and semantic). To the best of our knowledge, little work has incorporated fine-grained ontological relations (e.g., ) into document ranking. Two contributions are worth noting in this work. First, three combination models (summation, multiplication, and amplification) are designed to re-calculate the query-document relevance score, considering both the term-level Okapi BM25 relevance score and the relation-level relevance score. Second, a vector-based scoring algorithm is proposed to calculate the relation-level relevance score. Experiments on medical document ranking with the CLEF2013 eHealth Lab medical information retrieval dataset show that document ranking can be further improved by incorporating fine-grained ontological relations.
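Hedged guesses at what the three combinations could look like (the abstract does not give the formulas; the functional forms and weight names below are assumptions, not the paper's definitions):

```python
def summation(bm25, rel, alpha=0.5):
    # weighted sum of term-level and relation-level scores
    return (1 - alpha) * bm25 + alpha * rel

def multiplication(bm25, rel, eps=1e-9):
    # scores reinforce each other; eps keeps a zero relation score
    # from wiping out the BM25 evidence entirely
    return bm25 * (rel + eps)

def amplification(bm25, rel, boost=0.2):
    # BM25 amplified when ontological relations match
    return bm25 * (1.0 + boost * rel)

# e.g. amplification(bm25=7.3, rel=0.6) -> 8.176
```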
|
16:10-16:30 |
Xuewei Li; Xueqiang Lv; Zhian Dong; Kehui Liu
ABSTRACT: Query classification can improve search results, increase advertising accuracy, and help understand the real intention of the user. With domain URLs manually collected from aggregator sites, we propose a variance-based method that identifies domain URL-keys by utilizing the usage frequency of each URL-key across categories. We then filter the URL-keys using machine translation, pinyin, and search result feedback. Finally, coupled with relevance feedback, we classify queries by selecting URL-keys as features, building URL-key vectors, and applying a multi-class SVM classifier. Experimental results show that our method not only uses fewer resources but also achieves an F-value 7% higher than the contrast method.
|
16:30-16:50 |
Juan Hu; Yu Bai; Dongfeng Cai
ABSTRACT: User modeling is the essential technology behind personalized service, and the query log, which implies a large quantity of user interests, is one of the primary data sources for it. We propose a query-weighting method for user modeling that simulates the interaction between the user and the search engine. First, the query log is divided into sessions according to a session division principle. Then, for each session, user behavior information such as query frequency, duration, and the ranks of clicked URLs is used to calculate the weight of each query. Finally, a voting method generates the user model. Experimental results on the AOL query log dataset show the effectiveness of the method.
|
16:50-17:10 |
Hong Sun, Furu Wei and Ming Zhou
ABSTRACT: Answer extraction for Web-based question answering aims to extract answers from snippets retrieved by search engines. Search results contain a lot of noisy and incomplete text, so the task is more challenging than traditional answer extraction over offline corpora. In this paper we discuss the important role of employing multiple extraction engines for Web-based question answering: aggregating multiple engines can ease the negative effect of noisy search results on any single method. We adopt a Pruned Rank Aggregation method that prunes while aggregating the candidate lists produced by multiple engines, fully leveraging redundancies within and across lists to reduce noise in the candidate list without hurting answer recall. In addition, we rank the aggregated list within a learning-to-rank framework using similarity, redundancy, quality, and search features. Experimental results on TREC data show that our method effectively reduces noise in the candidate list and greatly improves answer ranking; it outperforms the state-of-the-art answer extraction method and copes well with the noisy search snippets of Web-based QA.
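As a rough illustration of rank aggregation with pruning (a Borda-style scheme of my own; the paper's actual pruning criteria are richer):

```python
from collections import defaultdict

def pruned_borda(candidate_lists, min_support=2):
    """Aggregate ranked answer lists from several extraction engines.
    Candidates appearing in fewer than min_support lists are pruned
    as likely noise; survivors are scored by summed Borda points."""
    support = defaultdict(int)
    score = defaultdict(float)
    for lst in candidate_lists:
        n = len(lst)
        for rank, cand in enumerate(lst):
            support[cand] += 1
            score[cand] += n - rank          # higher rank, more points
    kept = [c for c in score if support[c] >= min_support]
    return sorted(kept, key=score.get, reverse=True)

lists = [["1969", "1968", "Apollo"], ["1969", "moon"], ["1969", "1968"]]
print(pruned_borda(lists))   # -> ['1969', '1968']
```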
|
Open Fund & Expo(Chinese Session):
Dec. 9, 2014(15:30–17:10),
Kylin Villa Hotel-ZIJING HALL,
Chair: TANG Zhi
|
15:30-15:50 |
Expert Recommendation for Question Answering Based on Social Networks (基于社交网络的问答专家推荐技术)
周光有
ABSTRACT: No
|
15:50-16:10 |
Research on WebFont Technology Based on a Dynamic Description Library of Chinese Characters (基于汉字动态描述库的WebFont技术研究)
熊晶
ABSTRACT: No
|
16:10-16:30 |
Automatic Extraction of Chinese Character Strokes (汉字笔画的自动提取)
梁燕
ABSTRACT: No
|
16:30-16:50 |
Research on Inverted Index Technology for Search Engines (基于搜索引擎的倒排索引技术研究)
才让叁智
ABSTRACT: No
|
16:50-17:10 |
Research on Key Technologies for Linked Data-based Semantic Knowledge Organization and Services for Chinese Academic Journals (基于Linked Data的中文学术期刊语义知识组织与服务关键技术研究)
王鑫
ABSTRACT: No
|