Best Papers:
Oct. 13, 2015 (09:30–10:20),
Treasure Palace Hotel - Palace Grand Ballroom (碧丽宫),
Chair: Juanzi LI

09:30-09:55
Yujie Cao, Minlie Huang and Xiaoyan Zhu
ABSTRACT: Clustering sentiment phrases in product reviews makes it possible to obtain the most important information about a product directly from thousands of reviews. A sentiment phrase has two main components, the aspect word and the opinion word, and the two parts need to be clustered simultaneously. Although several methods have been proposed to cluster words or phrases, limited work has been done on clustering two-dimensional sentiment phrases. In this paper, we apply a two-sided hidden Markov random field (HMRF) model to this task. We use a constrained co-clustering approach with some prior knowledge, in a semi-supervised setting. Experimental results on sentiment phrases extracted from about 0.7 million mobile phone reviews show that this method is promising for the task and that it outperforms the baselines by a clear margin.

09:55-10:20
Baolin Peng, Kaisheng Yao, Jing Li and Kam-Fai Wong
ABSTRACT: Recurrent Neural Networks (RNNs) have become increasingly popular for the task of language understanding. In this task, a semantic tagger is deployed to associate a semantic label with each word in an input sequence. The success of RNNs may be attributed to their ability to memorise long-term dependencies that relate the current semantic label prediction to observations many time steps away. However, the memory capacity of simple RNNs is limited because of the vanishing and exploding gradient problem. We propose to use an external memory to improve the memorisation capability of RNNs. Experiments on the ATIS dataset demonstrate that the proposed model achieves state-of-the-art results. Detailed analysis may provide insights for future research.

Fundamentals:
Oct. 12, 2015 (13:30–15:10),
Treasure Palace Hotel - Four Seasons (二楼四季),
Chair: Guodong ZHOU

13:30-13:50
Chenxi Zhu, Xipeng Qiu and Xuanjing Huang
ABSTRACT: Long-distance dependency relations are one of the main challenges for state-of-the-art transition-based dependency parsing algorithms. In this paper, we propose a method to improve the performance of transition-based parsing with long-distance collocations. With these long-distance collocations, our method provides an approximate global view of the entire sentence, somewhat similar to top-down parsing. To further improve the accuracy of decisions, we extend the set of parsing actions with two more fine-grained actions based on the types of arcs. Experimental results show that our method improves parsing performance effectively, especially for long sentences.

13:50-14:10
Lvexing Zheng, Houfeng Wang and Xueqiang Lv
ABSTRACT: Lexical semantic information plays an important role in supervised dependency parsing. In this paper, we add lexical semantic features to the feature set of a parser, obtaining improvements on the Penn Chinese Treebank. We extract semantic categories of words from HowNet and use them as the semantic information of words. Moreover, we investigate a method to compute semantic similarity between Chinese compound words, obtaining semantic information for words that are not recorded in HowNet. Our experiments show that unlabeled attachment scores can increase by 1.29%.

14:10-14:30
Rui Lin, Muyun Yang, Shujie Liu, Sheng Li and Tiejun Zhao
ABSTRACT: This paper introduces a maximum entropy method for Discourse Coherence Modeling (DCM). Different from the state-of-the-art supervised entity-grid model and the unsupervised cohesion-driven model, the proposed model takes only lexical features as input, which increases training and decoding speed significantly. We conduct an evaluation on two publicly available benchmark data sets via sentence ordering tasks, and the results confirm the effectiveness of our maximum entropy based approach to DCM.

14:30-14:50
Chengjie SUN, Xiaoqiang Jin, Lei Lin, Yuming Zhao and Xiaolong Wang
ABSTRACT: In this paper, convolutional neural networks are employed for English article error correction. Instead of employing features that rely on human ingenuity and prior natural language processing knowledge, the words surrounding the article are taken as features. Our approach can be trained both on an error-annotated corpus and on a corpus without error annotations. Experiments are conducted on the CoNLL-2013 data set. Our approach achieves 38.10% in F1 and outperforms the best system (33.40%) that participated in the task. Experimental results demonstrate the effectiveness of our proposed approach.

14:50-15:10
Chang Su, Shuman Huang and Yijiang Chen
ABSTRACT: Previous work on metaphor interpretation has mostly focused on single-word verbal metaphors and ignored the influence of contextual information, leading to limitations such as ignoring the polysemy of metaphors. In this paper, we propose aspect-based semantic relatedness and present a novel metaphor interpretation method based on semantic relatedness for context-dependent nominal metaphors. First, we obtain the possible comprehension aspects according to the properties of the source domain. Then, combined with contextual information, we calculate the degree of relatedness between the target and source domains from different aspects. Finally, we select the aspect that maximizes the relatedness between the target and source domains as the comprehension aspect, and the metaphor explanation is formed with the corresponding property of the source domain. The results show that our method has higher accuracy. In particular, when the corpus contains insufficient information on the target domain, our method still performs well.

Sentiment Analysis:
Oct. 12, 2015 (13:30–15:10),
Treasure Palace Hotel - No.5 Meeting Room (二楼五号),
Chair: Ruifeng XU

13:30-13:50
Jiaying Song, Xu Huang and Guohong Fu
ABSTRACT: Feature selection and representation have been a key issue for polarity classification. This paper presents a lexical sentiment membership based feature representation for Chinese polarity classification under the framework of fuzzy set theory. To this end, we first use TF-IDF weighted words to construct the positive and negative polarity memberships for each feature word. Then, we compute the log-ratio of the two memberships. Finally, we take the membership log-ratios as features and build a polarity classifier based on support vector machines. We evaluate our approach over different datasets, including a corpus of reviews on automobile products, the NLPCC2014 data for sentiment classification evaluation and the IMDB film comments. The experimental results show that the proposed sentiment membership feature representation outperforms Boolean features, frequency-based features and word embedding based features.
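
The membership construction described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the authors' code; the smoothing constant and the exact membership definition are assumptions.

import math
from collections import defaultdict

def membership_log_ratios(docs, labels, tfidf, alpha=1.0):
    """Sketch: fuzzy sentiment membership per word from TF-IDF mass.

    docs:   list of token lists
    labels: parallel list, +1 for positive documents, -1 for negative
    tfidf:  dict mapping word -> TF-IDF weight (assumed precomputed)
    Returns dict word -> log(pos_membership / neg_membership).
    """
    pos_mass, neg_mass = defaultdict(float), defaultdict(float)
    for tokens, y in zip(docs, labels):
        for w in tokens:
            weight = tfidf.get(w, 0.0)
            if y > 0:
                pos_mass[w] += weight
            else:
                neg_mass[w] += weight
    ratios = {}
    for w in set(pos_mass) | set(neg_mass):
        # alpha smooths words seen in only one polarity class
        ratios[w] = math.log((pos_mass[w] + alpha) / (neg_mass[w] + alpha))
    return ratios

The log-ratios would then be used as feature values for an SVM, one dimension per feature word.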

13:50-14:10
Guoyong Cai and Binbin Xia
ABSTRACT: Recently, user generated multimedia content (e.g. text, image, speech and video) on social media is increasingly used to share experiences and emotions; for example, a tweet usually contains both text and images. Compared to analyzing the sentiment of texts and images separately, the combination of text and image may reveal tweet sentiment more adequately. Motivated by this rationale, we propose a method based on convolutional neural networks (CNN) for multimedia sentiment analysis of tweets consisting of text and images. Two individual CNN architectures are used for learning textual features and visual features, which are then combined as input to another CNN architecture that exploits the internal relation between text and image. Experimental results on two real-world datasets demonstrate that the proposed method achieves effective performance on multimedia sentiment analysis by capturing the combined information of texts and images.

14:10-14:30
Shaowu Zhang, Huali Liu, Liang Yang and Hongfei LIN
ABSTRACT: Cross-domain sentiment analysis addresses problems where the source domain and the target domain are different. Traditional sentiment classification approaches usually perform poorly on cross-domain problems, so this paper proposes a cross-domain sentiment classification method based on the extraction of key sentiment sentences. Firstly, based on the observation that not every part of a document is equally informative for inferring the sentiment orientation of the whole document, the concept of the key sentiment sentence is defined. Secondly, taking advantage of three properties, sentiment purity, keyword property and position property, we construct heuristic rules and combine them with machine learning to extract key sentiment sentences. The data is then divided into key and detail views, and integrating the two views effectively improves performance. Finally, experimental results show the superiority of our proposed method.

14:30-14:50
Cuijuan Liu, Zhen Liu, Yanjie Chai, Hao Fang and Liangping Liu
ABSTRACT: With the rapid development of the Internet, sentiment analysis of microblogs is becoming one of the important subjects in the study of big data. Existing research focuses on emotional tendency but lacks a detailed description of the various kinds of emotions, and so cannot intuitively reflect the emotional changes of social groups. An emotion analysis method based on the combination of dependency parsing and manual tagging is proposed, and facial expression animation is used to present the results of the emotion analysis. The emotions of microblog crowds in different areas toward different social events are visualized. The experimental results show that the model can closely and effectively simulate crowd emotion, and the research results provide a new way to analyze network public opinion based on big data.

14:50-15:10
Junhui Shen, Peiyan Zhu, Rui Fan, Wei Tan and Xueyan Zhan
ABSTRACT: With Western culture and science widely accepted in China, Traditional Chinese Medicine (TCM) has become a controversial issue, so it is important to study the public's sentiment and opinions on TCM. The rapid development of online social networks, such as Twitter, makes it convenient and efficient to sample hundreds of millions of people for the aforementioned sentiment study. To the best of our knowledge, the present work is the first attempt to apply sentiment analysis to TCM on Sina Weibo (a Twitter-like microblogging service in China). In our work, we first collected tweets on TCM topics from Sina Weibo and automatically labelled the tweets as supporting or opposing TCM based on user tags. Then, a Support Vector Machine classifier was built to predict the sentiment of TCM tweets without tags. Finally, we present a method to adjust the classifier results. Our method attains an F-measure of 97%.

Shared Task:
Oct. 12, 2015 (13:30–15:10),
Treasure Palace Hotel - V18 Meeting Room (四楼V18),
Chair: Xiaojun WAN

13:30-13:50
Jiayuan Chao, Zhenghua Li and Min Zhang
ABSTRACT: This paper describes our system designed for the NLPCC 2015 shared task on Chinese word segmentation (WS) and POS tagging for Weibo text. We treat WS and POS tagging as two separate tasks and use a cascaded approach. Our major focus is how to effectively exploit multiple heterogeneous data sources to boost the performance of statistical models. This work considers three sets of heterogeneous data, i.e., Weibo (WB, 10K sentences), Penn Chinese Treebank 7.0 (CTB7, 50K), and People's Daily (PD, 280K). For WS, we adopt the recently proposed coupled sequence labeling to combine WB, CTB7, and PD, boosting the F1 score from 93.76% (baseline model trained on only WB) to 95.58% (+1.82%). For POS tagging, we adopt an ensemble approach combining coupled sequence labeling and the guide-feature based method, since the three datasets have three different annotation standards. First, we convert PD into the annotation style of CTB7 based on coupled sequence labeling, denoted by PD_CTB. Then, we merge CTB7 and PD_CTB to train a POS tagger, denoted by Tag_{CTB7+PD_CTB}, which is further used to produce guide features on WB. Finally, the tagging F1 score is improved from 87.93% to 88.99% (+1.06%).

13:50-14:10
Jinwei Yuan, Yan Yang, Zhen Jia, Hongfeng Yin, Junfu Huang and Jie Zhu
ABSTRACT: For the task of Entity Recognition and Linking in Chinese Search Queries at NLP&CC 2015, this paper proposes solutions to entity recognition, entity linking and entity disambiguation. A dictionary, an online knowledge base and the SWJTU Chinese word segmenter are used in entity recognition. A synonym thesaurus, Wikipedia redirects and the combination of an improved PED (Pinyin Edit Distance) algorithm with LCS (Longest Common Subsequence) are applied in entity linking. Suffix supplement and link value computation based on online encyclopedias are adopted in entity disambiguation. The experimental results indicate that the proposed solutions are effective for short queries with insufficient context.

14:10-14:30
Kerui Min, Chenggang Ma, Tianmei Zhao and Haiyan Li
ABSTRACT: Chinese word segmentation and POS tagging are arguably the most fundamental tasks in Chinese natural language processing. In this paper, we present an ensemble approach for segmentation and POS tagging, combining both discriminative and generative methods to get the best of both worlds. Our approach achieved F1-scores of 96.65% and 91.55% for segmentation and tagging respectively in NLPCC 2015 Shared Task 1, obtaining 1st place in both tasks.

14:30-14:50
Zhonglin Ye, Zhen Jia, Yan Yang, Junfu Huang and Hongfeng Yin
ABSTRACT: For the open domain question answering evaluation task at the fourth CCF Natural Language Processing and Chinese Computing Conference (NLPCC 2015), a solution for automatic question answering that can answer natural language questions is proposed. Firstly, an SPE (Subject Predicate Extraction) algorithm is presented to find answers in the knowledge base, and then a WKE (Web Knowledge Extraction) algorithm is used to extract answers from search engine results. The experimental data provided in the evaluation task includes the knowledge base and questions in natural language. The evaluation results show an MRR of 0.5670, an accuracy of 0.5700 and an average F1 of 0.5240, indicating that the proposed method is feasible for open domain question answering.

Chinese Language Computing I:
Oct. 12, 2015 (15:40–17:00),
Treasure Palace Hotel - Four Seasons (二楼四季),
Chair: Endong XUN

15:40-16:00
Yunfang Wu, Fuqiang Wan, Yifeng Xu and Xueqiang Lv
ABSTRACT: This paper proposes a novel method for sentence-level Chinese discourse tree building. We construct a Chinese discourse annotated corpus in the framework of Rhetorical Structure Theory, and we propose a ranking-like SVM (SVM-R) model to automatically build the tree structure, which can capture the relative association strength among three consecutive text spans rather than only two adjacent spans as most previous approaches do. The experimental results show that our proposed SVM-R method significantly outperforms the state of the art in discourse parsing accuracy. We also demonstrate that the features useful for discourse tree building are consistent with the characteristics of the Chinese language.

16:00-16:20
Fenfen Shang, Yanhui Gu, Weiguang Qu, Bin Li and Junsheng Zhou
ABSTRACT: Semantically understanding words is an essential issue in text understanding, since it is one of its basic building blocks. However, there are plenty of unknown words, which makes it difficult for users to understand the content of texts. We focus on guessing the senses of Chinese unknown words based on the "Semantic Knowledge-base of Modern Chinese". Firstly, we introduce different levels of semantic dictionaries. Based on the new dictionaries, we introduce three models for sense prediction. We then integrate the models to predict the senses of unknown words and obtain better prediction performance. We also perform semantic prediction and annotation of the unknown words in the People's Daily corpus published in 2000 using each model. Finally, we obtain corpus resources with sense annotations for unknown words.

16:20-16:40
Huating Xu, Yujie Zhang, Xiaohui Yang, Hua Shan, Jinan Xu and Yufeng Chen
ABSTRACT: Chinese word segmentation systems trained on annotated newspaper corpora often degrade markedly in performance when faced with a new domain. Since there is no large-scale annotated corpus for the target domain, statistics-based methods do not work well. In this paper, we attack domain adaptation of Chinese word segmentation by combining active learning with n-gram statistical features. Our idea is to select a small amount of data for annotation such that the gap between the news domain and the target domain is overcome. The word segmentation model is then retrained on the corpus augmented with the newly annotated data and the n-gram statistical features from the raw corpus. We use a CRF model for training and a raw corpus of one million sentences of patent descriptions to verify the proposed approach. For test data, 300 sentences are randomly selected and manually annotated. The experimental results show that the performance of the Chinese word segmentation system based on our approach improves on all evaluation metrics.

16:40-17:00
Yang Wei, Jinmao Wei and Hengpeng Xu
ABSTRACT: To tackle the sparse data problem of the bag-of-words model for document representation, the Context Vector Model (CVM) has been proposed to enrich a document with the relatedness of all the words in a corpus to the document. The essence of CVM is the combination of word vectors, so the representation method used for words is essential to CVM. A computational study is performed in this paper to compare the effects of the newly proposed word representation methods embedded in CVM. The experimental results demonstrate that some of the newly proposed word representation methods significantly improve the performance of CVM, because they better estimate the relatedness between words.

Machine Translation:
Oct. 12, 2015 (15:40–17:00),
Treasure Palace Hotel - No.5 Meeting Room (二楼五号),
Chair: Deyi XIONG

15:40-16:00
Qiang Li, Mu Li, Dongdong Zhang and Jingbo Zhu
ABSTRACT: Due to data sparsity and the limited size of bilingual data, many high-quality phrase pairs cannot be generated. This paper generates example-based phrase pairs by decomposing, substituting and recombining the phrase pairs produced by the typical phrase extraction method in phrase-based statistical machine translation. On Chinese-to-English newswire and oral translation tasks, the experimental results demonstrate that our methods achieve significant improvements, yielding a gain of about 1 BLEU point on some test sets.

16:00-16:20
Haiqing Tang and Deyi Xiong
ABSTRACT: Existing phrase-based statistical machine translation (SMT) uses rather limited semantic knowledge, so the translation quality for a verb and its long-distance object is low. The authors propose a selectional-preference based translation model for SMT, which induces the semantic constraints that a verb imposes on its object in order to select the proper argument head word for a predicate over long distances. We first train on the corpus to obtain conditional-probability based selectional preferences for verbs, then integrate the selectional preferences into a phrase-based translation system and evaluate on a Chinese-to-English translation task with large-scale training data. Experimental results show that integrating selectional preferences into SMT can effectively capture long-distance semantic dependencies and improve translation quality.

16:20-16:40
Qinglin Li, Shujie Liu, Rui Lin, Mu Li and Ming Zhou
ABSTRACT: Nowadays a knowledge base (KB) is viewed as one of the important infrastructures for many web search applications and NLP tasks. However, in practice the availability of KB data varies from language to language, which greatly limits the potential usage of knowledge bases. In this paper, we propose a novel method to construct or enrich a knowledge base by entity translation with the help of another KB compiled in a different language. In our work, we concentrate on two key tasks: 1) collecting translation candidates with as good coverage as possible from various sources such as the web or lexicons; 2) building an effective disambiguation algorithm based on a collective inference approach over the knowledge graph to find the correct translation for entities in the source knowledge base. We conduct experiments on the movie domain of our in-house knowledge base from English to Chinese, and the results show the proposed method can achieve very high translation precision compared with classical translation methods, and significantly increase the volume of the Chinese knowledge base in this domain.

16:40-17:05*
Turdi Tohti, Winira Musajan and Askar Hamdulla
ABSTRACT: This paper puts forward a new idea and related algorithms for Uyghur segmentation. In this algorithm, word-based bi-gram statistics and contextual information are derived automatically from a large-scale raw corpus, and, according to Uyghur word association rules, a linear combination of mutual information, the difference of t-test and dual adjacent entropy is taken as a new measurement to estimate the association strength between two adjacent Uyghur words. Weakly associated inter-word positions are found and taken as segmentation points, yielding word strings that are complete in both semantics and structure. Experimental results on a large-scale corpus show that the proposed algorithm achieves 88.21% segmentation accuracy.
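
As a rough illustration of this kind of association measure, the sketch below scores adjacent word pairs with a weighted combination of pointwise mutual information and adjacent entropy, then cuts at weak points. The weights, the entropy definition and the threshold are assumptions for the sketch, not the authors' settings (their measure also includes the difference of t-test).

import math
from collections import Counter, defaultdict

def association_scores(corpus, a=1.0, b=1.0):
    """corpus: list of tokenized sentences. Returns (w1, w2) -> score."""
    uni, bi = Counter(), Counter()
    right = defaultdict(Counter)  # words seen right of w1, for adjacent entropy
    for sent in corpus:
        uni.update(sent)
        for w1, w2 in zip(sent, sent[1:]):
            bi[(w1, w2)] += 1
            right[w1][w2] += 1
    n = sum(uni.values())
    scores = {}
    for (w1, w2), c in bi.items():
        pmi = math.log(c * n / (uni[w1] * uni[w2]))
        total = sum(right[w1].values())
        ent = -sum((k / total) * math.log(k / total) for k in right[w1].values())
        scores[(w1, w2)] = a * pmi + b * ent
    return scores

def segment(sent, scores, threshold=1.0):
    """Cut the sentence wherever adjacent association falls below threshold."""
    chunks, cur = [], [sent[0]]
    for w1, w2 in zip(sent, sent[1:]):
        if scores.get((w1, w2), 0.0) < threshold:
            chunks.append(cur)
            cur = []
        cur.append(w2)
    chunks.append(cur)
    return chunks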

Evaluation/Open Fund Workshop (Chinese Session):
Oct. 12, 2015 (15:40–17:00),
Treasure Palace Hotel - V18 Meeting Room (四楼V18),
Chair: Zhi TANG

15:40-16:00
NLPCC Task Debriefing
Xiaojun WAN

16:00-16:20
Secure Search over Massive Private Text Data in Cloud Storage Environments (云存储环境下海量隐私文本数据的安全产寻方法)
Wei SONG

16:20-16:40
Research on Key Problems in Formula-Semantics-Based Mathematical Search (基于公式语义的数学搜索关键问题研究)
Wei SU

16:40-17:00
Dynamic Glyph Generation and OpenType Font Research for Xiangxi Folk Hmong Characters (湘西民间苗文的字形动态生成方法及OpenType字库研究)
Liping MO

Poster/Demo Presentations and Banquet:
Oct. 12, 2015 (17:30–21:10),
Treasure Palace Hotel - Palace Grand Ballroom (碧丽宫),

17:30-19:00
Lin Zhao, Ning Li, Qi Liang and Xin Peng
ABSTRACT: The basic idea of re-flowable document understanding and automatic typesetting is to generate logical documents by judging the hierarchical relationship of physical units and logical tags based on the identification of logical paragraph tags in the re-flowable document. In order to overcome the shortcomings of conventional logical structure reconstruction methods, a novel logical structure reconstruction method for re-flowable documents based on directed graphs is proposed in this paper. The method extracts the logical structure from a template document and then uses a single-source shortest path algorithm on the directed graph to filter out redundant logical tags, thus solving the problem of logical structure reconstruction of a document. Experimental results show that the algorithm can effectively improve the accuracy of logical structure recognition.

17:30-19:00
Tengju Ye, Zhipeng Xie and Ang Li
ABSTRACT: Canonical Correlation Analysis (CCA) is a standard statistical technique for finding linear projections of two arbitrary vectors that are maximally correlated. In complex situations, the linearity of CCA is not applicable. In this paper, we propose a novel local method for CCA to handle non-linear situations. We aim to find a series of local linear projections instead of a single global one. We evaluate the performance of our method and CCA on two real-world datasets. Our experiments show that the local method outperforms the original CCA in several realistic cross-modal multimedia retrieval tasks.

17:30-19:00
Chu Wang, Shi Feng, Daling Wang and Yifei Zhang
ABSTRACT: Most existing sentiment analysis methods focus on single-label classification, which means only an exclusive sentiment orientation (negative, positive or neutral) or an emotion state (joy, hate, love, sorrow, anxiety, surprise, anger, or expectation) is considered for the given text. However, multiple emotions with different intensities may coexist in one document, one paragraph or even one sentence. In this paper, we propose a fuzzy-rough set based approach to detect multi-labeled emotions and calculate their corresponding intensities in social media text. Using the proposed fuzzy-rough set method, we can simultaneously model multiple emotions and their intensities with sentiment words for a sentence, a paragraph, or a document. Experiments on a well-known blog emotion corpus show that our proposed multi-labeled emotion intensity analysis algorithm outperforms baseline methods by a large margin.

17:30-19:00
Yingbin Liu, Yannan Sun and Endong Xun
ABSTRACT: Chinese calligraphy alignment establishes a correspondence between two Chinese characters by measuring their similarity under certain rules and applying a transformation accordingly. This paper presents an innovative method to align two glyph contours in three steps. First, the 2D Bézier curve control points of each character's glyph contours are expanded into 3D space; second, a Gaussian Mixture Model (GMM) is constructed from this 3D point set; finally, we establish the alignment by minimizing the Euclidean (L2) distance between the two GMMs and then apply the transformation accordingly. Expansion into 3D space helps exploit inherent constraints of Chinese calligraphy beyond 2D coordinates, and the advantage of using a Gaussian mixture model is that both the overall shape and the local writing features are maintained during the alignment process. Experiments were conducted to verify the feasibility and effectiveness of this method, and the results show that it performs well for both single strokes and whole characters.
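
The L2 distance between two Gaussian mixtures has a closed form, because the integral of a product of two Gaussians is itself a Gaussian density evaluated at one of the means. A minimal sketch of that computation follows; equal-weight isotropic components are an assumption made for brevity, and the paper's exact model may differ.

import numpy as np
from scipy.stats import multivariate_normal

def gauss_overlap(m1, s1, m2, s2):
    # Integral of N(x; m1, s1*I) * N(x; m2, s2*I) dx = N(m1; m2, (s1+s2)*I)
    d = len(m1)
    return multivariate_normal.pdf(m1, mean=m2, cov=(s1 + s2) * np.eye(d))

def gmm_l2_distance(mus_f, mus_g, s=1.0):
    """Squared L2 distance between two equal-weight isotropic GMMs.

    mus_f, mus_g: (n, 3) arrays of component means (the 3D control points).
    Uses the expansion int (f-g)^2 = int f^2 - 2 int f*g + int g^2,
    where each term is a sum of pairwise Gaussian overlaps.
    """
    def cross(A, B):
        wa, wb = 1.0 / len(A), 1.0 / len(B)
        return sum(wa * wb * gauss_overlap(a, s, b, s) for a in A for b in B)
    return cross(mus_f, mus_f) - 2 * cross(mus_f, mus_g) + cross(mus_g, mus_g)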

17:30-19:00
Yuan Liao, Xiaoqing Lu, Zhi TANG, Yongtao Wang and Jianling Sun
ABSTRACT: Symmetry is a significant structural property of Chinese characters. However, limited texture features and the lack of efficient quantitative description methods hinder us from fully understanding and tapping into the symmetry of Chinese characters. This study proposes a symmetry detection method that combines different types of character features, such as the scale-invariant feature transform (SIFT) and contour information. A directed graph is constructed from the basic symmetric elements of a character to describe the enhancement relationships among the elements. Furthermore, the detection of the most significant axes of symmetry in a character is transformed into the problem of finding star subgraphs with locally maximum weight. Experiments show that the proposed method outperforms existing methods on the dataset we established.

17:30-19:00
Yixiu Wang, Yunfang Wu and Xueqiang Lv
ABSTRACT: We present a multi-sentence question segmentation strategy for community question answering services to alleviate the complexity of long questions. We develop a complete scheme for complex-question segmentation, including a question detector to extract question sentences, a question compression process to remove duplicate information, and a graph model to segment multi-sentence questions. In the graph model, we train an SVM classifier to compute the initial weights and we calculate the authority of each vertex to guide the propagation. The experimental results show that our method achieves a good balance between completeness and redundancy of information, and significantly outperforms state-of-the-art methods.

17:30-19:00
Changge Chen, Hai Zhao and Yang Yang
ABSTRACT: This paper focuses on improving a specific opinion spam detection task, deceptive spam. In addition to traditional word forms and other shallow syntactic features, we introduce two types of deep linguistic features. The first type is derived from a shallow discourse parser trained on the Penn Discourse Treebank (PDTB), which can capture inter-sentence information. The second type is based on the relationship between sentiment analysis and spam detection. Experimental results on the benchmark dataset demonstrate that both of the proposed deep features improve performance over the baseline.

17:30-19:00
Maolin Li and Yongmei Tan
ABSTRACT: Entity linking is the process of linking name mentions in text to their referent entities in a knowledge base. This paper tackles the task with an approach based on topic-sensitive random walk with restart. Firstly, the context information of mentions is used to expand the mentions and to search the Wikipedia knowledge base for candidate entities; secondly, a graph is constructed from the intermediate results of the previous step; finally, the topic-sensitive random walk with restart model is used to rank the candidate entities, and the top one is chosen as the linked entity. Experimental results show that this approach achieves an F-score of 0.623 on the KBP2014 data set, higher than that of every other system mentioned in this paper, improving the performance of the entity linking system.

17:30-19:00
Xi Xu, Mao Ye, Zhi TANG, Jianbo Xu and Liangcai Gao
ABSTRACT: With the arrival of digital newspapers, user-oriented special topic generation has become extremely important for satisfying users' requirements both functionally and emotionally. We propose a practical automatic special topic generation system for digital newspapers based on users' interests. Firstly, it extracts a subject heading vector for the topic of interest by filtering out function words, localizing Latent Dirichlet Allocation (LDA) and training the LDA model. Secondly, it removes semantically repetitive vector components by constructing a synonym word map. Lastly, it organizes and refines the special topic according to the similarity between candidate news articles and the topic, and the density of topic-related terms. The experimental results show that the system combines simple operation with high accuracy, and that it is stable enough to be applied to user-oriented special topic generation in practical applications.

17:30-19:00
Mingyang Li, Yao Shi, Zhigang Wang and Yongbin Liu
ABSTRACT: Cross-lingual knowledge bases are very important for global knowledge sharing. However, there are few Chinese-English knowledge bases due to the following reasons: 1) the scarcity of Chinese knowledge in existing cross-lingual knowledge bases; 2) the limited number of cross-lingual links; 3) incorrect relationships in the semantic taxonomy. In this paper, a large-scale cross-lingual knowledge base (named XLORE) is built to address the above problems. In particular, XLORE integrates four online wikis, English Wikipedia, Chinese Wikipedia, Baidu Baike and Hudong Baike, to balance the knowledge volume in different languages, employs a link-discovery method to augment the cross-lingual links, and introduces a pruning approach to refine the taxonomy. In total, XLORE harvests 663,740 classes, 56,449 properties, and 10,856,042 instances, of which 507,042 entities are cross-lingually linked. Finally, we provide an online cross-lingual knowledge base system supporting two ways to access XLORE, namely a search engine and a SPARQL endpoint.

17:30-19:00
Wei Chen and Bo Xu
ABSTRACT: Hierarchical phrase-based translation models have advanced statistical machine translation (SMT). Because such models can improve the leveraging of syntactic information, two types of methods (leveraging source parsing and leveraging shallow parsing) have been applied to introduce syntactic constraints into translation models. In this paper, we propose a bilingually-constrained recursive neural network (BC-RNN) model to combine the merits of these two types of methods. First we perform supervised learning on a manually parsed corpus using the standard recursive neural network (RNN) model. Then we employ unsupervised bilingually-constrained tuning to improve the accuracy of the standard RNN model. Leveraging the BC-RNN model, we introduce both source parsing and shallow parsing information into a hierarchical phrase-based translation model. The evaluation demonstrates that our proposed method outperforms other state-of-the-art statistical machine translation methods on the National Institute of Standards and Technology 2008 (NIST 2008) Chinese-English machine translation test data.

17:30-19:00
Qing Xia, Xin Yan, Zhengtao Yu and Shengxiang Gao
ABSTRACT: Named entity equivalents play a significant role in cross-language information processing. However, limited by corpus resources, few in-depth studies have been made on the extraction of bilingual Chinese-Khmer named entity equivalents. In view of this, this paper proposes a Wikipedia-based approach that utilizes the internal web links in Wikipedia and computes feature similarity to extract bilingual Chinese-Khmer named entity equivalents. The experimental results show good performance when the entity equivalents are acquired through the internal web links in Wikipedia, with an F value of up to 90.67%. The results are also quite favorable when the bilingual Chinese-Khmer named entity equivalents are acquired through the computation of feature similarity, demonstrating that the method proposed in this paper is effective.

17:30-19:00
Yijiang Chen, Tingting Zhu, Chang Su and Xiaodong Shi
ABSTRACT: In this paper, we transform the issue of Chinese-English tense conversion into the issue of tagging a Chinese tense tree. We then propose a Markov Tree Tagging Model to tag the nodes of the untagged tense tree with English tenses. Experimental results show that the method is much better than linear CRF tagging for this problem.

17:30-19:00
Liping Du, Xiaoge Li, Gen Yu, Chunli Liu and Rui Liu
ABSTRACT: Chinese word segmentation is fundamental to Chinese natural language processing and information extraction. With the rapid development of Web 2.0 technology, recognition of new internet words is the main problem and bottleneck for Chinese segmentation. We present an unsupervised method for identifying new internet words in large-scale web corpora, which combines an improved point-wise mutual information measure, the PMI^k algorithm, with some basic rules. This method can recognize new internet words of length 2 to n (where n can be set as needed). In experiments on a 257 MB Baidu Tieba corpus, the precision of our system reached 97.39% when the parameter of the PMI^k algorithm was set to 10, an increase of 28.79% over the plain PMI method, showing that our system is effective and efficient at detecting new words in large-scale web corpora. Compiling the discovered new words into a user dictionary and loading it into ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), in an experiment on a 10 KB Baidu Tieba corpus, precision, recall and F-measure improved by 7.93%, 3.73% and 5.91% respectively compared to ICTCLAS alone, showing that new word discovery can significantly improve segmentation performance on web corpora. Keywords: new word recognition; unknown word; PMI; improved PMI algorithm; Chinese word segmentation.
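
For reference, the PMI^k family raises the joint probability to the k-th power, which rewards frequent, cohesive pairs that plain PMI under-ranks. A minimal sketch follows; reading the abstract's parameter value 10 as the exponent k is an assumption.

import math

def pmi_k(p_xy, p_x, p_y, k=1):
    """PMI^k(x, y) = log( p(x,y)^k / (p(x) * p(y)) ).

    With k = 1 this is plain PMI; larger k penalizes rare,
    coincidental co-occurrences relative to frequent ones.
    """
    return math.log(p_xy ** k / (p_x * p_y))

# Plain PMI (k = 1) ranks the rare coincidental pair above the
# frequent cohesive one; with k = 3 the ordering flips.
for k in (1, 3):
    frequent = pmi_k(1e-3, 2e-3, 2e-3, k)   # seen often together
    rare     = pmi_k(1e-6, 1e-6, 1e-6, k)   # seen once together
    print(k, frequent > rare)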

17:30-19:00
Ping Yang, Hongxu Hou, Yupeng Jiang, Zhipeng Shen and Jian Du
ABSTRACT: Chinese to Slavic Mongolian named entity translation is of great importance for cross-lingual information processing between Chinese and Slavic Mongolian. However, applying machine translation directly does not achieve satisfactory results. To solve this problem, a novel approach is proposed to extract Chinese-Slavic Mongolian named entity pairs automatically. Only the Chinese named entities need to be identified; all candidate named entity pairs are then extracted with a sliding window method based on HMM word alignment results. Finally, all candidate named entity translation units are filtered by a maximum entropy model integrating four features, and the most probable aligned Slavic Mongolian NE is chosen for each Chinese NE. Experimental results show that this approach outperforms the HMM model and yields high-quality Chinese-Slavic Mongolian named entity pairs with relatively high precision, even when the word alignment result is only partially correct.

17:30-19:00
Amandyk Kartbayev
ABSTRACT: Word alignment plays an important role in the training of statistical machine translation systems. We present a technique to refine word alignments at the phrase level after collecting sentences from Kazakh-English parallel corpora. The estimation technique extracts phrase pairs from the word alignment and then incorporates them into the translation system for further steps. Although it is an important step in the training procedure, word alignment often runs into practical difficulties with agglutinative languages. We consider an approach that is a step towards an improved statistical translation model incorporating morphological information, with better translation performance. Our goal is to present a statistical model of this morphology-dependent procedure, which was evaluated on the Kazakh-English language pair and obtained an improved BLEU score over state-of-the-art models.
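
Phrase-pair extraction from a word alignment is commonly done with the standard consistency criterion (Koehn et al., 2003): a source span and a target span form a pair if no alignment link crosses the box they define. Below is a compact sketch of that criterion, not the paper's own refinement step; the toy Kazakh-English example is illustrative only.

def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """src, tgt: token lists; alignment: set of (i, j) links.

    Returns consistent phrase pairs up to max_len source words.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # target span covered by links from src[i1..i2]
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # consistency: no link from outside the source span
            # may point inside the target span
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.append((src[i1:i2 + 1], tgt[j1:j2 + 1]))
    return pairs

print(extract_phrase_pairs(["мен", "сені", "көремін"],
                           ["i", "will", "see", "you"],
                           {(0, 0), (1, 3), (2, 1), (2, 2)}))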

17:30-19:00
Chunyue Zhang and Tiejun Zhao
ABSTRACT: Distributed word representations have been found to be highly effective for extracting a bilingual lexicon from comparable corpora via a simple linear transformation. However, polysemous words often vary in meaning at different time points in the corresponding corpora. A single word representation learned from the whole corpus cannot express the temporal change of word meaning very well. This paper proposes a simple solution that exploits temporal distributed word representations for polysemous words. The experimental results confirm that the proposed solution offers better performance on the English-to-Chinese bilingual lexicon extraction task.
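
The linear-transformation approach mentioned here, in the style of Mikolov et al.'s translation matrix, fits a matrix W mapping source embeddings to target embeddings from a seed lexicon and then translates by nearest neighbour. A minimal sketch under those assumptions:

import numpy as np

def fit_translation_matrix(X, Y):
    """Least-squares W minimizing ||X W - Y||_F^2.

    X: (n, d_src) source embeddings of seed pairs
    Y: (n, d_tgt) target embeddings of the same pairs
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translate(x, W, tgt_vocab, tgt_vecs):
    """Map a source vector and return the nearest target word (cosine)."""
    z = x @ W
    sims = (tgt_vecs @ z) / (
        np.linalg.norm(tgt_vecs, axis=1) * np.linalg.norm(z) + 1e-12)
    return tgt_vocab[int(np.argmax(sims))]

A time-aware variant, as the abstract suggests, would learn embeddings per time slice and fit W on slice-matched seed pairs.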

17:30-19:00
Jing Li, Zhongyu Wei, Hao Wei, Kangfei Zhao, Junwen Chen and Kam-Fai Wong
ABSTRACT: Microblogging websites have emerged as centers of information production and diffusion, where people can get useful information from other users' microblog posts. In the era of big data, we are overwhelmed by the sheer number of microblog posts. To make good use of these informative data, an effective search tool specialized for microblog posts is required. However, microblog search is not trivial for the following reasons: 1) microblog posts are noisy and time-sensitive, rendering general information retrieval models ineffective; 2) conventional IR models are not designed to consider microblog-specific features. In this paper, we propose to use learning to rank models for microblog search. We combine content-based, microblog-specific and temporal features in learning to rank models, which are found to model microblog posts effectively. To study the performance of learning to rank models, we evaluate our models on the tweet data sets provided by the TREC 2011 and TREC 2012 microblog tracks, comparing against three state-of-the-art information retrieval baselines: the vector space model, the language model and the BM25 model. Extensive experimental studies demonstrate the effectiveness of learning to rank models and the usefulness of integrating microblog-specific and temporal information for microblog search.

17:30-19:00
Yanyan Zhao, Bing Qin and Ting Liu
ABSTRACT: Current online public opinion analysis systems can discover many hot events and present the public emotion distribution for each event, which is useful for governments and companies. However, the public emotion distributions are only a shallow analysis of hot events; more and more people want to know the hidden causation behind the emotion distributions. Thus, this paper presents a deep Event-Emotion analysis system for microblogs that reveals what causes the different emotions around a hot event. We use several related sub-events to describe a hot event from different perspectives; these sub-events, combined with their different emotion distributions, can be used to explain the overall emotion distribution of the hot event. Experiments on 15 hot events show that this idea is a reasonable way to exploit emotion causation and can help people better understand the evolution of a hot event. Furthermore, the deep Event-Emotion analysis system also tracks the volume trends and emotion trends of a hot event, and presents deep analysis based on user profiles.

17:30-19:00
Linlin Shi, Likun Qiu and Shiyong Kang
ABSTRACT: Phrase structure trees and dependency trees are two forms of treebank and can be converted into each other. In this paper, we transform dependency trees into phrase structure trees and detect annotation errors automatically based on manual rules. The method has been used in processing the Peking University Multi-view Chinese Treebank (PMT). Although PMT had been manually checked twice before being processed by our method, 1529 errors were detected among the 50275 sentences, with a precision of 100%. The errors mainly belong to three types: word segmentation errors, mismatches between POS and syntactic role, and syntactic role errors. This method can further improve treebank quality and can be applied to other dependency treebanks.

17:30-19:00
Yichu Zhou, Shujian Huang, Xin-Yu DAI and Jiajun CHEN
ABSTRACT: Coordinate structures are linguistic structures consisting of two or more conjuncts, which usually compose into a larger constituent as a whole. However, the boundary of each conjunct is difficult to identify, which makes it difficult to parse the whole coordinate structure and larger structures. In labeled data such as the Penn Chinese Treebank (CTB), coordinate structures are not labeled explicitly, which makes the problem more complicated. In this paper, we treat resolving coordinate structures as an independent sub-problem of parsing. We first define coordinate structures explicitly and design rules to extract them from labeled CTB data. Then a specifically designed grammar is proposed for automatic parsing of coordinate structures. We propose two groups of new features to better model coordinate structures in a shift-reduce parsing framework. Our approach achieves a 15% improvement in F1 score on resolving coordinate structures.

17:30-19:00
Zhengxian Gong and Guodong Zhou
ABSTRACT: Document-level machine translation (MT) has been drawing more and more attention due to its potential for resolving sentence-level ambiguities and inconsistencies with the benefit of wide-range context. However, the lack of simple yet effective evaluation metrics largely impedes the development of such document-level MT systems. This paper proposes to improve traditional MT evaluation metrics with simplified lexical chains, modeling document-level phenomena from the perspective of text cohesion. Experiments show the effectiveness of this method in evaluating document-level translation quality and its potential for integration with traditional MT evaluation metrics to achieve higher correlation with human judgments.

17:30-19:00
Yanhui Gu, Weiguang Qu, Yonggen Wang, Suoliang Jiang, Junsheng Zhou and Yunfei Long
ABSTRACT: Effective and efficient retrieval of similar spatial textual objects plays an important role in many location-based applications, such as Jiepang and Foursquare. Most such services focus on how to integrate spatial and textual information to efficiently retrieve the top-k results, yet few of them address the effectiveness issue. In this paper, we propose a semantics-aware strategy which can effectively and efficiently retrieve the top-k similar spatial textual objects within a general framework. Extensive experimental evaluation demonstrates that our proposal outperforms the state-of-the-art approach.
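
A common baseline for this kind of query scores each object by a weighted blend of spatial proximity and textual similarity and keeps the k best. The sketch below is that generic baseline only; the blending weight and similarity functions are assumptions, not the paper's semantics-aware strategy.

import heapq
import math

def top_k(query_loc, query_terms, objects, k=5, alpha=0.5, max_dist=10.0):
    """objects: list of (id, (x, y), set_of_terms). Returns the best k objects.

    score = alpha * spatial proximity + (1 - alpha) * Jaccard similarity
    """
    def score(loc, terms):
        d = math.dist(query_loc, loc)
        spatial = max(0.0, 1.0 - d / max_dist)   # 1 at the query, 0 far away
        union = query_terms | terms
        textual = len(query_terms & terms) / len(union) if union else 0.0
        return alpha * spatial + (1 - alpha) * textual
    return heapq.nlargest(k, objects, key=lambda o: score(o[1], o[2]))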

17:30-19:00
Xiangdong Su, Guanglai Gao, Yupeng Jiang, Jing Wu and Feilong Bao
ABSTRACT: Inflection suffixes are an important morphological characteristic of Mongolian words, since the suffixes express abundant syntactic and semantic meanings. In order to provide an informative introduction, this paper carries out a case study. Through three Mongolian NLP tasks, we examine: (1) views of inflection suffixes in NLP tasks, (2) ways of processing inflection suffixes, (3) the effect of inflection suffixes on system performance, and (4) some suffix-related conclusions.

17:30-19:00
Maofu Liu, Limin Wang and Liqiang Nie
ABSTRACT: The past several years have witnessed the rapid development of social media services, and UGC (User Generated Content) has increased dramatically, such as tweets on Twitter and posts on Sina Weibo. In this paper, we describe our system for the NLPCC 2015 Weibo-oriented Chinese news summarization task. Our model is based on multi-feature combination to automatically generate a summary for a given news article. In our system, we mainly utilize four kinds of features to compute the significance score of a sentence: term frequency, sentence position, sentence length and the similarity between the sentence and the news article title. The summary sentences are then chosen from the news article according to the significance score of each sentence. The evaluation results on the Weibo news document sets show that our system is effective for Weibo-oriented Chinese news summarization and outperforms all the other systems.
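
A linear combination of the four features named above is easy to sketch; the weights and normalizations here are illustrative assumptions, not the tuned values of the submitted system.

from collections import Counter

def significance(sentence, position, n_sents, title, doc_tf,
                 w=(0.4, 0.3, 0.1, 0.2)):
    """Score one tokenized sentence by the four features in the abstract."""
    tf = sum(doc_tf[t] for t in sentence) / (len(sentence) or 1)  # term frequency
    pos = 1.0 - position / n_sents                                 # earlier is better
    length = min(len(sentence) / 20.0, 1.0)                        # capped length
    overlap = len(set(sentence) & set(title)) / (len(set(title)) or 1)  # title sim.
    return w[0] * tf + w[1] * pos + w[2] * length + w[3] * overlap

def summarize(sentences, title, k=3):
    doc_tf = Counter(t for s in sentences for t in s)
    n = max(doc_tf.values())
    doc_tf = {t: c / n for t, c in doc_tf.items()}  # normalize to [0, 1]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: significance(sentences[i], i, len(sentences),
                                               title, doc_tf),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]  # keep original order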

17:30-19:00
Jun Li, Jinxian Pan, Chen Ye, Yong Huang, Zhichun WANG and Danlu Wen
ABSTRACT: This paper presents our approach for the NLPCC 2015 shared task on Entity Recognition and Linking in Chinese Search Queries. The proposed approach takes a query as input and generates ranked mention-entity links as results. It combines several different metrics to evaluate the probability of each entity link, including entity relatedness in the given knowledge graph and document similarity between the query and the virtual document of the entity in the knowledge graph. In the evaluation, our approach achieves 33.2% precision and 65.2% recall, ranking 6th among the 14 teams by average F1-measure.

17:30-19:00
Zhengting Yu, Xin-Yu DAI, Si Shen, Shujian Huang and Jiajun CHEN
ABSTRACT: This paper describes the model we designed for the Chinese word segmentation task of NLPCC 2015. We first apply a word-based perceptron algorithm to build the base segmenter. Then, we use a bootstrap aggregating (bagging) model, which improves the segmentation results consistently on the three tracks of the closed, semi-open and open tests. Considering the characteristics of Weibo text, we also perform rule-based adaptation before decoding. Finally, our model achieves F-scores of 95.12% on the closed track, 95.3% on the semi-open track and 96.09% on the open track.

17:30-19:00
Gongbo TANG, Yuting Guo, Dong Yu and Endong XUN
ABSTRACT: In this paper, we construct an entity recognition and linking system using Chinese Wikipedia and a knowledge base. We utilize refined filter rules in the entity recognition module, and then generate candidate entities from a search engine and from attributes in Wikipedia article pages. In the entity linking module, we propose a hybrid entity re-ranking method combining three features: textual and semantic match degree, the similarity between the candidate entity and the entity mention, and entity frequency. Finally, we obtain the linking results from each entity's final score. In the task of entity recognition and linking in search queries at NLPCC 2015, this method achieved an average F1 of 61.1% on the 3849-query test dataset, ranking second among fourteen teams.

17:30-19:00
Xin Zhou
ABSTRACT: A trie is an ordered tree data structure used to store a dynamic set or associative array whose keys are usually strings. It makes the search and update of words more efficient and is widely used in the construction of English dictionaries for the storage of English vocabulary. In big data applications, efficiency determines the availability and usability of a system. In this paper, I introduce the p-trie, a novel trie structure that can be used for polysemantic data not limited to English strings. I apply the p-trie to the storage of Japanese vocabulary and evaluate its performance through experiments.
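
For readers unfamiliar with the base structure, a standard trie looks like the sketch below; the p-trie described in the abstract extends this idea to polysemantic keys, and its details are not reproduced here.

class TrieNode:
    def __init__(self):
        self.children = {}   # one edge per character
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def search(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
t.insert("辞書")          # works for Japanese keys just as well
print(t.search("辞書"))   # True
print(t.search("辞"))     # False: a stored prefix is not a word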

Chinese Language Computing II:
Oct. 13, 2015 (14:00–15:20),
Treasure Palace Hotel - Four Seasons (二楼四季),
Chair: Yansong FENG

14:00-14:20
Xudong Chen, Zhouhui Lian, Jianguo Xiao and Yingmin Tang
ABSTRACT: How to automatically and accurately extract strokes from Chinese characters is one of the most important and challenging tasks in document analysis, computer graphics and pattern recognition. However, no benchmark is currently available to evaluate the performance of stroke extraction methods. In this paper, we present a benchmark, which includes a manually constructed database and evaluation tools, to address this problem. Specifically, the database contains a number of images of Chinese characters rendered in four commonly used font styles, along with the corresponding stroke images manually segmented from the character images. The performance of a given stroke extraction method can be evaluated by calculating the dissimilarity between the automatic segmentation results and the ground truth using two specially designed metrics. Moreover, we also propose a new method based on Delaunay triangulation to effectively extract strokes from Chinese characters. Experimental results comparing three algorithms demonstrate that the benchmark works well for the evaluation of stroke extraction approaches and that the proposed method performs considerably well for stroke extraction from Chinese characters.

14:20-14:40
Hao BAI
ABSTRACT: Extracting characters from digital ink text is an essential step toward more reliable recognition of text and is also a prerequisite for structured editing. The casualness and diversity of handwriting input result in unsatisfactory accuracy of the extracted characters, while reprocessing the initially extracted characters based on context yields considerable improvement. This paper therefore proposes an approach to adaptively extracting characters from digital ink text in Chinese based on extraction errors. The approach first classifies the errors in the primary extraction and then applies different operations according to the error type. Experimental data show that the approach is effective.

14:40-15:00
Bojia Liu, Jinan Xu, Yufeng Chen and Yujie Zhang
ABSTRACT: The main methods of machine transliteration are the phoneme-based method and the grapheme-based method, and both have limitations. The former needs multiple steps to transfer pronunciation between languages, which introduces errors; the latter ignores the importance of phonemes, which causes a loss of information. At the same time, a major difficulty for machine transliteration between different language systems lies in the performance of segmentation and the accuracy of alignment of transliteration units in the parallel corpus. Therefore, this paper proposes a new method for transliteration unit alignment that integrates the two main transliteration methods. Experimental results show that this method outperforms other methods in machine transliteration performance.

15:00-15:25*
Liping Mo and Kaiqing Zhou
ABSTRACT: To effectively solve the glyph generation and glyph description problems, a dynamic glyph generation method for Xiangxi folk Hmong characters is proposed. In this method, the glyph generation process is described as a combination arithmetic expression: Hmong character components act as the operands, and the positional relationship between components determines the operator. Glyphs of different structures can be dynamically generated by combining two or three components. Furthermore, if the combination arithmetic expression is converted to an ideographic description sequence (IDS), the proposed method can be implemented with the help of the IDS interpretation mechanism of the operating system. Test results show that Xiangxi Hmong character glyphs generated by a mapping script based on the proposed method can meet practical requirements.
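
For context, an ideographic description sequence prefixes components with Unicode ideographic description characters (U+2FF0–U+2FFB) that name the layout operator. The toy sketch below builds such prefix expressions for two common layouts; the component characters are illustrative CJK examples, not Hmong glyph data.

# Ideographic description characters (layout operators, prefix notation)
LEFT_RIGHT = "\u2FF0"   # ⿰ : left component + right component
ABOVE_BELOW = "\u2FF1"  # ⿱ : upper component + lower component

def ids(operator, *components):
    """Build an IDS string: operator followed by its two or three operands."""
    return operator + "".join(components)

# ⿰氵工 describes 江 (water radical on the left, 工 on the right)
print(ids(LEFT_RIGHT, "氵", "工"))
# ⿱爫子 describes a top-bottom composition
print(ids(ABOVE_BELOW, "爫", "子"))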

Event and Topic Models:
Oct. 13, 2015 (14:00–15:20),
Treasure Palace Hotel - No.5 Meeting Room (二楼五号),
Chair: Zhichun WANG

14:00-14:20
Shaohua Zhu, Peifeng Li and Qiaoming Zhu
ABSTRACT: Previous Chinese argument extraction approaches mainly focus on feature engineering and trigger expansion, which cannot exploit the inner relations between trigger mentions in the same document. To address this issue, this paper puts forward a novel trigger inference mechanism based on a Markov Logic Network. It uses head morphemes, the probabilities of a trigger mention filling true and pseudo events estimated from the training set, and the relationships between trigger mentions to infer those trigger mentions that lack effective context information or have low confidence. Experimental results on the ACE 2005 Chinese corpus show that our approach outperforms the baseline, with improvements of 3.65% and 2.51% in trigger identification and event type classification respectively.

14:20-14:40
Jiayue Teng, Peifeng Li and Qiaoming Zhu
ABSTRACT: Most pairwise resolution models for event co-reference focus on classification or clustering approaches, which ignore the relations between events in a document. This paper proposes a global optimization model for event co-reference resolution to resolve the inconsistent event chains produced by classifier-based approaches. The model regards co-reference resolution as an Integer Linear Programming problem and introduces various kinds of constraints, such as symmetry, transitivity, triggers, argument roles and event distances, to further improve performance. The experimental results show that our model outperforms the local classifier by 4.20% in F1-measure.
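
A generic ILP formulation of this kind of global co-reference step (not necessarily the authors' exact constraint set) can be written as follows, where p_{ij} is the pairwise classifier's co-reference probability for events i and j:

\max_{x}\ \sum_{i<j} \bigl[\, p_{ij}\, x_{ij} + (1 - p_{ij})(1 - x_{ij}) \,\bigr]
\quad \text{s.t.}\quad x_{ij} \in \{0, 1\},\qquad
x_{ij} + x_{jk} - 1 \le x_{ik} \ \text{(and its two permutations)}\ \ \forall\, i < j < k

Symmetry is implicit when x is indexed over unordered pairs; trigger, argument-role and event-distance compatibilities would enter as additional linear constraints on the x_{ij}.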

14:40-15:00
Jiabing Fu and Shoubin Dong
ABSTRACT: Current studies of story chains merely focus on the topical similarity and the importance of documents, while almost ignoring logical coherence and explainability. Facing the algorithmic complexity brought about by the exponential growth of news data sets, this paper constructs a story chain from a word coverage perspective. We take advantage of story comments to locate the turning point of each event, use topical similarity and sparsity differences together with an RPCA approach to model the documents logically, and finally apply random walks and graph traversal to quantify and construct an explainable and logically coherent story chain. A double-blind experiment shows that our method outperforms other algorithms.
|
15:00-15:20 |
Rui Cai, Miaohong Chen and Houfeng WANG
ABSTRACT: Topic models aim to analyze collections of documents and have been widely used in machine learning and natural language processing. Recently, researchers have proposed topic models for multilingual parallel or comparable documents. The symmetric correspondence Latent Dirichlet Allocation (SymCorrLDA) is one such model. Despite its advantages over some other existing multilingual topic models, it is a classic Bayesian parametric model and thus inherits the shortcomings of that family; for example, the number of topics must be specified in advance. Motivated by this, we extend the model and propose a Bayesian nonparametric version (NPSymCorrLDA). Experiments on Chinese-English datasets extracted from Wikipedia show significant improvement over SymCorrLDA.
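NPSymCorrLDA itself is not publicly available; as a stand-in for its nonparametric ingredient, the sketch below runs gensim's Hierarchical Dirichlet Process model, which likewise infers the number of topics from data, on a toy monolingual corpus:

```python
# Nonparametric topic modeling sketch with gensim (pip install gensim).
# HDP needs no num_topics argument: the topic count is inferred, which is
# the property the paper's NPSymCorrLDA brings to multilingual modeling.
from gensim.corpora import Dictionary
from gensim.models import HdpModel

texts = [
    ["topic", "model", "document", "corpus"],
    ["chinese", "english", "parallel", "corpus"],
    ["dirichlet", "process", "topic", "inference"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

hdp = HdpModel(corpus, id2word=dictionary)
for line in hdp.print_topics(num_topics=3, num_words=4):
    print(line)
```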
|
NLP for Search:
Oct. 13, 2015(14:00–15:20),
Treasure Palace Hotel-V18 Meeting Room(四楼V18),
Chair: Zhengtao YU
|
14:00-14:20 |
Zhongda Xie and Yunqing Xia
ABSTRACT: Term dependency models are generally better than bag-of-words models, because complete concepts are often represented by multiple terms. However, without semantic knowledge, such models may introduce many false dependencies among terms, especially when the document collection is small and homogeneous (e.g. newswire or medical documents). The main contribution of this work is to incorporate semantic knowledge into term dependency models, so that more accurate dependency relations are assigned to the terms in a query. Experiments are conducted on the CLEF 2013 eHealth Lab medical information retrieval dataset, with the popular MRF (Markov Random Field) model [1], which has proven better than traditional independence models in general-domain search, as the baseline term dependency model. Results show that in medical document retrieval the full-dependency MRF model is worse than the independence model, but that it can be significantly improved by incorporating semantic knowledge.
Keywords: semantic knowledge, term dependency model, Markov Random Field, medical document retrieval
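A minimal sketch of the core idea, with a three-entry concept table standing in for a real medical ontology: a dependency between two query terms is kept only when semantic knowledge licenses it, instead of assuming full dependence:

```python
# Semantic filtering of MRF-style term dependencies. The concept table is an
# invented stand-in for a real semantic resource such as UMLS.
from itertools import combinations

CONCEPTS = {"heart": "cardio", "attack": "cardio", "aspirin": "drug"}

def dependency_pairs(query_terms):
    """Full-dependency pairs, filtered by shared semantic category."""
    return [(a, b) for a, b in combinations(query_terms, 2)
            if a in CONCEPTS and CONCEPTS.get(a) == CONCEPTS.get(b)]

print(dependency_pairs(["heart", "attack", "aspirin"]))
# [('heart', 'attack')] -- 'aspirin' no longer forms false dependencies
```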
|
14:20-14:40 |
Wei Song, Zhiyong Peng and Yihui Cui
ABSTRACT: Nowadays, more and more Internet users use cloud storage services to store their personal data, especially as mobile devices with limited storage capacity become widespread. With cloud storage services, users can access their personal data at any time and anywhere without storing it locally. However, the cloud storage provider is not completely trusted, so the first concern in using cloud storage is data security. A straightforward way to address the security problem is to encrypt the data before uploading it to the cloud server. Encryption keeps the data secret from the cloud server, but the server then can no longer manipulate the data, which greatly undermines the advantage of cloud storage. For example, if a user encrypts his personal data before uploading it, then whenever he wants to access some of it he has to download and decrypt all of it, incurring huge communication and computation overheads. Several related works enable search over encrypted data, but all of them support only encrypted keyword search. In this paper, we propose a new full-text retrieval algorithm over encrypted data for the cloud storage scenario, in which all the words in a document are extracted and built into a privacy-preserving full-text retrieval index. Based on this index, the cloud server can execute full-text retrieval over large-scale encrypted document collections. Numerical analysis and experimental results further validate the high efficiency and scalability of the proposed algorithm.
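The paper's index construction is not spelled out in the abstract; the sketch below shows one standard way such a privacy-preserving full-text index can work, mapping every word to a keyed pseudorandom token so the server matches tokens without seeing plaintext:

```python
# Minimal privacy-preserving full-text index sketch: every word of every
# document becomes an HMAC token under a client-held key. A real system must
# also hide posting-list sizes, encrypt the documents themselves, etc.
import hashlib
import hmac

KEY = b"client-secret-key"          # never leaves the client

def token(word):
    return hmac.new(KEY, word.encode("utf-8"), hashlib.sha256).hexdigest()

def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):   # full text, not just keywords
            index.setdefault(token(word), []).append(doc_id)
    return index                                  # safe to upload to the cloud

docs = {1: "cloud storage security", 2: "full text retrieval in the cloud"}
index = build_index(docs)
print(index.get(token("cloud")))    # server-side lookup by token: [1, 2]
```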
|
14:40-15:00 |
Jingfei Li, Dawei Song and Peng Zhang
ABSTRACT: Session search aims to improve ranking effectiveness by incorporating user interaction information, including short-term interactions within one session and global interactions from other sessions (or other users). While various session search models have been developed and a large number of interaction features have been used, there has been no systematic investigation of how different features influence session search. In this paper, we classify typical interaction features into four categories (current query, current session, query change, and collective intelligence) and investigate their impact on session search performance through a systematic empirical study under the widely used learning-to-rank framework. One of our key findings, different from what has been reported in the literature, is that features based on the current query and collective intelligence have a more positive influence than features based on query change and the current session. This provides insights for the development of future session search techniques.
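A sketch of the kind of ablation such a study implies, not the authors' setup: features are grouped into the four categories and each group's contribution is measured by retraining a simple pointwise ranker without it, on randomly generated placeholder data:

```python
# Per-category feature ablation under a pointwise learning-to-rank stand-in.
# Column assignments, data and the ranker are placeholders for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

CATEGORIES = {                       # column indices per feature category
    "current_query": [0, 1], "current_session": [2, 3],
    "query_change": [4, 5], "collective_intelligence": [6, 7],
}
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.random(200)   # y: graded relevance

def fit_score(cols):
    model = GradientBoostingRegressor().fit(X[:, cols], y)
    return model.score(X[:, cols], y)

all_cols = sorted(c for cols in CATEGORIES.values() for c in cols)
full = fit_score(all_cols)
for name, cols in CATEGORIES.items():
    rest = [c for c in all_cols if c not in cols]
    print(f"drop {name}: delta = {full - fit_score(rest):+.3f}")
```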
|
15:00-15:20 |
Xiao-yun Li and Ying Yu
ABSTRACT: During a search task, a user's search intention may be expressed inaccurately; even with a clear information need, the search query often cannot precisely describe it, and the user cannot possibly browse all the returned results. A selective, valuable result list is therefore quite important for a search system. In fact, many reliable and highly relevant personal documents exist on a user's personal computer. From these desktop documents it is relatively easy to gauge the user's current knowledge of the present search subject, which helps predict the user's need. We propose an approach that exploits desktop context to refine the returned result list. First, to build a comprehensive long-term user model, the operational history and a series of time-related signals are analyzed to estimate the attention degree a user has paid to each document, while keywords and user tags are used to understand its content. Second, the working scenario, which directly suggests what the user is currently working on, is treated as the most valuable information for a short-term user model. Experimental results show that desktop context can effectively help refine search results, and that only an effective combination of the long-term and short-term user models offers more relevant items that satisfy the user.
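A rough sketch of the two user models under stated assumptions (an exponential recency decay for the long-term attention degree, a fixed interpolation weight for combining the models); all data and weights are invented:

```python
# Long-term model: recency-decayed operation counts per desktop document.
# Short-term model: a working-scenario score. Both rerank the result list.
import math
import time

HALF_LIFE = 7 * 24 * 3600            # one week, in seconds (assumed)

def attention(ops, now=None):
    """ops: list of (timestamp, weight) operations on one document."""
    now = now or time.time()
    return sum(w * math.exp(-math.log(2) * (now - t) / HALF_LIFE)
               for t, w in ops)

def rerank(results, long_term, short_term, alpha=0.6):
    """results: {doc: engine_score}; user-model scores per document."""
    blended = {d: s + alpha * long_term.get(d, 0.0)
                     + (1 - alpha) * short_term.get(d, 0.0)
               for d, s in results.items()}
    return sorted(blended, key=blended.get, reverse=True)

now = time.time()
lt = {"paper.pdf": attention([(now - 3600, 1.0), (now - 40 * 86400, 1.0)], now)}
print(rerank({"paper.pdf": 0.5, "misc.txt": 0.55}, lt, {"paper.pdf": 0.8}))
```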
|
NLP for Social Media:
Oct. 13, 2015(15:40–17:00),
Treasure Palace Hotel-Four Seasons(二楼四季),
Chair: Bin WANG
|
15:40-16:00 |
Qiao Zhang, Shuiyuan Zhang, Jian Dong, Jinhua Xiong and Xueqi Cheng
ABSTRACT: The rumor detection problem on social networks has attracted considerable attention in recent years. Most previous works detect rumors using shallow features of messages, such as content and blogger features, but such features cannot distinguish rumor messages from normal ones in many cases. In this paper we therefore propose an automatic rumor detection method that combines newly proposed implicit features with shallow features of the messages. The implicit features include popularity orientation, internal and external consistency, sentiment polarity and opinion of comments, social influence, opinion retweet influence, and match degree of messages. Experiments illustrate that our rumor detection method obtains significant improvement over state-of-the-art approaches, and that the proposed implicit features are effective for rumor detection on social networks.
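A skeletal version of the pipeline, with placeholder extractors where the paper's shallow and implicit features would go, and an off-the-shelf classifier:

```python
# Rumor detection as feature concatenation + classification. The extractors
# and toy messages are invented; the paper's implicit features (popularity
# orientation, consistency, comment sentiment, ...) would replace them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def shallow_features(msg):      # content + blogger features (placeholder)
    return [len(msg["text"]), msg["followers"]]

def implicit_features(msg):     # e.g. comment sentiment, retweet influence
    return [msg["comment_sentiment"], msg["retweet_influence"]]

msgs = [
    {"text": "breaking!!!", "followers": 10, "comment_sentiment": -0.8,
     "retweet_influence": 0.9, "rumor": 1},
    {"text": "city marathon this sunday", "followers": 5000,
     "comment_sentiment": 0.4, "retweet_influence": 0.1, "rumor": 0},
] * 20                                             # duplicated toy data

X = np.array([shallow_features(m) + implicit_features(m) for m in msgs])
y = np.array([m["rumor"] for m in msgs])
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:2]))       # should recover the two labels: [1 0]
```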
|
16:00-16:20 |
Feng Liu, Bingquan Liu, Chengjie SUN, Ming Liu and Xiaolong Wang
ABSTRACT: The link prediction problem in social networks is to estimate the value of a link that represents the relationship between social members. Researchers have proposed several methods for link prediction, using a wide range of features, but most of these models are learned from only one kind of data. In this paper, considering both link network structure and user comments, each of which reflects link value, we propose multimodal learning based approaches to predict link values. Experiments on datasets from typical social networks show that our model learns the joint representation of these data sources properly, and that the MDBN method outperforms other state-of-the-art link prediction methods.
|
16:20-16:40 |
Bo Jiang, Ying Sha and Lihong Wang
ABSTRACT: Mention is an important interactive behavior used to explicitly direct specific information to target users in social networks. Understanding user mention behavior can provide important insights into human social behavior and improve the design of social network platforms. However, most previous works focus on mentioning as a vehicle for information diffusion; few consider the problem of mention behavior prediction. In this paper, we propose an intuitive approach to predicting user mention behavior with a link prediction method. Specifically, we first formulate user mention prediction as a classification task, and then extract new features, including semantic interest match, social tie, mention momentum and interaction strength, to improve prediction performance. Extensive experiments on a Twitter dataset clearly show that our approach achieves a 15% increase in precision over the best baseline method.
|
16:40-17:00 |
Ya Su, Jie Liu and Yalou Huang
ABSTRACT: The authors design recognition features for online medical text that take the characteristics of the medical field into account, and carry out entity recognition experiments on a self-built dataset covering five common diseases: gastritis, lung cancer, asthma, hypertension and diabetes. In the experiments, a Conditional Random Field model is used for training and testing, and the target entities comprise five kinds: diseases, symptoms, drugs, treatment methods and examinations. The effectiveness of the proposed features is verified experimentally, achieving an overall precision of 81.26% and a recall of 60.18%. A further analysis of the recognition features is also given.
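A minimal sketch of this setup with the sklearn-crfsuite package; the feature function is a placeholder for the paper's medical-domain features, and the two training sentences are invented:

```python
# CRF-based medical entity recognition sketch (pip install sklearn-crfsuite).
import sklearn_crfsuite

def features(sent, i):
    """Placeholder feature function; the paper's domain features go here."""
    w = sent[i][0]
    return {"word": w, "is_first": i == 0,
            "prev": sent[i - 1][0] if i else "<s>"}

train = [
    [("胃炎", "B-DISEASE"), ("引起", "O"), ("腹痛", "B-SYMPTOM")],
    [("阿司匹林", "B-DRUG"), ("治疗", "O"), ("高血压", "B-DISEASE")],
]
X = [[features(s, i) for i in range(len(s))] for s in train]
y = [[tag for _, tag in s] for s in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```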
|
Knowledge Acquisition and Applications:
Oct. 13, 2015(15:40–17:00),
Treasure Palace Hotel-No.5 Meeting Room(二楼五号),
Chair: Guangyou ZHOU
|
15:40-16:00 |
Mengdi Zhang, Tao Huang, Yixin Cao and Lei Hou
ABSTRACT: Frequently Asked Questions (FAQ) answering in restricted domains has attracted increasing attention in various areas. FAQ answering is the task of automatically responding to users' typical questions within a specific domain. Most research uses NLP parsers to analyze user intention and employs ontologies to enrich domain knowledge. However, syntactic analysis performs poorly on short and informal FAQ questions, and external ontology knowledge bases for specific domains are usually unavailable and expensive to construct manually. We propose a semi-automatic domain-restricted FAQ answering framework, SDFA, that relies on no external resources. SDFA detects the targets of questions to assist both fast domain knowledge learning and answer retrieval. The framework has been successfully applied in a real project in the banking domain. Extensive experiments on two large datasets demonstrate the effectiveness and efficiency of the approach.
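A toy sketch of the target-then-retrieve idea: a trivial keyword matcher stands in for SDFA's target detection, and TF-IDF retrieval is restricted to FAQ entries sharing the detected target. All entries and targets are invented:

```python
# Target-filtered FAQ retrieval sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

FAQ = [
    {"q": "how do I reset my card PIN", "target": "card"},
    {"q": "what is the card annual fee", "target": "card"},
    {"q": "how to open a savings account", "target": "account"},
]
TARGETS = ["card", "account"]

def detect_target(q):
    """Keyword stand-in for the paper's target detection."""
    return next((t for t in TARGETS if t in q), None)

def answer(query):
    cands = [e for e in FAQ if e["target"] == detect_target(query)] or FAQ
    vec = TfidfVectorizer().fit([e["q"] for e in cands] + [query])
    sims = cosine_similarity(vec.transform([query]),
                             vec.transform([e["q"] for e in cands]))[0]
    return cands[sims.argmax()]["q"]

print(answer("reset PIN on my card"))   # matched within the 'card' slice only
```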
|
16:00-16:20 |
Weiming Lu, Zhenyu Zhang, Renjie Lou, Hao Dai, Shansong Yang and Baogang Wei
ABSTRACT: Web table understanding has recently attracted a number of studies. However, most work focuses on tables in English, because it usually needs the help of knowledge bases, and existing knowledge bases such as DBpedia, YAGO, Freebase and Probase mainly contain knowledge in English.
|
16:20-16:40 |
Weiming Lu, Renjie Lou, Hao Dai, Zhenyu Zhang, Shansong Yang and Baogang Wei
ABSTRACT: Taxonomy is an important component of knowledge bases, and constructing a Chinese taxonomy is an urgent, meaningful but challenging task. In this paper, we propose a taxonomy induction approach that builds a taxonomy from a Chinese encyclopedia using combinatorial optimization. First, subclass-of relations are derived by validating the relation between pairs of categories. Then, integer programming optimizations are applied to find instance-of relations from encyclopedia articles, taking the constraints among categories into account. The experimental results show that our approach can construct a practicable taxonomy from Chinese encyclopedias.
|
16:40-17:00 |
Weiwei Wang, Zhigang Wang, Liangming Pan, Yang Liu and Jiangtao Zhang
ABSTRACT: In this paper, we focus on extracting RDF triples from tables in Chinese encyclopedias. First, we construct a Chinese knowledge base through taxonomy mining and class attribute mining. Then, with the help of this knowledge base, we extract triples from tables through column scoring, table classification and RDF extraction. In our experiments, we applied our approach to 6,618,544 articles from Hudong Baike containing 764,292 tables, and extracted about 1,053,407 unique new RDF triples with an estimated accuracy of 90.2%, outperforming other similar works.
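A much-simplified sketch of the column-scoring and extraction steps, with a two-name set standing in for the mined knowledge base; the paper's actual scoring and table classification are richer:

```python
# Score each column by how many of its cells are known entities, pick the
# subject column, and emit one triple per remaining cell. The entity set and
# table are invented placeholders.
KNOWN_ENTITIES = {"刘德华", "张学友"}           # stand-in knowledge base

def subject_column(rows):
    def score(col):
        return sum(r[col] in KNOWN_ENTITIES for r in rows) / len(rows)
    return max(range(len(rows[0])), key=score)

def extract_triples(header, rows):
    subj = subject_column(rows)
    return [(r[subj], header[col], r[col])
            for r in rows for col in range(len(r)) if col != subj]

header = ["姓名", "出生年份", "职业"]
rows = [["刘德华", "1961", "演员"], ["张学友", "1961", "歌手"]]
for t in extract_triples(header, rows):
    print(t)        # ('刘德华', '出生年份', '1961') ...
```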
|
NLP Applications:
Oct. 13, 2015(15:40–17:00),
Treasure Palace Hotel-V18 Meeting Room(四楼V18),
Chair: Wenbin JIANG
|
15:40-16:00 |
Weizhi Ma, Min Zhang, Yiqun Liu, Shaoping Ma and Lingfeng Chen
ABSTRACT: Tags are used in many social media platforms, such as Delicious, Flickr, LinkedIn and Weibo. Previous work has made considerable efforts to exploit tags without identifying their different types. In this study, we argue that the tags in a user profile convey three different types of information: the basics (age, status, locality, etc.), interests, and the specialty of a person. Based on this novel user tag taxonomy, we propose a tag classification approach for Weibo that paints a clearer image of user profiles, making use of three categories of features: general statistics features (including user links with followers and followings), content features and syntactic features. Furthermore, unlike many previous tag studies that concentrate on user specialties, such as expert finding, we find that valuable information can be discovered from the basics and interests tags. We report interesting findings in two scenarios, profiling users from different generations and profiling areas with mass appeal, based on large-scale tag clustering and mining over 6 million distinct tags from 13 million Weibo users.
|
16:00-16:20 |
Caixia Yuan, Xiaojie Wang and Ziming Zhong
ABSTRACT: This paper presents a purely data-driven approach for generating natural language (NL) expressions from their corresponding semantic representations. Our aim is to exploit a parsing paradigm for the natural language generation (NLG) task: we first encode semantic representations with a situated probabilistic context-free grammar (PCFG), then decode and yield natural sentences at the leaves of the optimal parse tree. We deployed our system in two different domains, response generation for a Chinese spoken dialogue system and instruction generation for a virtual environment in English, obtaining results comparable to state-of-the-art systems in terms of both BLEU scores and human evaluation.
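The sketch below shows only the generative direction of a PCFG, sampling expansions top down from an invented toy grammar that maps a dialogue-act frame to words; the paper's system instead searches for the optimal parse of a semantic representation:

```python
# Top-down sampling from a toy PCFG. Grammar, symbols and probabilities are
# invented for illustration.
import random

PCFG = {   # lhs -> list of (rhs, probability)
    "S":      [(["INFORM"], 1.0)],
    "INFORM": [(["the", "flight", "departs", "at", "TIME"], 0.7),
               (["departure", "time", "is", "TIME"], 0.3)],
    "TIME":   [(["nine", "am"], 0.5), (["noon"], 0.5)],
}

def sample(symbol):
    if symbol not in PCFG:                  # terminal word
        return [symbol]
    rules, probs = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=probs)[0]
    return [w for s in rhs for w in sample(s)]

random.seed(1)
print(" ".join(sample("S")))
```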
|
16:20-16:40 |
Zhongping Liang, Caixia Yuan, Bing Leng and Xiaojie WANG
ABSTRACT: This paper focuses on recognizing person relations indicated by predicates in large-scale free text. To determine whether a sentence contains a potential relation between persons, we cast the problem as a classification task and improve a Dynamic Convolutional Neural Network (DCNN) for it, using frame convolution to exploit more features efficiently. Experimental results on Chinese person relation recognition show that the proposed model is superior to the original DCNN and several strong baseline models. We also explore employing large-scale unlabeled data to achieve further improvements.
|
16:40-17:00 |
Wang Baoxin, Zheng Dequan, Wang Xiaoxue, Zhao Shanshan and Zhao Tiejun
ABSTRACT: This paper proposes a method to compute textual entailment strength, aimed at the phenomenon of a long text entailing a short one, taking multiple-choice questions from everyday life and study as the research object since they have clear candidate answers. In the absence of large-scale question-answer data, two methods are used to answer college entrance examination geography multiple-choice questions over the Wikipedia Chinese Corpus: one based on sentence similarity and the other based on the proposed textual entailment measure. The proposed method reaches an accuracy of 36.93%, 2.44% higher than the approach based on word-embedding sentence similarity and 7.66% higher than the approach based on Vector Space Model sentence similarity, confirming the effectiveness of the textual entailment based method.
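One plausible reading of "entailment strength" is directional lexical coverage, sketched below with exact-match similarity standing in for word embeddings; this is an illustration, not the paper's formula:

```python
# Entailment strength as how well the long text covers the short text.
def entailment_strength(long_text, short_text, sim=None):
    sim = sim or (lambda a, b: 1.0 if a == b else 0.0)   # embedding stand-in
    long_words = long_text.split()
    short_words = short_text.split()
    return sum(max(sim(w, lw) for lw in long_words)
               for w in short_words) / len(short_words)

# Toy geography question: pick the option most entailed by the passage.
passage = "季风 气候 夏季 高温 多雨 冬季 温和 少雨"
options = {"A": "夏季 高温 多雨", "B": "全年 炎热 干燥"}
best = max(options, key=lambda k: entailment_strength(passage, options[k]))
print(best)   # 'A': the passage entails option A more strongly
```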
|
15:40-16:00 |
Ma Chunping and Chen Wenliang
ABSTRACT: Recommender systems have emerged as a powerful e-business tool to narrow the gap between customers and providers. Traditional recommender algorithms predict a user's preference for an item from the user's historical behavior. Thanks to the rapid development of the Internet, more and more users share their experience and wisdom through online reviews, which have become a hot research topic for recommender systems. This paper proposes an approach that mines user opinions to build a personalized model for each user and item. Experimental results on a real dataset show that the proposed approach improves the accuracy of rating prediction.
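A minimal sketch of the general direction, shifting a classic bias-based rating predictor by an opinion score mined from the user's review of the item; the lexicon and interpolation weight are invented:

```python
# Opinion-adjusted rating prediction sketch.
OPINION = {"great": 1.0, "good": 0.5, "poor": -1.0}   # toy opinion lexicon

def opinion_score(review):
    hits = [OPINION[w] for w in review.lower().split() if w in OPINION]
    return sum(hits) / len(hits) if hits else 0.0

def predict(global_mean, user_bias, item_bias, review=None, weight=0.5):
    """Classic bias model shifted by the mined opinion of the review text."""
    base = global_mean + user_bias + item_bias
    return base + weight * opinion_score(review or "")

# user tends to rate low (-0.3), item is popular (+0.4), review is positive
print(predict(3.5, -0.3, 0.4, "great battery and good screen"))  # 3.975
```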
|