CCF Technical Committee on Chinese Information Technology (TCCI): 2013 Annual Conference, Detailed Program

NLP&CC 2013 Detailed Conference Program

Sunday, Nov 17, 2013
10:00–20:30	Junhao Hotel, first-floor lobby: Registration
17:00–20:00	Grand Ballroom: Buffet Dinner
19:00–21:30	Junhao Haojing Hall: TCCI Business Meeting
Monday, Nov 18, 2013
07:30–08:45	Junhao Hotel, first-floor lobby: Registration
08:30–08:50	Junhao Grand Ballroom: Opening
08:50–09:50	Invited Talk: Dr. Weiying MA, Knowledge Mining and Semantic Search
	Junhao Grand Ballroom
	Session Chair: 赵东岩
	Abstract: Today’s search engines operate primarily on terms and string matching. While this term-based paradigm has been quite successful over the past 15 years, it has started to show many limitations given the rapid evolution of the Web. The Web today contains not only static documents but also dynamic information about real-world entities. It is becoming a digital copy of the world, capturing data from people, products, services, locations, and even objects. This new Web requires a new paradigm that can directly fulfill people’s information needs and empower them with knowledge. In this talk, I will introduce a new entity-based search paradigm and various knowledge mining techniques for constructing knowledge graphs from different types of data. I will show how we can use this knowledge to understand queries, enable semantic matching, and provide direct answers to natural language queries.
	Keynote Speaker: Dr. Wei-Ying Ma is an Assistant Managing Director at Microsoft Research Asia, where he oversees multiple research groups including Web Search and Mining, Natural Language Computing, Data Management and Analytics, and Internet Economics and Computational Advertising. He and his team of researchers have developed many key technologies that have been transferred to Microsoft’s Online Services Division, including the Bing search engine and Microsoft Advertising. He has published more than 250 papers at international conferences and in journals. He is a Fellow of the IEEE and a Distinguished Scientist of the ACM. He currently serves on the editorial board of ACM Transactions on Information Systems (TOIS) and is a member of the International World Wide Web (WWW) Conferences Steering Committee. In recent years, he has served as program co-chair of WWW 2008 and as general co-chair of ACM SIGIR 2011.
	He received a Bachelor of Science in electrical engineering from the National Tsing Hua University in Taiwan in 1990. He earned both a Master of Science and a doctorate in electrical and computer engineering from the University of California at Santa Barbara in 1994 and 1997, respectively. More information about him can be found at http://research.microsoft.com/en-us/people/wyma/.
09:50–10:05	Junhao Hotel main entrance: Group Photo
10:05–10:20	Junhao Grand Ballroom: Coffee/Tea Break
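10:20–11:20	Invited Talk: Prof. Changning HUANG, Domestic Chinese Treebanks Need to Strengthen the Predicate-Argument Structure Description
	Junhao Grand Ballroom
	Session Chair: 周明
	Abstract: Building on the Penn English Treebank, the University of Pennsylvania constructed the Penn Chinese Treebank (CTB) from Xinhua newswire text in 1998-2000. From the very beginning, CTB made the description of predicate-argument structure a major annotation goal. Guided by X-bar theory, it also adopted a hierarchical structure in which the bracket pair or subtree dominated by each phrase node expresses only one abstract grammatical relation. By contrast, the Chinese treebanks built in China generally append internal structure information (such as subject-predicate, verb-object, or adverbial-head) to phrase labels (such as NP, VP, PP), and some even mark the head position of a phrase with an integer or other symbol; yet none of them clearly treats the description of a sentence's predicate-argument structure as a major goal of treebank construction. Starting from the general X-bar schema for phrase structure, this talk explains what contemporary formal syntax means by the terms complement, adjunct, and specifier, and introduces the use of empty categories and coreference indices in syntactic structure. It then uses CTB examples to show how a treebank describes predicate-argument structure. That is the first half of the talk; the second half briefly reviews the thirty-year history of Chinese information processing in China and its prospects. I am also glad to have the opportunity to discuss my teaching and research experience with fellow attendees and graduate students.
	Keynote Speaker: Prof. Changning Huang (黄昌宁) graduated from Tsinghua University in 1961 with a major in automatic control theory and stayed on to teach. He served successively as associate professor, professor, and doctoral supervisor in Tsinghua's Department of Computer Science and Technology, working for many years on teaching and research in artificial intelligence and computational linguistics. In April 1999 he joined Microsoft Research Asia as a senior researcher and the first research manager of the Natural Language Computing group. After retiring in April 2004, he continued as a senior advisor to Microsoft Research Asia. His research interests include automatic word segmentation, part-of-speech tagging, syntactic-semantic analysis, and machine translation.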
11:20–12:20	Junhao Grand Ballroom: Best Papers
12:20–14:00	Junhao Hotel, basement-level buffet restaurant: Lunch
14:00–15:20	Junhao Grand Ballroom	Junhao Haojing Hall	Junhao Haoxin Hall	Junhao Haoshi Hall
	Fundamentals 1	Machine Translation 1	Evaluation Workshop 1	Open Fund & Expo
15:20–15:50	Coffee/Tea Break
15:50–17:10	Junhao Grand Ballroom	Junhao Haojing Hall	Junhao Haoxin Hall	Junhao Haoshi Hall
	Machine Learning for NLP	Information Retrieval	Evaluation Workshop 2	Open Fund & Expo
18:30–20:30	Junhao Grand Ballroom: Poster/Demo Presentations and Banquet, Innovation Demo (technology showcase)
Tuesday, Nov 19, 2013
08:30–09:30	Invited Talk: Dr. Xiaojin ZHU, Some Mathematical Models to Turn Social Media into Knowledge
	Junhao Grand Ballroom
	Session Chair: 周国栋
	Abstract: Social media data mining opens up many interesting research questions, whose answers correspond to elegant mathematical models that go beyond traditional NLP techniques. In this talk we present two examples: estimating intensity from counts and identifying the most chatty users. In both examples, naive heuristic methods do not take full advantage of the data. In contrast, there are mathematical models with provable properties that better extract knowledge from social media: the inhomogeneous Poisson process in the first case and the multi-armed bandit in the second.
	Keynote Speaker: Dr. Xiaojin Zhu (朱晓瑾) is an Associate Professor in the Department of Computer Sciences at the University of Wisconsin-Madison, with affiliate appointments in the Departments of Electrical and Computer Engineering and Psychology. Dr. Zhu received his B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1993 and 1996, respectively, and a Ph.D. degree in Language Technologies from Carnegie Mellon University in 2005. He was a research staff member at IBM China Research Laboratory from 1996 to 1998. Dr. Zhu received the National Science Foundation CAREER Award in 2010. His research interest is in machine learning, with applications in natural language processing, cognitive science, and social media. http://pages.cs.wisc.edu/~jerryzhu/.
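09:30–10:30	Invited Talk: Dr. Min ZHANG, Document and Discourse-based Machine Translation
	Junhao Grand Ballroom
	Session Chair: 李涓子
	Abstract: Current machine translation methods translate a document sentence by sentence, without considering any discourse or document-level information. In this talk, I will briefly review the state of the art in discourse research from both linguistic and computational viewpoints, and then focus on how machine translation can benefit from document and discourse information. Finally, I will conclude with a discussion of future directions.
	Keynote Speaker: Prof. Min Zhang (张民) is a professor at Soochow University. He received his Ph.D. from Harbin Institute of Technology in 1997 and was a postdoctoral researcher at the Korea Advanced Institute of Science and Technology (KAIST) from 1997 to 1999. He was a researcher at Lernout & Hauspie Asia Pacific (Singapore) from 1999 to 2001 and a senior research manager at Infotalk Technology (Singapore) from 2001 to 2003. From 2003 to 2013 he was a researcher at the Institute for Infocomm Research in Singapore, where he led the statistical machine translation group. In recent years he has published more than 150 papers in top international journals and conferences. His main academic service roles include vice president of the Chinese and Oriental Languages Information Processing Society in Singapore, executive board member of the Asian Federation of Natural Language Processing, member of the international advisory committee of the Pacific Asia Conference on Language, Information and Computation series, editor-in-chief of the International Journal of Asian Language Processing, and general chair or program chair of numerous international conferences. His research interests include machine translation, natural language processing, and Internet-based processing of very large-scale text data.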
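10:30–10:50	Junhao Grand Ballroom: Coffee/Tea Break
10:50–12:20	Panel: Big Data: Opportunities and Challenges for NLP
	Junhao Grand Ballroom
	Moderator: 宗成庆
	Panel Info: Opportunities and Challenges for Natural Language Processing in the Big Data Era (overview of the NLP&CC 2013 expert panel). In recent years, with the rapid development of the Internet and mobile communication technology, a range of societal needs built on modern communications, such as information security and personalized information services, have naturally become a focus of attention. New terms such as cloud computing, big data, social computing, and data mining have sprung up everywhere, pervading every technical field and becoming the topic of many forums and conferences. Yet once we brush away the surface dust, part the dazzling clouds, and stop to think carefully, questions remain. Will cloud computing be the lifeline that solves natural language understanding? What is the fundamental challenge that the big data era poses to natural language processing? Has research on statistical NLP methods made substantive progress over the past ten years? Is natural language understanding indispensable to social computing? Is Chinese semantics computable? These questions stand before us as successive barriers to processing massive text data, and breaking through each one involves countless difficult tasks. For this reason, this year's NLP&CC holds a panel discussion on the theme "Opportunities and Challenges for Natural Language Processing in the Big Data Era". Well-known experts from China and abroad are invited as panelists to discuss these and other questions of common interest in depth. The panelists will present their views from different perspectives based on their own experience, and attendees can discuss, and even debate, with them face to face. We believe that this open and unreserved discussion will offer all attendees new research ideas and inspiration, and bring new vitality to natural language processing and Chinese computing research in China. The invited panelists are as follows (ordered by the pronunciation of their names): 杜小勇 (Renmin University of China); 史晓东 (Xiamen University); 王海峰 (Baidu); 俞栋 (Microsoft Research); 朱晓瑾 (University of Wisconsin-Madison).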
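12:20–14:00	Junhao Hotel, basement-level buffet restaurant: Lunch
14:00–15:20	Junhao Haojing Hall	Junhao Haoxin Hall	Junhao Haoshi Hall	Chongqing University, Main Teaching Building, Room 114
	CIT Applications 1	Machine Translation 2	Fundamentals 2	NLP&CC Campus Open Day
15:20–15:50	Coffee/Tea Break
15:50–17:10	Junhao Haojing Hall	Junhao Haoxin Hall	Junhao Haoshi Hall	Chongqing University, Main Teaching Building, Room 114
	NLP for Social Networks	Web Mining & QA	CIT Applications 2	NLP&CC Campus Open Day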

Detailed Schedule

Best Paper. Time: morning of Nov 18, 2013 (11:20–12:20); Junhao Grand Ballroom; Chair: 周明
11:20-11:50	Text Window Denoising Autoencoder: Building Deep Architecture for Chinese Word Segmentation. Ke Wu; Zhiqiang Gao; Cheng Peng; Xiao Wen. Abstract: Deep learning is the new frontier of machine learning research, which has led to many recent breakthroughs in English natural language processing. However, there are inherent differences between Chinese and English, and little work has been done to apply deep learning techniques to Chinese natural language processing. In this paper, we propose a deep neural network model, the text window denoising autoencoder, as well as a complete pre-training solution, as a new way to solve classical Chinese natural language processing problems. This method does not require any linguistic knowledge or manual feature design, and can be applied to various Chinese natural language processing tasks, such as Chinese word segmentation. On the PKU dataset of the Chinese word segmentation bakeoff 2005, applying this method decreases the F1 error rate by 11.9% for deep neural network based models. To the best of our knowledge, we are the first to apply deep learning methods to Chinese word segmentation.
11:50-12:20	Understanding Temporal Intent of User Query based on Time-based Query Classification. Pengjie REN; Zhumin CHEN; Xiaomeng SONG; Bin LI; Haopeng YANG; Jun MA. Abstract: Web queries are time sensitive, which implies that a user’s intent for information changes over time. Recognizing the temporal intents behind user queries is crucial for improving the performance of search engines. However, to the best of our knowledge, this problem has not been studied in existing work. In this paper, we propose a time-based query classification approach to understand a user’s temporal intent automatically. We first analyze the shared features of queries’ temporal intent distributions. Then, we present a query taxonomy that groups queries according to their temporal intents. Finally, for a new query, we propose a machine learning method to decide its class in terms of its search frequency over time as recorded in Web query logs. Experiments demonstrate that our approach can understand users’ temporal intents effectively.
Fundamentals 1. Time: afternoon of Nov 18, 2013 (14:00–15:20); Junhao Grand Ballroom; Chair: 余正涛
14:00-14:20	Sentence Compression Based on ILP Decoding Method. Hongling WANG; Yonglei ZHANG; Guodong ZHOU. Abstract: With the tremendous increase in information, people's demand for information has advanced the development of Natural Language Processing (NLP). As a consequence, sentence compression, an important part of automatic summarization, is drawing much more attention. Sentence compression has been widely used in automatic title generation, search engines, topic detection, and summarization. Under the framework of a discriminative model, this paper presents a decoding method based on Integer Linear Programming (ILP), which treats sentence compression as the selection of the optimal compressed target sentence. Experimental results show that the ILP-based system maintains a good compression ratio while retaining the main information of the source sentence. Compared with other decoding methods, this method is faster and uses fewer features while obtaining similar results.
14:20-14:40	Exploring Multiple Chinese Word Segmentation Results Based on Linear Model. Chen Su; Yujie Zhang; Zhen Guo; Jinan Xu. Abstract: In the process of developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large amounts of training text often decreases because of unknown words. The lack of a domain-specific annotated corpus makes supervised learning approaches unable to adapt to a target domain. This problem results in many errors in translation knowledge extraction and therefore seriously lowers translation quality. To solve the domain adaptation problem, we implement Chinese word segmentation by exploring n-gram statistical features in a large raw Chinese corpus and by bilingually motivated Chinese word segmentation, respectively. Moreover, we propose a method of combining multiple Chinese word segmentation results based on a linear model to augment domain adaptation. For evaluation, we conduct experiments on Chinese word segmentation and Chinese-English machine translation using the data of the NTCIR-10 Chinese-English patent task. The experimental results show that the proposed method achieves improvements in both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
14:40-15:00	Chinese Argument Extraction Based on Trigger Mapping. Yuan Huang; Peifeng Li; Qiaoming Zhu. Abstract: Unlike English, Chinese sentences do not have a strict syntactic structure and ellipsis is a common phenomenon, which weakens the effectiveness of syntactic structure in argument extraction. In Chinese event extraction, many arguments cannot be extracted from the sentence successfully because of the loose connection between the nominal trigger and its arguments. This paper puts forward a novel argument extraction approach based on trigger mapping. It maps the nominal trigger to its predicate and uses it as a key to extract syntactic features for classification. Experimental results on the ACE 2005 Chinese Corpus show that our approach achieves a clear improvement in F1-measure for argument identification and role determination.
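15:00-15:20	Event Relation Recognition Based on Constructing Inference Cues (基于推理线索构建的事件关系识别方法). 马彬; 洪宇; 杨雪蓉; 姚建民; 朱巧明. Abstract: Event relation recognition is a natural language processing technique that determines the relations between events in text streams. Its core task is to take the event as the basic semantic unit and, by analyzing the discourse structure and semantic features of events, perform shallow detection of logical relations between events (that is, decide whether any two events are semantically related). Exploiting the distribution within a topic of event elements (event core words and entities) and the semantic dependency patterns of those elements as the topic evolves, this paper proposes an event relation recognition method based on constructing inference cues. Experiments show that, compared with event relation recognition based on core-word and entity inference, the proposed method improves the F-measure by 9.57%.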
Machine Translation 1. Time: afternoon of Nov 18, 2013 (14:00–15:20); Junhao Haojing Hall; Chair: 史晓东
14:00-14:20	A Simple, Fast Strategy for Weighted Alignment Hypergraph. Zhaopeng TU; Jun XIE; Yajuan LV; Qun LIU. Abstract: Weighted alignment hypergraph [4] is potentially useful for statistical machine translation, because it is the first study to simultaneously exploit the compact representation and fertility model of word alignment. Since estimating the probabilities of rules extracted from hypergraphs is an NP-complete problem, they propose a divide-and-conquer strategy that decomposes a hypergraph into a set of independent subhypergraphs. However, they employ Bull’s algorithm to enumerate all consistent alignments for each rule in each subhypergraph, which is very time-consuming, especially for rules that contain non-terminals. This limits the applicability of the method to syntax translation models, whose rules contain many non-terminals (e.g. SCFG rules). In response to this problem, we propose an inside-outside algorithm to efficiently enumerate the consistent alignments. Experimental results show that our method is twice as fast as Bull’s algorithm. In addition, the efficient dynamic programming algorithm makes our approach applicable to syntax-based translation models.
14:20-14:40	An Efficient Framework to Extract Parallel Units from Comparable Data. Lu XIANG; Yu ZHOU; Chengqing ZONG. Abstract: Since the quality of statistical machine translation (SMT) depends heavily on the size and quality of training data, many approaches have been proposed for automatically mining bilingual text from comparable corpora. However, existing solutions are restricted to extracting either bilingual sentences or sub-sentential fragments. Instead, we present an efficient framework that extracts both sentential and sub-sentential units. At the sentential level, we treat parallel sentence identification as a classification problem and extract more representative and effective features. At the sub-sentential level, we borrow the idea of phrase table acquisition in SMT to extract parallel fragments. A novel word alignment model is specially designed for comparable sentence pairs, and parallel fragments are extracted based on this word alignment. We integrate the extraction tasks at the two levels into a unified framework. Experimental results on SMT show that the baseline SMT system achieves significant improvement when the additionally mined knowledge is added.
14:40-15:00	Collective Corpus Weighting and Phrase Scoring for SMT using Graph-based Random Walk. Lei CUI; Dongdong ZHANG; Shujie LIU; Mu LI; Ming ZHOU. Abstract: Data quality is one of the key factors in Statistical Machine Translation (SMT). Previous research addressed the data quality problem in SMT by corpus weighting or phrase scoring, but these two types of methods were often investigated independently. To leverage the dependencies between them, we propose an intuitive approach to improve translation modeling by collective corpus weighting and phrase scoring. The method uses the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. An effective graph-based random walk is designed to estimate the quality of sentence pairs and phrase pairs simultaneously. Extensive experimental results show that our method improves performance significantly and consistently in several Chinese-to-English translation tasks.
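15:00-15:20	Pruning Statistical Machine Translation Models Based on Translation Logs (基于翻译日志的统计机器翻译模型剪枝). 刘凯; 吕雅娟; 姜文斌; 刘群. Abstract: The translation models used in current statistical machine translation are generally very large, which raises the operating cost of machine translation services and restricts the scenarios in which machine translation can be applied. In this paper, we propose a method for pruning statistical machine translation models based on translation logs. The method filters translation rules by how often each rule is hit in the translation logs, keeping the smallest rule table that the current translation model needs. Experiments show that the method matches the translation quality of the original model while keeping only 1-3% of its translation rules.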
Evaluation Workshop 1. Time: afternoon of Nov 18, 2013 (14:00–15:20); Junhao Haoxin Hall; Chair: 万小军
14:00-14:15	Linking Entities in Short Texts based on a Chinese Semantic Knowledge Base. Yi ZENG; Dongsheng WANG; Tielin ZHANG; Hao WANG; Hongwei HAO. Abstract: Populating an existing knowledge base with new facts is important for keeping it fresh and up to date. Before new knowledge is imported into the knowledge base, entity linking is required so that the entities in the new knowledge can be linked to the entities in the knowledge base. During this process, entity disambiguation is the most challenging task. There have been many studies that tackle the name ambiguity problem with a variety of algorithms. In this paper, we propose an entity linking method based on Chinese Semantic Knowledge, in which entity disambiguation is addressed by retrieving a variety of semantic relations and analyzing the corresponding documents with similarity measurement. Based on the proposed method, we developed CASIA_EL, a system for linking entities with knowledge bases. We validate the proposed method by linking 1232 entities mined from Sina Weibo to a Chinese semantic knowledge base, resulting in an accuracy of 88.5%. The results show that the CASIA_EL system and the proposed algorithm are potentially effective.
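14:15-14:30	The Role of Semantic Features in Opinion Target Extraction and Polarity Classification (语义特征在评价对象抽取与极性判定中的作用). 周红照; 侯明午; 颜彭莉; 张叶青; 侯敏; 滕永林. Abstract: In research on opinion target extraction and polarity classification, semantic features have long been “absent”. This paper proposes seven classes of semantic features related to opinion target extraction (opinion trigger words, opinion-cancelling words, opinion-target insulating words, backward-pointing verbs, forward-pointing verbs, psychological verbs, and evaluative nouns that point to an attributive) and five classes related to polarity classification (commendatory nouns, derogatory nouns, semantic-shift nouns, measurement adjectives, and semantic constructions). It discusses both why semantic features need to be introduced and how to use them. Experiments show that introducing semantic features helps improve the accuracy of opinion target extraction and polarity classification.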
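14:30-14:45	Emotion Recognition and Classification for Chinese Microblogs Based on Emotion Factors (基于情绪因子的中文微博情绪识别与分类). 张晶; 朱波; 梁琳; 侯敏; 滕永林. Abstract: Emotion factors are linguistic forms used to express emotion and attitude. They comprise five means of expression: emotion words, emotion phrases, emotion expressions, punctuation marks, and emoticons. This study builds an emotion lexicon from the commonly used emotion words and emotion phrases among the emotion factors and, for special forms of emotional expression, builds an emotion rule base that takes into account the role of punctuation marks and emoticons in emotion analysis. By matching and scoring against the emotion lexicon and emotion rules, the system recognizes and classifies emotions in Chinese microblogs. It achieved good results in the Chinese microblog emotion analysis evaluation of the 2nd CCF Conference on Natural Language Processing and Chinese Computing (2013), and the test results demonstrate the effectiveness of the method.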
14:45-15:00	A Mixed Model for Cross Lingual Opinion Analysis. Lin Gui; Ruifeng Xu; Jun Xu; Li Yuan; Yuanlin Yao; Jiyun Zhou; Qiaoyun Qiu; Shuwei Wang; Kam-Fai Wong; Ricky Cheung. Abstract: The performance of machine learning based opinion analysis systems is often hampered by insufficient opinion training corpora. This problem is more serious for resource-poor languages. Thus, the cross-lingual opinion analysis (CLOA) technique, which leverages opinion resources in one (source) language to improve opinion analysis in another (target) language, is attracting more research interest. Currently, the transfer learning based CLOA approach sometimes overfits a single language resource, while the co-training based CLOA approach achieves only limited improvement during bilingual decision. To address these problems, we propose a mixed CLOA model, which estimates the confidence of each monolingual opinion analysis system from its training errors through bilingual transfer self-training and co-training, respectively. Using the weighted average distances between samples and classification hyperplanes as the confidence, the opinion polarity of test samples is classified. Evaluations on the NLP&CC 2013 CLOA bakeoff dataset show that this approach achieves the best performance, outperforming transfer learning and co-training based approaches.
15:00-15:15	Entity Linking from Microblogs to Knowledge Base Using ListNet Algorithm. Yan WANG; Cheng LUO; Xin LI; Yiqun LIU; Min ZHANG; Shaoping MA. Abstract: Entity Linking (EL) is a fundamental technology in Natural Language Processing and Knowledge Engineering. Previous work mainly focuses on linking mentioned names recognized in news or articles to a knowledge base. However, in social networks, user-generated content is quite different from typical news text. Users sometimes use words more informally and even create new words. One entity may have different aliases mentioned by web users, so identifying these aliases calls for more attention than before. Several methods are proposed to mine aliases, and a learning-to-rank framework is applied to combine different types of features. A binary classifier based on SVM is trained to judge whether the top candidate given by the ranking algorithm is accepted. The evaluation results of the NLP&CC 2013 Entity Linking Track show the effectiveness of this framework.
Open Fund & Expo. Time: afternoon of Nov 18, 2013 (14:00–17:10); Junhao Haoshi Hall; Chairs: 赵东岩, 汤帜
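14:00-17:10	Open fund project CCF2012-01-03: Research on Development Methods for Test-Question Editing Controls (杨绪兵)
	Open fund project CCF2012-01-05: A Cloud Service Platform for Chinese Character Glyph Computing (杨玉星)
	Open fund project CCF2012-01-06: Aesthetic Evaluation of Chinese Character Glyphs (李伟)
	Open fund project CCF2012-02-01: Collation of Original Tangut Glyphs and Corpus Construction (柳长青)
	Open fund project CCF2012-02-01: Research on Automatic Proofreading Methods for Tibetan (珠杰)
	Open fund project CCF2012-02-02: Analyzing Research Hotspots and Trends in Chinese Scientific Literature with a Keyword-Optimized LDA Model (李保利)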
Machine Learning for NLP. Time: afternoon of Nov 18, 2013 (15:50–17:10); Junhao Grand Ballroom; Chair: 林鸿飞
15:50-16:10	Semi-supervised Text Categorization by Considering Sufficiency and Diversity. Shoushan LI; Sophia LEE; Wei GAO; Chu-Ren HUANG. Abstract: In text categorization (TC), labeled data is often limited while unlabeled data is ample. This motivates semi-supervised learning for TC to improve the performance by exploring the knowledge in both labeled and unlabeled data. In this paper, we propose a novel bootstrapping approach to semi-supervised TC. First of all, we give two basic preferences, i.e., sufficiency and diversity for a possibly successful bootstrapping. After carefully considering the diversity preference, we modify the traditional bootstrapping algorithm by training the involved classifiers with random feature subspaces instead of the whole feature space. Moreover, we further improve the random feature subspace-based bootstrapping with some constraints on the subspace generation to better satisfy the diversity preference. Experimental evaluation shows the effectiveness of our modified bootstrapping approach in both topic and sentiment-based TC tasks.
16:10-16:30	Incorporating Entities in News Topic Modeling. Linmei HU; Juanzi LI; Zhihui LI; Chao SHAO. Abstract: News articles express information by concentrating on named entities such as the who, when, and where of the news. However, extracting the relationships among entities, words, and topics from a large number of news articles is nontrivial. Topic models such as Latent Dirichlet Allocation have been widely applied to mine hidden topics in text analysis and have achieved considerable performance. However, they cannot explicitly show the relationship between words and entities. In this paper, we propose a generative model, the Entity-Centered Topic Model (ECTM), to summarize the correlation among entities, words, and topics by taking an entity topic as a mixture of word topics. Experiments on real news data sets show that our model achieves lower perplexity and better entity clustering than the state-of-the-art entity topic model (CorrLDA2). We also present an analysis of the ECTM results and further compare it with CorrLDA2.
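16:30-16:50	Pronoun Anaphora Resolution Based on Deep Learning (基于Deep Learning的代词指代消解). Xuefeng XI; Guodong ZHOU. Abstract: Anaphora resolution has long been a core problem in natural language processing. Earlier systems cast anaphora resolution as a classification problem and built it on machine learning. Although performance improved considerably, the traditional machine learning methods they use are mostly shallow architectures with a single layer of nonlinear transformation. Their limitation is that, with limited samples and computing units, they have limited capacity to represent complex functions, which constrains their generalization on complex problems. This paper proposes a semantic-feature-based anaphora resolution method that uses the deep learning mechanism of a DBN (deep belief nets) model. The DBN model consists of several layers of unsupervised RBM (restricted Boltzmann machine) networks and one layer of supervised BP (back-propagation) network. The RBM layers ensure that the feature-vector mapping is optimal, and the final BP layer classifies the feature vectors output by the RBM layers, thereby training the anaphora resolution classifier. Tests on the ACE04 English corpus and the ACE05 Chinese corpus show that adding RBM training layers improves system performance. In addition, introducing abstract layering of the feature set also has a positive effect on system performance.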
16:50-17:10	Discriminative Latent Variable Based Classifier for Translation Error Detection. Jinhua Du; Junbo Guo; Fei Zhao. Abstract: This paper presents a discriminative latent variable model (DPLVM) based classifier for improving translation error detection performance in statistical machine translation (SMT). It uses latent variables to carry additional information that may not be expressed by the original labels and to capture more complicated dependencies between translation errors and their corresponding features, improving the classification performance. Specifically, we first detail the mathematical representation of the proposed DPLVM method, and then introduce the features, namely word posterior probabilities (WPP), linguistic features, and syntactic features. Finally, we compare the proposed method with MaxEnt and SVM classifiers to verify its effectiveness. Experimental results show that the proposed DPLVM-based classifier reduces the classification error rate (CER) by a relative 1.75%, 1.69%, and 2.61% compared to the MaxEnt classifier, and by a relative 0.17%, 0.91%, and 2.12% compared to the SVM classifier, over three different feature combinations.
Information Retrieval. Time: afternoon of Nov 18, 2013 (15:50–17:10); Junhao Haojing Hall; Chair: 马军
15:50-16:15	Improve Web Search Diversification with Intent Subtopic Mining. Aymeric DAMIEN; Min ZHANG; Yiqun LIU; Shaoping MA. Abstract: A number of search user behavior studies show that queries with unclear intents are commonly submitted to search engines. Result diversification is usually adopted to deal with those queries, in which the search engine trades off some relevance for some diversity to improve user experience. In this work, we aim to improve the performance of search result diversification by generating an intent subtopic list through the fusion of multiple resources. Our approach rests on the idea that, to collect a large panel of intent subtopics, we should also consider a wide range of resources from which to extract them. The resources adopted cover a large panel of sources, such as external resources (Wikipedia, Google Keywords Generator, Google Insights, search engine query suggestion and completion), anchor texts, page snippets, and more. We selected resources to cover both the information seeker (what a user is searching for) and the information provider (the websites) aspects. We also propose an efficient Bayesian optimization approach to maximize resource selection performance, and a new technique to cluster subtopics based on the snippet information of the top results and the Jaccard similarity coefficient. Experiments based on the TREC 2012 web track and the NTCIR-10 intent task show that our framework can greatly improve diversity while keeping good precision. The system developed with the proposed techniques also achieved the best English subtopic mining performance in the NTCIR-10 intent task.
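16:15-16:40	Personalized News Recommendation Based on Ontology Structure (基于本体结构的新闻个性化推荐). 饶俊阳; 贾爱霞; 冯岩松; 赵东岩. Abstract: Content-based filtering, one of the main approaches in recommendation, is widely applied to news recommendation. To model news and users better, we introduce a semantic similarity model into the content-based recommender system to mine the semantic associations between the two. We first propose an ontology-structure-based similarity model (OBSM), which uses an ontology built from an online encyclopedia to compute the semantic similarity between news items and users. To reduce the effect of noisy data in the ontology on recommendation quality, we propose the X-Ontology clustering algorithm to clean the ontology, which yields an upgraded model, X-OBSM. Experiments on Chinese and English data show that OBSM and X-OBSM recommend better than the baseline models. In particular, X-OBSM, which cleans the ontology, is computationally more efficient than OBSM.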
16:40-17:10	A Fast Matching Method Based on Semantic Similarity for Short Texts. Jiaming XU; Pengcheng LIU; Gaowei WU; Zhengya SUN; Hongwei HAO; Bo XU. Abstract: With the emergence of various social media, short texts, such as weibos and instant messages, are very prevalent on today’s websites. To mine semantically similar information from massive data, a fast and efficient matching method for short texts has become an urgent need. However, conventional matching methods suffer from data sparsity in short documents. In this paper, we propose a novel matching method, referred to as semantically similar hashing (SSHash). The basic idea of SSHash is to train a topic model directly from the corpus rather than from documents, and then project texts into hash codes using latent features. The major advantages of SSHash are that 1) it alleviates the sparsity problem in short texts, because we obtain the latent features from the whole corpus rather than at the document level; and 2) it can accomplish similarity matching in interactive real time by introducing a hash method. We carry out extensive experiments on real-world short texts. The results demonstrate that our method significantly outperforms baseline methods on several evaluation metrics.
Evaluation Workshop 2. Time: afternoon of Nov 18, 2013 (15:50–17:10); Junhao Haoxin Hall; Chair: 冯岩松
15:50-16:10	Overview of the evaluation tasks
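16:10-16:25	Opinion Element Extraction from Chinese Microblogs (中文微博观点要素抽取研究). 丁晟春; 孟美任; 李霄. Abstract: This paper studies the extraction of opinion elements from Chinese microblogs. We run extraction experiments with a CRF model and analyze in depth how part-of-speech, ontology, and sentiment features affect the extraction results. Using the syntactic structure of sentences, we also design several groups of compound position-attribute feature templates for the sentiment features of opinion targets in microblogs. The method was applied in the NLP&CC 2013 evaluation on opinion element extraction from Chinese microblogs, whose corpus contains comments on hot topics, films, TV series, and more. The results were strong: among 21 participating organizations, the system ranked 1st under lenient evaluation and 3rd under strict evaluation.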
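16:25-16:40	Entity Linking for Chinese Microblogs (中文微博实体链接研究). 朱敏; 贾真; 左玲; 吴安俊; 陈方正; 柏玉. Abstract: For the Chinese microblog entity linking task of the 2013 CCF Conference on Natural Language Processing and Chinese Computing (NLP&CC 2013), we use the Sina Weibo data provided by CCF as training and test data, and the 耶宝智慧 Chinese word segmentation platform of Southwest Jiaotong University as the natural language preprocessing tool. We propose an entity linking method that looks up entities in the knowledge base and a synonym table, and applies an improved pinyin edit distance algorithm and suffix-word-table matching. For disambiguation, we combine entity clustering with disambiguation of same-type entities based on Baidu Baike term frequency. In the 2013 CCF Chinese microblog entity linking evaluation, the method achieved an accuracy of 0.8838, ranking third among 10 participating teams. The results show that the method is effective and robust to noise in the text.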
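16:40-16:55	Entity Linking for Chinese Microblogs Based on the Vector Space Model (基于向量空间模型的中文微博实体链接). 吴泳钢. Abstract: Zhengzhou University took part in the NLP&CC 2013 Chinese microblog entity linking evaluation. Our approach extracts named entities from Baidu Baike, then uses the context around each named entity to extract core terms for entity disambiguation and linking. The final micro-averaged accuracy in the evaluation was 0.8172.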
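16:55-17:10	A Multi-Strategy Study of Linking Microblog Entities to Encyclopedia Entries (微博实体与百科条目链接的多策略研究). 郭云龙; 徐潇; 向宇; 曾维刚; 李莉. Abstract: With the rise of Web 2.0, social networks and the data they generate have grown dramatically, and using social media content to build and extend knowledge bases has become a research hotspot. We build a knowledge base using a purpose-written crawler together with the structural redirect rules of encyclopedia pages and Internet vocabulary. Using the contextual relations and tag information of microblog content, we study the task with a multi-strategy, multi-level matching algorithm that applies machine learning and statistics to extract connections between microblogs. The accuracy in the evaluation was 84.99%; after improvements it rose to 88.38%.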
Poster/Demo Presentations and Banquet. Time: evening of Nov 18, 2013 (18:30–20:30); Junhao Grand Ballroom; Chair: 万小军
18:30-20:30	Poster 1: Grey Relational Analysis for Query Expansion. Junjie Zou; Zhengtao Yu; Huanyun Zong; Jianyi Guo; Lei Su. Abstract: To address the one-sidedness of the various qualitative expansion methods, we propose a query term selection method based on Grey Relational Analysis (GRA), which we call the fusion expansion technique with GRA (FET-GRA). It calculates the weight of each expansion term under varied qualitative expansions, combines these into a comprehensive weight by FET-GRA, and extracts expansion terms according to that weight. Experimental results on a TREC dataset show that FET-GRA is substantially superior to TFIDF, Mutual Information, and Local Context Analysis.
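18:30-20:30	Poster 2: Research on Chinese Microblog Entity Linking (中文微博实体链接研究). 朱敏; 贾真; 左玲; 吴安俊; 陈方正; 柏玉. Abstract: For the Chinese microblog entity linking task of the 2013 CCF Conference on Natural Language Processing and Chinese Computing (NLP&CC 2013), we use the Sina Weibo data provided by CCF as training and test data, and the Southwest Jiaotong University 耶宝智慧 Chinese word segmentation platform as the preprocessing tool. We propose an entity linking method that looks up entities in the knowledge base and a synonym table, and applies an improved pinyin edit distance algorithm and suffix word-list matching. For disambiguation, we combine entity clustering with disambiguation of same-type entities based on Baidu Baike term frequency. In the 2013 CCF Chinese microblog entity linking evaluation, the method achieved an accuracy of 0.8838 on correct results, ranking third among 10 participating teams. The results show that the method is effective and can cope with noise in the text.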
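18:30-20:30	Poster 3: Internet-Based Discrimination of Term Definitions (基于互联网的术语定义辨析). 吴瑞红; 吕学强. Abstract: Term definitions are an important part of terminology and the basis of research work, so their accuracy must be guaranteed. Now that anyone can create and edit information, terms increasingly have multiple definitions, and the standardization and accuracy of definitions cannot be guaranteed. For the multiple candidate definitions of a term, we propose, for the first time, a term definition discrimination model and an Internet-based method for solving it. The method builds reference explanations from Baidu Baike and Baidu Search, summarizes term definition templates from the source corpora of terms and their definitions, and selects the best definition from the candidates according to the reference explanations and the templates. In the experiments, the candidates are definitions, drawn from academic literature and reference books, of terms from selected domains of the CNKI concept knowledge element database. The results show that the method achieves a discrimination accuracy of 96.1%.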
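18:30-20:30	Poster 4: A Study of Recognition Units for Uyghur Large-Vocabulary Speech Recognition (维吾尔语大词汇语音识别系统识别单元研究). 努尔麦麦提.尤鲁瓦斯; 吾守尔.斯拉木; 热依曼.吐尔逊. Abstract: Uyghur is an agglutinative language, so words are not well suited as recognition units for Uyghur large-vocabulary continuous speech recognition (LVCSR). This paper studies the choice of recognition units in Uyghur LVCSR systems: we design sub-word recognition units better suited to Uyghur, propose a method for building combined units that mix words and sub-words, and evaluate the language model and speech recognition performance of word, sub-word, and combined units. Experimental results show that the proposed units perform better in terms of unit count and language model complexity, and reduce the word error rate by a relative 22% compared with the word-based system.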
18:30-20:30	Poster 5: Feature Analysis in Microblog Retrieval Based on Learning to Rank. Zhongyuan HAN; Xuwei LI; Muyun YANG; Haoliang QI; Sheng LI. Abstract: Learning to rank, which can fuse various features, performs well in microblog retrieval. However, it is still unclear how the features function in microblog ranking. To address this issue, this paper examines the contribution of each single feature, together with the contribution of feature combinations, via ranking SVM for microblog retrieval modeling. The experimental results on the TREC microblog collection show that textual features, i.e. content relevance between a query and a microblog, contribute most to retrieval performance. The combination of certain non-textual features with textual features can further enhance retrieval performance, though non-textual features alone produce rather weak results.
18:30-20:30	Poster 6: Opinion Sentence Extraction and Sentiment Analysis for Chinese Microblogs. Hanxiao SHI; Wei CHEN; Xiaojun LI. Abstract: Sentiment analysis of Chinese microblogs is important for research in public opinion supervision, personalized recommendation, and social computing. Working on the NLP&CC 2012 evaluation task, we implement two tasks: extracting opinion sentences and determining the sentiment orientation of microblogs. First, we manually label a sample of the microblog corpus supplied by the organizers and expand the sentiment lexicon with Internet sentiment words; second, we construct different feature sets based on an analysis of the characteristics of Chinese microblogs. Finally, we use an SVM classifier to train a model on the training corpus and predict labels for the test corpus. Evaluation results show that our approach performs well on both tasks.
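18:30-20:30	Poster 7: Recognition and Classification of Compound-Sentence Relation Words Based on the Tsinghua Treebank (基于清华树库的复句关系词识别与分类研究). 李艳翠; 孙静; 周国栋; 冯文贺. Abstract: Compound sentences account for a large share of written Chinese text, and handling the relations within them is one of the problems of Chinese information processing. Relation words play a very important role in marking the category of a Chinese compound sentence. This paper describes the annotation of the Tsinghua Chinese Treebank and, following its annotation scheme, uses rules to extract compound-sentence relation words and label their categories. We then use syntactic, lexical, and positional features from automatic and manual parse trees to recognize and classify relation words. Experiments show that relation word identification reaches an accuracy of 95.7%, and relation category classification reaches an F1 of 77.2%.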
18:30-20:30	Poster 8: A Time-Sensitive Model For Microblog Retrieval. Cunhui SHI; Bo Xu; Hongfei LIN; Qing Guo. Abstract: Microblogs, as a form of online communication, can generate large amounts of information in a very short period, so retrieving the latest relevant information has become a hot research area. Unlike traditional information retrieval (IR), microblog retrieval emphasizes the freshness of content. To address this, we extend traditional IR methods by taking posting time into account. We propose a time-sensitive retrieval model that treats the time factor as a prior probability. In the retrieval model, we introduce pseudo relevance feedback as a query expansion approach to improve retrieval performance. Furthermore, we introduce a strategy to filter the initial retrieval results that takes post quality factors into account, including entropy and link features. Experiments on a Twitter corpus show that our algorithm effectively improves retrieval performance and that the results meet real-time retrieval needs well.
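The abstract above treats posting time as a prior probability in the retrieval model but does not state the form of that prior. The sketch below assumes an exponential recency prior combined with a query-likelihood score in log space; the decay rate, the sample scores, and the names `time_prior` and `score` are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


def time_prior(age_hours: float, rate: float = 0.01) -> float:
    """Assumed exponential recency prior: newer posts receive a higher prior."""
    return rate * math.exp(-rate * age_hours)


def score(log_query_likelihood: float, age_hours: float, rate: float = 0.01) -> float:
    """Log of (query likelihood x time prior), used as the ranking score."""
    return log_query_likelihood + math.log(time_prior(age_hours, rate))


# Hypothetical posts: (id, log P(query | post), age in hours).
posts: List[Tuple[str, float, float]] = [
    ("old", -4.0, 240.0),
    ("fresh", -4.2, 2.0),
]
ranked = sorted(posts, key=lambda p: score(p[1], p[2]), reverse=True)
```

With these sample values the slightly less relevant but much newer post ranks first, which is the behaviour a time-sensitive prior is meant to produce.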
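18:30-20:30	Poster 9: A Maximum-Entropy-Based Method for Automatic Analysis of Chinese Discourse Structure (基于最大熵的汉语篇章结构自动分析方法). 涂眉; 周玉; 宗成庆. Abstract: Discourse analysis is important foundational work for many natural language processing tasks, with significant theoretical and practical value. Automatic analysis of Chinese discourse, however, remains difficult, and mature tools and methods are rare. On the Tsinghua Chinese Treebank, which is annotated with logical-semantic relations of compound sentences, we study automatic segmentation of Chinese discourse semantic units and automatic labeling of discourse relations. We compare different sequence labeling models on discourse semantic unit segmentation and propose a maximum-entropy-based method for Chinese discourse structure analysis. Experiments show that automatic segmentation of discourse semantic units reaches an F-score of 89.1%, and that discourse semantic relation labeling reaches an F-score of 63% when the discourse semantic structure tree is at most 6 levels high.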
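18:30-20:30	Poster 10: Multi-Feature Automatic Summarization Based on the Hidden Topic Markov Model (基于隐主题马尔科夫模型的多特征自动文摘). Jiangming LIU; Jinan XU; YUJIE ZHANG. Abstract: Based on the Hidden Topic Markov Model, this paper removes the topic independence assumption of the LDA topic model so that summary generation makes full use of the structural information of the document, and combines it with content-based multi-feature methods to improve summary quality. We also propose a strategy for extending from single-document to multi-document summarization without breaking the document structure, and finally implement a complete automatic summarization system. Experiments on the DUC2007 standard dataset demonstrate the advantages of the Hidden Topic Markov Model and the document features, and the ROUGE scores of the implemented system improve markedly.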
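18:30-20:30	Poster 11: Attribute and Attribute Value Extraction for Chinese Online Encyclopedias (面向中文网络百科的属性和属性值抽取). 贾真; 杨宇飞; 何大可; 刘胜久; 尹红风. Abstract: Categories, attributes, and the attribute values of instances are important components of a knowledge base, and the data produced by collaborating users in online encyclopedias contains a great deal of attribute and attribute-value knowledge. Using Chinese online encyclopedia entry articles as the data source, we propose an unsupervised method for extracting attributes and attribute values. Treating attribute values as named entities, we use frequent pattern mining and association analysis to extract category attributes from text; build trigger word lists for attributes by self-expansion; mine attribute value extraction patterns from attribute trigger words and entity annotations of attribute values; and use hierarchical clustering to obtain high-quality patterns. Experiments on a dataset collected from 互动百科 show that the method is feasible and effective.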
18:30-20:30	Poster 12: A comprehensive method for Text Summarization based on Latent Semantic Analysis. Yingjie WANG; Jun MA. Abstract: Text summarization aims to capture the most important content of a given document in condensed form while retaining the semantic information of the text to a large extent. It is considered an effective way of tackling information overload. Many text summarization approaches are based on Latent Semantic Analysis (LSA). However, none of the previous methods consider the term description of the topic. In this paper, we propose a comprehensive LSA-based text summarization algorithm that combines term description with sentence description for each topic. We also put forward a new way to create the term-by-sentence matrix. The effectiveness of our method is supported by experimental results: on summarization performance, our approach obtains higher ROUGE scores than several well-known methods.
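18:30-20:30	Poster 13: C-TERN: A CFSA-Based Algorithm for Processing Temporal Information in Military News Text (C-TERN：一种基于CFSA的军事新闻文本时间信息处理算法). 王伟; 赵东岩; 苏婷婷. Abstract: Temporal information plays a very important role in text analysis. In military applications, accurately recognizing temporal information effectively supports automatic acquisition of military intelligence, as well as time-oriented information extraction, retrieval, and question answering over military text. This paper proposes C-TERN, an algorithm based on cascaded finite-state automata (CFSA) for recognizing and normalizing temporal expressions in Chinese military text. C-TERN first uses a mature word segmentation tool to identify time words in the text, then divides the temporal expression rules extracted from general and military language into multiple layers and refines the recognition layer by layer. During normalization, four steps handle special temporal expressions, simple temporal expressions, time-span expressions, and absolute/relative temporal expressions in turn, performing inference and normalization for each. The algorithm pays attention to the correctness of the extracted rule set, the resolution of conflicts between rules, and the soundness of the matching strategy. Experiments on several datasets show that C-TERN not only recognizes standard, offset, and uncertain temporal expressions effectively, but also infers and normalizes simple, special, and implicit time points, time spans, and offset times, meeting the needs of temporal information processing for military text.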
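18:30-20:30	Poster 14: Recognizing Implicit Discourse Relations in Chinese (汉语隐式篇章关系识别). 孙静; 李艳翠; 周国栋; 冯文贺. Abstract: Discourse relation recognition, an important part of discourse structure analysis, can be divided into explicit and implicit relation recognition. This paper is a preliminary study of implicit discourse relation recognition in Chinese. Using lexical, contextual, and dependency tree features on a small manually annotated Chinese discourse corpus, we achieve an overall accuracy of 62.15%.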
18:30-20:30	Poster 15: Research of an Improved Algorithm for Chinese Word Segmentation Dictionary Based on Double-Array Trie-tree. Wenchuan Yang; Jian Liu; Miao Yu. Abstract: A Chinese word segmentation dictionary based on the Double-Array Trie Tree offers efficient search, but dynamic insertion consumes a lot of time. This paper presents an improved algorithm, iDAT, which is based on the Double-Array Trie Tree for Chinese word segmentation dictionaries. After initializing the original dictionary, we apply a hash process to the empty-sequence index values of the base array. The final hash table stores the sum of the empty sequences before the current empty sequence. The algorithm adopts the Sunday jump algorithm from single-pattern matching. With a slight and reasonable increase in space cost, iDAT reduces the average time complexity of dynamic insertion in the Trie Tree. Practical results show that it performs well in operation.
18:30-20:30	Poster 16: Study on Tibetan Word Segmentation as Syllable Tagging. Yachao Li; Hongzhi Yu. Abstract: Tibetan word segmentation (TWS) is a basic problem in Tibetan natural language processing. The paper reformulates segmentation as a syllable tagging problem and studies the performance of TWS with different sequence labeling models. Experimental results show that the TWS system with conditional random fields achieves the best performance with the current 4-tag set, while the other models also achieve good results. These results show that treating segmentation as syllable tagging is an effective approach to TWS.
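18:30-20:30	Poster 17: Cross-Lingual Sentiment Analysis Based on Syntactic Parsing (基于句法分析的跨语言情感分析). 陈强; 何炎祥; 刘续乐; 孙松涛; 牛菲菲; 罗楚威. Abstract: Given the growing volume of multilingual product reviews, cross-lingual sentiment analysis lays a foundation for better multilingual services and helps merchants track changes in user preferences. Research on cross-lingual sentiment analysis is still developing, and its role should not be overlooked. This paper uses a syntactic parsing model to split a sentence into several combined-word units, assigns different weights according to each unit's subject-predicate components and the strength of its sentiment words, and computes the sentiment distribution features of the sentence. The resulting feature parameters are used to train a classifier, which is then applied to classify the sentiment of the test corpus. Experiments show that the method achieves satisfactory classification accuracy, an improvement over existing methods. The approach can also be applied to identifying comparative sentences and determining the polarity of negated sentences.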
18:30-20:30	Poster 18: Simple Yet Effective Method for Entity Linking in Microblog-Genre Text. Qingliang MIAO; Huayu LU; Shu ZHANG; Yao MENG. Abstract: Semantic analysis of microblog data is a challenging, emerging research area. Unlike news text, microblogs pose several new challenges due to their short, noisy, contextualized, and real-time nature. In this paper, we investigate how to link entities in microblog posts with a knowledge base and adopt a cascade linking approach. In particular, we first use a mention expansion model to identify all possible entities in the knowledge base for a mention, based on a variety of sources. Then we link the mentions with the corresponding entities in the knowledge base by collectively considering lexical matching, popularity probability, and textual similarity.
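18:30-20:30	Poster 19: A Fine-Grained Sentiment Feature Extraction Method for Short Microblog Texts (面向微博短文本的细粒度情感特征抽取方法). 贺飞艳; 刘楠; 何炎祥; 刘建博; 彭敏. Abstract: Combining TF-IDF with a variance statistic, we propose a computation method for multi-class feature extraction. Polarity is determined first and fine-grained sentiment second, forming a fine-grained sentiment analysis pipeline that we apply to short microblog texts. The method's effectiveness was verified on the training corpus provided by the NLPCC2013 evaluation, and its use in the NLPCC2013 evaluation task confirmed that it extracts features well.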
18:30-20:30	Poster 20: Grammatical Phrase-level Opinion Target Extraction on Chinese Microblog Messages. Haochen ZHANG; Yiqun LIU; Min ZHANG; Shaoping MA. Abstract: Microblogging is one of the most widely used web applications. Weibo, a microblog service in China, produces plenty of opinionated messages every second. Sentiment analysis on Chinese Weibo impacts many aspects of business and politics. In this work, we address opinion target extraction, one of the most important aspects of sentiment analysis. We propose a unified approach that concentrates on phrase-level target extraction. We assume that a target is represented as a subgraph of the sentence's dependency tree and define the grammatical relations that point to the target word as TARRELs. We perform the extraction by classifying grammatical relations with a cost-sensitive classifier that improves performance on unbalanced data, and by identifying the target subgraph through connecting and recovering TARRELs. Then we prune noisy targets with empirically summarized rules. The evaluation results indicate that our approach is effective for phrase-level target extraction on Chinese microblog messages.
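18:30-20:30	Poster 21: Multi-Strategy Fine-Grained Emotion Analysis of Chinese Microblogs (多策略中文微博细粒度情绪分析研究). 欧阳纯萍; 阳小华; 雷龙艳; 徐强; 余颖; 刘志明. Abstract: Naive Bayes (NB), support vector machines (SVM), and k-nearest neighbors (KNN) are commonly used to analyze microblog emotion, and all perform well on coarse-grained emotion analysis. In practice, however, fine-grained emotion often better reveals users' attitudes. We therefore propose a fine-grained emotion analysis method for Chinese microblogs based on multi-strategy fusion. We first use Naive Bayes to classify microblogs as emotional or non-emotional, then build a 21-dimensional feature vector for the emotional microblogs, and finally apply SVM and KNN separately for fine-grained emotion analysis. Experiments on Sina Weibo show that the multi-strategy ensemble outperforms any single classifier, and that within the ensemble "NB+SVM" is slightly better than "NB+KNN".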
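18:30-20:30	Poster 22: Semi-Supervised Sentiment Classification Based on Social Networks (基于社会关系网络的半监督情感分类). 薛云霞; 李寿山; 王中卿. Abstract: Sentiment classification is a hot research problem in natural language processing. This paper studies semi-supervised sentiment classification, which uses unlabeled samples to improve performance when only a small number of labeled samples are available. A key idea is to build connections between labeled and unlabeled samples and propagate labels from the former to the latter. Previous work, however, considered only the text of reviews when building these connections. We observe that, because of social networks, reviews are also linked by social relations. We therefore propose a new semi-supervised method based on the social relations between samples: a bipartite graph model over document-word and social relations, with label propagation used to bring unlabeled samples into classifier training. Experiments show that semi-supervised sentiment classification with social network information clearly outperforms traditional semi-supervised methods that use review text alone.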
18:30-20:30	Poster 23: Research on Semantic-based Passive Transformation in Chinese-English Machine Translation. Wenfei CHANG; Zhiying LIU; Yaohong JIN. Abstract: Passive voice is widely used in English, and is even more prevalent in patent documents, while it is used less in Chinese. This difference requires us to transform the voice in Chinese-English machine translation to make the result smoother and more natural. Previous studies in this field are based on statistics, but their results are not very good. In this paper, we propose a strategy for Chinese-English passive voice transformation from a semantic perspective. By analyzing sentences, we summarize a series of transformation rules and then test them in our system. Experimental results show that the transformation rules achieve an overall accuracy of 89.1%.
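18:30-20:30	Poster 24: An Unsupervised Method for Automatically Locating Dialogue in Chinese Comics (一种无监督的中文漫画对白自动定位方法). Dong LIU; Luyuan LI; Yongtao WANG; Zhi TANG. Abstract: Automatically extracting dialogue from comic images not only helps display the text clearly on mobile devices, but also enables new mobile reading applications such as dialogue translation and audio reading. Based on the characteristics of Chinese comic images, we propose an unsupervised method for automatically locating dialogue, to meet the needs of mobile reading of Chinese comics. Unlike existing learning-based methods, it requires no training set and is robust. It has three main steps: 1) detect balloons (the blank regions enclosing text in a comic image) using their connectivity, and detect complete character pairs within the balloons; 2) cluster characters into rows or columns based on the consistency of character shape and typesetting rules, and extract font features; 3) combine font features across pages and use a Bayesian classifier to detect the remaining characters in multi-page comics. Experiments on 900 pages of comic images show that the method effectively locates dialogue regions in Chinese comic images, with fairly satisfactory results.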
18:30-20:30	Poster 25: The Spoken/Written Language Classification of English Sentences with Bilingual Information. Kuan LI; Zhongyang XIONG; Yufang ZHANG; Xiaohua LIU; Ming ZHOU; Guanghua ZHANG. Abstract: Chinese speakers often find it hard to tell spoken from written English, a distinction that matters for learning and using the language. To alleviate this problem, we propose to automatically classify English sentences into the two categories using bilingual information. Based on text categorization technology, we explore a variety of features, including words, statistics, and their combinations, and find that a classification accuracy of nearly 95% can be achieved in the open test with Chinese characters + sentence length + average syllable number, or other similar combinations.
18:30-20:30	Demo 1: Research on the opinion mining system for massive social media data. Lijun Zhao; Yingjie Ren; Ju Wang; Lingsheng Meng; Cunlu Zou. Abstract: The authors aim to discover valuable public opinions from massive social media data (comment data, social relation data, and location data collected from social websites such as Twitter) to help users make better decisions when shopping. To this end, the authors use microblog data collected from Sina as an example and propose an opinion mining system built on a distributed computing system (Hadoop and related projects). The system is designed for general use: it applies not only to restaurants but also to other fields such as electronic products.
18:30-20:30	Demo 2: Design and Implementation of News-Oriented Automatic Summarization System Based On Chinese RSS. Jie WANG; Jie MA; Yingjun LI. Abstract: Automatic summarization is an important research branch of natural language processing. An automatic summarizer should present information to users from different points of view for better understanding. Targeting the characteristics of news, we construct an automatic summarization system from two aspects: keywords and key sentences. A location factor is added to optimize the keyword extraction algorithm, and the key sentence extraction algorithm is improved by introducing keyword factors. On this basis, to address the existing problems of RSS, this paper builds a user-interest model. Finally, verification of feasibility and effectiveness shows that the system improves the accuracy and user experience of RSS feeds.
18:30-20:30	Demo 3: Chinese Calligraphy Synthesis Based on Style Imitation (基于风格模仿的中国书法合成). Wei LI. Abstract: TBC
18:30-20:30	Demo 4: Research on Digital Processing of Tangut Script (西夏文数字化处理研究). Changqing LIU. Abstract: TBC
CIT Applications 1. Time: afternoon of November 19, 2013 (14:00–15:20); Venue: 君豪-豪景厅; Chair: 杨沐昀
14:00-14:20	A Unified Framework for Emotional Elements Extraction based on Finite State Matching Machine. Yunzhi TAN; Yongfeng ZHANG; Min ZHANG; Yiqun LIU; Shaoping MA. Abstract: Traditional methods for sentiment analysis mainly focus on constructing emotional resources from the review corpus of specific areas, and use phrase matching technologies to build a list of product feature words and opinion words. These methods suffer from inadequate model scalability, low matching precision, and high redundancy. Besides, it is particularly difficult for them to deal with negative words. In this work, we design a unified framework based on a finite state matching machine to deal with the problems of emotional element extraction. The max-matching principle and negative word processing can be integrated into the framework naturally. In addition, the framework leverages rule-based methods to filter out illegitimate feature-opinion pairs. Compared with traditional methods, the framework achieves high accuracy and scalability in emotional element extraction. Experimental results show that the extraction accuracy is up to 84%, an increase of 20% compared with traditional phrase matching techniques.
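14:20-14:40	Topic-Oriented Automatic Generation of News Overview Reports (面向话题的新闻综述报告自动生成研究). Lu LU; Lei HOU; Lanshan ZHANG. Abstract: News overview reports help readers cope with the difficulty of reading massive collections of news documents. In-depth analysis of a news collection can surface its latent hot spots and viewpoints. Using the topics and entities of news events, their associations, and trend analysis, we build a model for news event analysis reports. The model describes a news event from multiple angles and, following the writing conventions of news overview reports, defines a framework for automatic report generation. The framework combines the topic-level and entity-level analysis results to automatically generate a news event analysis report with thorough analysis of viewpoints and vivid, accurate charts.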
14:40-15:00	Query Generation Techniques for Patent Prior-Art Search in Multiple Languages. Dong ZHOU; Jianxun LIU; Sanrong ZHANG. Abstract: Patent prior-art search is a necessary step to ensure that no previous similar disclosures were made before granting a patent. The task is to identify all relevant information which may invalidate the originality of a claim of a patent application. Using the whole patent or extracting highly indicative terms to form a query reduces the search burden on the user. To date, there have been no large-scale experiments conducted specifically to evaluate query generation techniques used in patent prior-art search in multiple languages. In this paper, we first introduce seven methods for generating patent queries for ranking. A large-scale experimental evaluation is then carried out on the CLEF-IP 2009 multilingual dataset in English, French, and German. We perform a detailed comparison of the different methods in terms of performance and efficiency, in addition to using full-length documents as queries in the patent search. The results show that some methods that work well in information retrieval in general fail to achieve the same effectiveness in patent search. Different methods demonstrate distinct performance w.r.t. query and document languages.
15:00-15:20	Automatic Assessment of Information Disclosure Quality in Chinese Annual Reports. Xin Ying Qiu; Shengyi Jiang; Kebin Deng. Abstract: Information disclosure in annual reports is a mandatory requirement for publicly traded companies in China. High-quality information disclosure reduces information asymmetry and therefore supports market efficiency. Currently, the evaluation of information disclosure quality in Chinese reports is conducted manually, and it remains an untapped field for the NLP and text mining community. The goal of this paper is to develop an automatic assessment system for information disclosure quality in Chinese annual reports. Our assessment system framework incorporates different technologies, including Chinese document modeling, Chinese readability index construction, and multi-class classification. Our explorative and systematic experimental results show that: 1) our automatic assessment system can produce solid predictive accuracy for disclosure quality, especially in the "excellent" and "fail" categories; 2) our system for Chinese annual report assessment achieves better predictive accuracy in certain respects than its counterparts for English annual report prediction; 3) our readability index for Chinese documents, as well as other findings from system performance, may help in better understanding the quality features of Chinese company annual reports.
Machine Translation 2. Time: afternoon of November 19, 2013 (14:00–15:20); Venue: 君豪-豪信厅; Chair: 吕雅娟
14:00-14:20	A Method to Construct Chinese-Japanese Named Entity Translation Equivalents Using Monolingual Corpora. Kuang RU; Jin'an XU; Yujie Zhang; Peihao WU. Abstract: Traditional methods for extracting Named Entity (NE) translation equivalents are often based on large-scale parallel or comparable corpora, but the practicability of the results is constrained by the relative scarcity of bilingual corpus resources. Combining the features of Chinese and Japanese, we propose a method to automatically extract Chinese-Japanese NE translation equivalents based on inductive learning from monolingual corpora. The method uses a comparison table of Chinese Hanzi and Japanese Kanji to calculate the similarity of NE instances between Japanese and Chinese. We then use inductive learning to obtain partial translation rules for NEs by extracting the differences between highly similar Chinese and Japanese NE instances. Finally, a feedback process refreshes the Chinese and Japanese NE similarity and the translation rule sets. Experimental results show that the proposed method is simple and efficient, and overcomes the dependency of traditional methods on bilingual resources.
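14:20-14:40	Japanese Temporal Expression Recognition and Japanese-Chinese Translation (日语时间表达式识别与日汉翻译研究). 赵紫玉; 徐金安; 张玉洁; 刘江鸣. Abstract: Based on a custom knowledge base, we adopt a Japanese temporal expression recognition method that combines a knowledge-base-strengthened rule set with a statistical model. Following the fine-grained classification of temporal expressions in the Timex2 standard and the characteristics of Japanese time words, we progressively extend and restructure the Japanese temporal expression knowledge base and update the rule set derived from it, aiming to keep improving recognition precision. We also incorporate a CRF statistical model to improve the generalization of recognition. In addition, we examine how accurately a phrase-based translation model translates time words, and argue for combining statistical machine translation (SMT) with rules to translate Japanese time words. Experiments show that Japanese temporal expression recognition reaches an F1 of 0.8987 in the open test, and that translation based on a Japanese-Chinese parallel dictionary of time words plus rules achieves slightly higher precision and recall than the statistical machine translation model.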
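14:40-15:00	Document-Level Automatic Evaluation of Machine Translation Based on Weighted Lexical Cohesion (基于加权词汇衔接的文档级机器翻译自动评价). 贡正仙; 李良友. Abstract: Lexical cohesion is an important feature of discourse. Building on the existing document-level lexical cohesion metric LC, we propose a weighted LC, called WLC, which obtains word weights by running the PageRank algorithm on the document's word graph. We further use part-of-speech information to bias PageRank toward particular words, yielding the PWLC method. Experiments show that at the document level both proposed methods correlate better with human judgments than LC, and that after these two methods are integrated, the document-level evaluation performance of BLEU and TER improves significantly.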
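The abstract above obtains word weights by running PageRank on a document word graph. The sketch below shows standard PageRank by power iteration on a tiny hypothetical graph; the graph, the damping factor of 0.85, and the iteration count are assumptions, and the part-of-speech bias used by PWLC is not modeled.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Standard PageRank by power iteration; every neighbour must be a key of graph."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical word graph: an edge links words that are lexically related.
word_graph = {
    "translation": ["machine", "quality", "document"],
    "machine": ["translation"],
    "quality": ["translation"],
    "document": ["translation"],
}
weights = pagerank(word_graph)
```

In this toy graph the word connected to every other word receives the largest weight, which illustrates how a graph-based weighting can emphasize words central to a document's cohesion.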
15:00-15:20	Pseudo In-Domain Data Selection from Large-Scale Web Corpus for Spoken Language Translation. Shixiang Lu; Xingyuan Peng; Zhenbiao Chen; Bo Xu. Abstract: This paper explores efficient domain adaptation for statistical machine translation, based on extracting sentence pairs (pseudo in-domain subcorpora that are most relevant to the in-domain corpora) from a large-scale general-domain web bilingual corpus. These sentences are selected by our proposed unsupervised phrase-based data selection model. Compared with traditional bag-of-words models, our phrase-based data selection model is more effective because it captures contextual information by modeling the selection of a phrase as a whole, rather than the selection of single words in isolation. These pseudo in-domain subcorpora can then be used to train a small domain-adapted spoken language translation system that outperforms the system trained on the entire corpus, with an increase of 1.6 BLEU points. Performance is further improved when we use the pseudo in-domain corpus/models in combination with the true in-domain corpus/model, with increases of 4.5 and 3.9 BLEU points over the single in-domain and general-domain baseline systems, respectively.
Fundamentals 2. Time: afternoon of November 19, 2013 (14:00–15:20); Venue: 君豪-豪仕厅; Chair: 荀恩东
14:00-14:25	Chinese Negation and Speculation Detection with Conditional Random Fields. Zhancheng Chen; Bowei Zou; Qiaoming Zhu; Peifeng Li. Abstract: Negative and speculative expressions are common in natural language. Recently, negation and speculation detection has become an important task in the computational linguistics community. However, there is little related research on Chinese negation and speculation detection. In this paper, a supervised machine learning method with conditional random fields (CRFs) is proposed to detect negative and speculative information in scientific literature. This paper also evaluates the effectiveness of each feature under the character-based and word-based frameworks, as well as the combination of features. Experimental results show that the single-word feature and the part-of-speech feature are effective, and that the combined features improve performance the most. Our sentence-level Chinese negation and speculation detection system achieves accuracies of 94.70% and 87.10%, respectively.
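14:25-14:50	Methods and System Design for Automatic Proofreading of Tibetan Text (藏文文本自动校对方法及系统设计). 珠杰; 李天瑞. Abstract: This paper studies Tibetan syllable spell checking, checking of Tibetan transliterations of Sanskrit, checking of continuation (接续) relations, and word checking, and proposes a framework for automatic proofreading of Tibetan text together with an algorithm for checking continuation relations. Based on the framework and algorithm, we design and implement a Tibetan automatic proofreading system, and experiments demonstrate the reliability and effectiveness of the algorithm and the system.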
14:50-15:20	Language Model for Cyrillic Mongolian to Traditional Mongolian Conversion. Feilong Bao; Guanglai Gao; Xueliang Yan; Hongwei Wang. Abstract: Traditional Mongolian and Cyrillic Mongolian are both Mongolian languages, used in China and Mongolia respectively. Their pronunciation is similar, but their written forms are completely different. A large share of Cyrillic Mongolian words have more than one correspondence in Traditional Mongolian, which makes conversion from Cyrillic Mongolian to Traditional Mongolian a hard problem. To overcome this difficulty, this paper proposes a language-model-based approach that takes advantage of context information. Experimental results show that, for Cyrillic Mongolian words with multiple correspondences in Traditional Mongolian, the accuracy of this approach reaches 87.66%, thereby greatly improving overall system performance.
NLP for Social Networks. Time: afternoon of November 19, 2013 (15:50–17:10); Venue: 君豪-豪景厅; Chair: 何婷婷
15:50-16:10	Research on Building Family Networks Based on Bootstrapping and Coreference Resolution. Jinghang GU; Yanan HU; Longhua QIAN; Qiaoming ZHU. Abstract: The personal family network is an important component of social networks, so extracting personal family relationships is of great importance. We propose a novel method to construct personal families based on bootstrapping and coreference resolution on top of a search engine. It begins with seeds of personal relations to discover relational patterns in a bootstrapping fashion; personal relations are then extracted via these learned patterns; finally, family networks are fused using cross-document coreference resolution. Experimental results on the large-scale Gigaword corpus show that our method can build accurate family networks, thereby laying a foundation for social network analysis.
16:10-16:30	Learning Sentence Representation for Emotion Classification on Microblogs. Duyu TANG; Bing QIN; Ting LIU; Zhenghua LI. Abstract: This paper studies the emotion classification task on microblogs. Given a message, we classify its emotion as happy, sad, angry, or surprised. Existing methods mostly use the bag-of-words representation or manually designed features to train supervised or distant supervision models. However, manual feature engineering is time-consuming and insufficient to capture the complex linguistic phenomena on microblogs. In this study, to overcome these problems, we use pseudo-labeled data, which has been extensively explored for distant supervision learning and for training language models in Twitter sentiment analysis, to learn the sentence representation through the Deep Belief Network algorithm. Experimental results in the supervised learning framework show that, using the pseudo-labeled data, the representation learned by the Deep Belief Network outperforms representations based on Principal Components Analysis and Latent Dirichlet Allocation. Incorporating the Deep Belief Network based representation into basic features further improves performance.
16:30-16:50	Every Term Has Sentiment: Learning from Emoticon Evidences for Chinese Microblog Sentiment Analysis. Fei JIANG; Anqi CUI; Yiqun LIU; Min ZHANG; Shaoping MA. Abstract: Chinese microblogs are a popular Internet social medium where users express their sentiments and opinions. But sentiment analysis on Chinese microblogs is difficult: the lack of sentiment polarity labels restricts many supervised algorithms, and out-of-vocabulary words and emoticons extend sentiment expressions beyond traditional sentiment lexicons. In this paper, emoticons in Chinese microblog messages are used as annotations to automatically label noisy corpora and construct sentiment lexicons. Features, including microblog-specific and sentiment-related ones, are introduced for sentiment classification. These sentiment signals are useful for Chinese microblog sentiment analysis. Evaluations on a balanced dataset show an accuracy of 63.9% in three-class sentiment classification (positive, negative, and neutral). The features mined from the Chinese microblogs also improve performance.
16:50-17:10	Active Learning for Cross-Lingual Sentiment Classification. Shoushan LI; Rong WANG; Huanhuan LIU; Chu-Ren HUANG. Abstract: Cross-lingual sentiment classification aims to predict the sentiment orientation of a text in one language (the target language) with the help of resources from another language (the source language). However, current cross-lingual performance is normally far from satisfactory due to the huge differences in linguistic expression and social culture. In this paper, we propose performing active learning for cross-lingual sentiment classification, where only a small number of samples are actively selected and manually annotated to achieve reasonable performance in a short time for the target language. The challenge is that there are normally many more labeled samples in the source language than in the target language. The small number of labeled target-language samples is thus flooded by the abundance of labeled source-language samples, which largely reduces their impact on cross-lingual sentiment classification. To address this issue, we propose a data quality control approach that selects high-quality samples from the source language. Specifically, we propose two kinds of data quality measurements, intra- and extra-quality measurements, from the certainty and similarity perspectives. Empirical studies verify the appropriateness of our active learning approach to cross-lingual sentiment classification.
Web Mining & QA. Time: afternoon of November 19, 2013 (15:50–17:10); Venue: 君豪-豪信厅; Chair: 秦兵
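15:50-16:10	Automatic Understanding of Chinese Natural Language Questions Based on a Knowledge Base (基于知识库的中文自然语言问句的自动理解). Kun XU; Yansong FENG; Dongyan ZHAO; Liwei CHEN; Lei ZOU. Abstract: With the rapid growth of web data, acquiring, organizing, and using large-scale structured data has become a hot topic in information processing. In particular, for query understanding over structured encyclopedic knowledge bases, there is a gap between users' natural language questions and structured queries (such as SPARQL). We therefore propose a framework for converting natural language questions into structured queries. Starting from the syntactic structure of the question, the method uses a set of heuristics to identify entities and relations, uses a corpus to build a mapping from entities to the knowledge base, disambiguates predicates, and then produces a machine-readable structured query. We extracted 42 questions from Baidu Zhidao across three categories (person, location, organization) as a standard test set. Experiments show that the framework effectively converts Chinese natural language questions into structured queries, laying a foundation for next-generation intelligent question answering systems.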
16:10-16:30	A Hybrid Approach for Extending Ontology from Text. Wei HE; Shuang LI; Xiaoping Yang. Abstract: Ontology is applied in various fields of computing as a conceptual modeling tool, and is used to organize information and manage knowledge. Ontology extension, which adds new concepts and relationships to an existing ontology, is a more complex task. In this paper, we propose a hybrid approach to ontology extension from text using semantic relatedness between words. The approach exploits co-occurrence analysis, word filtering, and semantic relatedness between words to find potential concepts in text, denoted as the extended concepts. We then use extension rules and subsumption analysis to find the relationships between concepts, which are used to add the extended concepts to the existing ontology. Improved recall, precision, and F1-measure are presented and used to evaluate the proposed method. Experimental results show that the proposed method is more reasonable and promising, with stronger competitiveness and applicability.
16:30-16:50	Expanding User Features with Social Relationships in Social Recommender Systems Chengjie SUN; Lei LIN; Yuan CHEN; Bingquan LIU Abstract: Although recommender systems have been studied for many years, research on social recommender systems is just beginning. Plenty of information in social networks can be used to improve the performance of recommender systems. However, some of this information is very sparse when used as features. We call this the feature sparsity problem. In this paper, we aim to solve the feature sparsity problem. We propose a new strategy that expands user features through social relationships. Experiments on two real-world datasets demonstrate that our method can significantly improve recommendation performance.
16:50-17:10	Simulated Spoken Dialogue System Based on IOHMM with User History Changliang LI; Bo XU; Xiuying WANG; Wendong GE; Hongwei HAO Abstract: Expanding corpora is very important in designing a spoken dialogue system (SDS). In this big data era, data is expensive to collect and annotations are rare. Researchers have done much work on expanding corpora, most of it rule-based. This paper presents a probabilistic method to simulate dialogues between human and machine so as to expand a small corpus with more varied simulated dialogue acts. The method employs an input/output HMM with user history (UH-IOHMM) to learn system and user dialogue behavior. In addition, this paper compares the method with a simulation system based on the standard IOHMM. We perform experiments using the WDC-ICA corpus, an annotated weather-domain corpus. The experimental results show that the method presented in this paper can produce high-quality dialogue acts that are similar to real dialogue acts.
CIT Applications 2 Time: Nov 19, 2013, afternoon (15:50–17:10); 君豪-豪仕厅; Chair: 徐金安
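15:50-16:10	中文电子文档的数学公式定位研究 (Research on Mathematical Formula Localization in Chinese Electronic Documents) Xiaoyan LIN; Liangcai GAO; Zhi TANG Abstract: In contrast to traditional formula localization methods based on images and Western-language documents, and targeting the characteristics of Chinese electronic documents, this paper proposes a method that combines machine learning and rules to locate both isolated formulas and embedded formulas. Within it, we design a page line-segmentation strategy and word-block partitioning rules suited to Chinese documents; select formula features and machine learning algorithms suited to Chinese documents; and propose post-processing steps such as line merging and word-block merging to address over-segmentation in formula localization. Experimental results show that the method can effectively and automatically locate formula regions in Chinese electronic documents. In addition, a publicly available Chinese dataset was built to facilitate comparison and performance evaluation across different formula localization methods.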
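16:10-16:30	基于语义构件的甲骨文字库自动生成技术研究 (Research on Automatic Generation of an Oracle Bone Script Font Library Based on Semantic Components) 吴琴霞; 栗青生; 高峰 Abstract: Based on an analysis of how oracle bone characters are composed, and addressing the script's highly variable glyph shapes and numerous variant forms, this paper proposes a method for automatically generating an oracle bone script font library based on semantic components. Built on a dynamic description library, the method first extracts the component feature information of oracle bone characters algorithmically; it then recombines stroke elements into semantic components and attaches feature descriptions to them to build a component knowledge base. Arbitrary oracle bone characters are generated automatically by reusing semantic components through affine transformations. Experiments show that the method effectively solves the problem of inputting oracle bone script without a font library, and can also support oracle bone character encoding, component statistics, and the interpretation of undeciphered characters.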
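16:30-16:50	基于特征加权的汉字点笔画生成研究 (Research on Generating the Dot Stroke of Chinese Characters Based on Feature Weighting) 栗青生; 熊晶; 吴琴霞; 杨玉星 Abstract: Designing Chinese character glyphs is arduous work. Unlike English typeface design, where designing the twenty-six letters is enough to form a typeface, each Chinese font library requires designing thousands upon thousands of characters, which is very hard for anyone other than artists and Chinese character experts. Addressing these difficulties in designing and developing Chinese glyphs, this paper proposes a glyph description method and a glyph generation method based on feature-point abstraction, and studies how feature points, feature expressions, feature-point weights, and weight vectors are applied in glyph generation. Taking the dot stroke as an example, we design a generation algorithm for the dot stroke of Chinese characters and run verification experiments. The experimental results demonstrate the reliability and practicality of the algorithm, which offers a solution for generating other strokes and can substantially improve the efficiency of Chinese glyph design.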
16:50-17:10	Structure-based Web Access Method for Ancient Chinese Characters Xiaoqing LU; Yingmin TANG; Zhi TANG; Yujun GAO; Jianguo ZHANG Abstract: How to preserve and make use of ancient Chinese characters is not only a mission for contemporary scientists but also a technical challenge. This paper proposes a feasible solution to enable character collection, management, and access on the Internet. Its advantage lies in a unified representation for encoded and uncoded characters, which provides a visually convenient and efficient retrieval method that does not require new users to have any prior knowledge about ancient Chinese characters. We also design a system suitable for describing the relationships between ancient Chinese characters and contemporary ones. As a result of the implementation, a website has been established for public access to ancient Chinese characters.

Innovation Demo 技术成果展示

Innovation Demo 技术成果展示 Time: evening of Nov 17 to noon of Nov 19, 2013; 君豪-大宴会厅
Exhibitor	Exhibit	Booth / Contact	Description
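明博教育科技有限公司	优课数字化教学应用系统 (Youke Digital Teaching Application System)	Booth A / 扈超	“优课数字化教学应用系统V3” is the first teaching application system in China built around licensed textbook content, positioned as a tool and service platform for synchronized, IT-enabled classroom teaching in basic education. The system consists of a cloud service platform and three client applications for teachers, students, and institutional administrators. With a large body of teaching resources and rich client applications, it gives end users end-to-end support for teaching applications, classroom interaction, and resource management and sharing, and lets educational institutions quickly build an intelligent, efficient, open, and easy-to-use teaching platform.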
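四川省计算机研究院	四川移动导游系统 (Sichuan Mobile Tour Guide System)	Booth B / 刘营	The Sichuan Mobile Tour Guide System lets visitors tour all of Sichuan with a single device in hand. Using modern information technology and mobile phone technology, it offers GPS (GIS) location-based delivery of travel information; audio, video, image, and text commentary on attractions; attraction browsing and search; and electronic map guidance, with support for multiple languages. The system serves visitors' needs at every stage: planning food, lodging, transport, and sightseeing before a trip; guide commentary, nearby search, and route navigation during the trip; and mail-ordering travel products and posting travel tips after the trip. It covers all scenic areas in Sichuan rated 4A or above, as well as distinctive Sichuan scenic areas and agritourism farmhouses (农家乐), and supports both an online guide mode and an offline reading mode.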
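福建星网锐捷网络有限公司		Booth C / 张向阳	Ruijie Networks (锐捷网络) is a leading brand of network solutions in China. Since its founding in January 2000, the company has followed its core business philosophy of “敏锐把握应用趋势，快捷满足客户需求” (keenly grasp application trends, quickly meet customer needs), using continuously innovated technologies and solutions to maximize the value of its users' network applications. Ruijie Networks has been named “中国企业网第一民族品牌” (China's No. 1 national brand in enterprise networking) for six consecutive years and ranks among the three largest suppliers in China's networking market. Today it has 3,000 employees, 38 branch offices, and 5 R&D centers (Fuzhou, Shanghai, Beijing, Chengdu, and Tianjin), with a sales and service network covering all of China and international markets including Southeast Asia, Europe, North and South America, and the Middle East. Its full range of products and solutions for IP networking, IP security, and IT operations management is widely used in IT projects for government, finance, education, healthcare, enterprises, and telecom operators in China and abroad. Ruijie Networks was the first in China to release a family of data center switches with full cloud computing features, making it a domestic leader in networking for cloud platforms. Its solutions are deployed in more than 1,950 universities and over 20,000 primary and secondary schools across China, and it has ranked first in market share in the education sector for seven consecutive years. Drawing on its end-to-end solution capability, Ruijie Networks has provided comprehensive network technical support for national key network projects including the Beijing Olympic Games, the Guangzhou Asian Games, the Shenzhen Universiade, coverage of the London Olympic Games, and the China Next Generation Internet (CNGI) demonstration project. On June 23, 2010, the parent group, 福建星网锐捷通讯股份公司, was listed on the Shenzhen Stock Exchange (stock code: 002396). The listing gives Ruijie Networks a broader platform and a more solid foundation for development.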
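微软亚洲研究院 (Microsoft Research Asia)	Natural language processing innovations from Microsoft Research Asia	Booth D / 周明	微软对联 (Microsoft Couplets, http://duilian.msra.cn): the world's first artificial-intelligence couplet system. Given the first line of a couplet, it automatically offers several candidate second lines for the user to choose from; once the user settles on a couplet, it can also generate several four-character horizontal scrolls (横批) for reference. 微软必应词典 (Bing Dictionary, http://cn.bing.com/dict/): a new-generation online dictionary developed by Microsoft Research Asia and Microsoft's first intelligent Chinese-English dictionary. It offers Chinese and English word and phrase lookup, automatic translation, explanations of new internet words, text-to-speech reading, and other features to help with learning and writing English. Backed by Microsoft's technology, it promptly discovers and adds emerging internet vocabulary so that the dictionary stays current. 微软云输入法 (Microsoft cloud input method, http://bing.msn.cn/pinyin/): a new-generation input method developed by Microsoft Research Asia. It mines the web for the latest vocabulary and corpora, rapidly updates its language and translation models, and provides input as a cloud service. It also integrates the Bing search experience and provides an open platform for input method applications, further raising the intelligence of the input method. Its interface is clean, with no ads and no plug-ins, and input remains smooth even on lower-performance computers. It is currently used in Bing search and in the Windows operating system. 微软英库问答 (Microsoft Engkoo QA): a leading general-domain question answering system developed by Microsoft Research Asia. Given a user's question, it performs deep understanding of the natural language question, gathers candidate answers and evidence from knowledge bases, the internet, and Q&A communities, and, through automated reasoning, answer ranking, and confidence estimation, finally provides a precise answer. Engkoo QA can be widely used in scenarios such as natural language search, human-computer interfaces, business intelligence, and voice assistants.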
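北京网感至察科技有限公司	The TML Text Mining Programming Language and Its Applications	Booth E / 李佳静	TML (Text Mining Language) is a general-purpose text mining programming language that aims to give users a simple, general way to analyze and compute over the various semantic targets in text. To this end we designed the language's syntax, compiler, virtual machine, and graphical development and debugging environment, so that users can easily program for any application domain to specify the targets, scope, and methods of text mining analysis; user code is then compiled to bytecode and executed efficiently in the virtual machine. TML efficiently implements many practical text analysis techniques, including web crawling, text extraction, word segmentation, part-of-speech tagging, named entity extraction, text classification, sentiment analysis, and concept and relation extraction. These techniques appear in the TML syntax as operators and reserved words. Using TML, we have built knowledge bases in three areas (purchase intent, brand reputation, and competitive intelligence) and, on that basis, launched cloud services for marketing, reputation, and intelligence aimed at enterprise customers.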
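哈尔滨工业大学 (Harbin Institute of Technology)	Results from the HIT Research Center for Social Computing and Information Retrieval	Booth F / 秦兵	The research center's work spans six areas: language analysis, information extraction, social networks, user analysis, question answering, and sentiment analysis. Supported by a number of national projects and industry collaborations, the center has built technology platforms including the Language Technology Platform, open-domain information extraction, discourse semantic analysis, and social media analysis. Among them, the “微博情绪指数平台” (Weibo Emotion Index Platform) developed by our lab collects and analyzes Weibo data in real time, tracks emotion trends across China's provinces, and labels the event words that trigger those emotions. It provides necessary technical support for breaking-event detection and emotion monitoring in public opinion analysis, and for user needs analysis and product recommendation in e-commerce.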
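数字出版技术国家重点实验室（北大方正集团有限公司） (State Key Laboratory of Digital Publishing Technology, Founder Group)	Universal Document (CEBX) Technology and Its Applications	Booth G / 汤帜	CEBX is a new-generation universal document technology that combines fixed-layout information with structured, reflowable information. It addresses the problems that device diversity causes in digital publishing: a single document can be read on PCs, mobile phones, tablets, e-readers, and other devices, so that content is produced once and reused across platforms. A document can be displayed or printed exactly as in the original layout, and can also achieve high-quality screen adaptation and real-time typesetting on mobile devices. Its fidelity to the original layout, dynamic interaction, and other features also give CEBX wide application in document storage, office automation, electronic medical records, e-schoolbags (电子书包), and other industries and fields. “中华数字书苑”, a flagship product developed by 方正阿帕比 (Apabi) using CEBX technology, has been presented several times by China's state leaders as a national gift to overseas institutions including the University of Cambridge in the UK, the University of Leuven in Belgium, and the Berlin State Library in Germany.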
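华为技术有限公司 (Huawei Technologies Co., Ltd.)	Huawei Big Data Storage Helps CERN Meet the Exabyte-Scale Data Challenge	Booth H / 袁超	CERN is the world's largest particle physics research organization; each year more than 20 PB of research data from the Large Hadron Collider must be stored and analyzed. CERN recently discovered the Higgs particle (the so-called “God particle”), the last particle in current matter theory to be found, a discovery said to shed light on the mysterious nature of the dark matter that fills the universe. Huawei has industry-leading storage experts and strong technical experience. In early 2012, the Huawei big data storage system was delivered to CERN, where it showed excellent read/write performance and scalability in the large-scale parallel computing and data storage behind the discovery of the Higgs particle. The system also has intelligent self-healing, which greatly reduces maintenance costs while improving the availability and reliability of the storage system. “CERN is reaching its limits in running data-intensive simulation and analysis. Our collaboration with Huawei has shown us an exciting new approach, and we see that the sound architectural design of Huawei's big data storage will let CERN handle the challenge of future exabyte-scale data volumes with ease,” said 鲍勃•琼斯 (Bob Jones), head of CERN openlab.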
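搜狗 (Sogou)	搜狗知立方 (Sogou Knowledge Cube)	Booth J / 王伟达	To make it simpler for users to get information, Sogou Search has released “知立方”, a new knowledge-base search engine and the first knowledge-base search product in China's search engine industry. 知立方 integrates massive amounts of fragmented information from the internet and re-optimizes the computation of search results to present the most essential information to users. This requires departing from traditional “keyword search”: rather than simply crawling web page data, it introduces “semantic understanding” technology to try to understand the user's search intent, so that accurate results can be delivered to the user.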
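重庆大学 (Chongqing University)	Chinese Computing Research at Chongqing University	Booth K / 李宽	The Chinese computing research team at Chongqing University currently focuses on text classification, including text representation and short-text classification. Text representation is an important step in text classification. To overcome the bag-of-words model's excessive dimensionality and its inability to represent relations between words, as well as the excessive complexity of existing semantic analysis methods, the team proposed concise semantic analysis, which builds a concept space from document categories to achieve compact semantic analysis. Experiments show that in text classification this method matches or exceeds the best results of other representation methods while greatly reducing computation time. Short-text classification is a recent research hot spot. One of the team's studies classifies English-Chinese bilingual example sentences as spoken or written English to help English learners in China tell the two registers apart, mainly exploring how to make better use of various features in information-poor short texts. Experiments show that feature combinations such as “Chinese characters + sentence length + average number of syllables” achieve good classification results.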

NLP&CC 2013 Campus Open Day (校园开放日)

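Time: Nov 19, 2013, 14:00-17:00 Venue: Main Teaching Building, Chongqing University (重庆大学主教学楼)
Academic exchange	Prof. 朱晓瑾
Technical exchange	◆微软亚洲研究院 (Microsoft Research Asia) ◆哈尔滨工业大学 (Harbin Institute of Technology) ◆四川省计算机研究院 ◆数字出版技术国家重点实验室 (State Key Laboratory of Digital Publishing Technology) ◆重庆大学 (Chongqing University) ◆华为技术有限公司 (Huawei Technologies Co., Ltd.; demo vehicle)
Schedule	◆14:00-15:00, 朱晓瑾 ◆15:00-15:30, institutional presentation, 哈尔滨工业大学 (Harbin Institute of Technology) ◆15:30-16:00, institutional presentation, 数字出版技术国家重点实验室 (State Key Laboratory of Digital Publishing Technology) ◆16:00-16:30, institutional presentation, 华为技术有限公司 (Huawei Technologies Co., Ltd.) ◆16:30-17:00, academic and technical exchange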

[--End--]

Hosted by
Organized by
Co-organized by
Sponsored by