Journal of Liaoning Petrochemical University

Journal of Liaoning Petrochemical University, 2017, Vol. 37, Issue (4): 61-64. DOI: 10.3969/j.issn.1672-6952.2017.04.014

Research on Keyword Extraction Algorithm Based on Improved TF-IDF

Jia Qiang1, Feng Xiwei1, Wang Zhifeng1, Zhu Rui1, Qin Hang2   

  1. School of Computer and Communicating Engineering, Liaoning Shihua University, Fushun, Liaoning 113001, China; 2. Teacher Continuing Education School of Wanghua District, Fushun City, Liaoning Province, Fushun, Liaoning 113001, China
  • Received: 2017-03-08; Revised: 2017-04-11; Published: 2017-08-25; Online: 2017-08-29

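  • Corresponding author: Feng Xiwei (b. 1970), male, Ph.D., professor; research interests: semantic web, distributed computing, and computer network technology. E-mail: feng.xw@163.com
  • About the author: Jia Qiang (b. 1989), male, master's student; research interests: semantic web and Hadoop big data processing. E-mail: 616649172@qq.com
  • Funding:
    Liaoning Provincial Education Science "13th Five-Year Plan" Project (JG16DB253); Liaoning Shihua University Education and Teaching Reform Research Project (20165230060003).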

Abstract: Among text feature word extraction algorithms, TF-IDF is the most common method for calculating feature weights. Building on the traditional TF-IDF algorithm, a new keyword extraction algorithm based on word length is proposed. Chinese phrase segmentation is used to identify long words and ordinary words in the text, the proposed TF-IDF-WL method recomputes the weights of words of different lengths, and the keywords are obtained by ranking the words by weight. Experimental comparison shows that the new feature word extraction algorithm reflects the word length of feature words more accurately and, compared with the traditional TF-IDF algorithm, achieves considerably higher accuracy and recall.

Key words: TF-IDF, Keyword extraction, Word length, Text preprocessing, Text classification
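
For orientation, the pipeline summarized in the abstract (compute TF-IDF weights, rescale them by word length, rank by weight) can be sketched in Python. This is a minimal illustration and not the authors' TF-IDF-WL method: the abstract does not give the TF-IDF-WL formula, so the `length_factor` argument and the `1 + log(length)` factor used in the usage lines are assumptions made only for this sketch. The function names are hypothetical, and documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Standard TF-IDF for a list of tokenized documents.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def length_weighted(scores, length_factor):
    """Rescale each TF-IDF score by a function of the word's length.

    ASSUMPTION: length_factor is a placeholder; the paper's actual
    TF-IDF-WL weighting is not stated in the abstract.
    """
    return [
        {term: score * length_factor(len(term)) for term, score in doc.items()}
        for doc in scores
    ]


def top_keywords(doc_scores, k):
    """Return the k highest-weighted terms (ties broken alphabetically)."""
    return sorted(doc_scores, key=lambda t: (-doc_scores[t], t))[:k]


# Usage with toy, pre-segmented documents (illustrative data only).
docs = [
    ["data", "mining", "text"],
    ["text", "classification", "text"],
    ["keyword", "extraction", "data"],
]
base = tf_idf(docs)
# Hypothetical length factor: longer words receive a mild boost.
boosted = length_weighted(base, lambda n: 1 + math.log(n))
print(top_keywords(boosted[0], 2))
```

Passing the length factor as a function keeps the standard TF-IDF step separate from the length adjustment, so a different weighting can be substituted without touching the rest of the sketch.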

Cite this article

Jia Qiang, Feng Xiwei, Wang Zhifeng, Zhu Rui, Qin Hang. Research on Keyword Extraction Algorithm Based on Improved TF-IDF[J]. Journal of Liaoning Petrochemical University, 2017, 37(4): 61-64.