Web Page Information Extraction Based on HTML Structural Features

Journal of Liaoning Petrochemical University ›› 2009, Vol. 29 ›› Issue (3): 65-69.

胡瑜¹,王立志²

1. School of Computer Science and Technology, Tianjin University, Tianjin 300072; 2. School of Management, Tianjin University, Tianjin 300072

Received: 2008-12-09  Published: 2009-09-25  Online: 2017-07-05
About the author: HU Yu (b. 1970), male, from Jinan, Shandong; senior engineer and doctoral candidate

Page Information Extraction Based on the Structure of the HTML

HU Yu¹, WANG Li-zhi²

1. Department of Computer Science and Technology, Tianjin University, Tianjin 300072, P.R. China; 2. Department of Management, Tianjin University, Tianjin 300072, P.R. China

Received: 2008-12-09  Published: 2009-09-25  Online: 2017-07-05

Abstract

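Abstract: Much of the information on the Web is stored in HTML pages. The traditional way to extract data from web pages is to use a wrapper to pull out the data of interest, but acquiring the pattern-recognition knowledge that a wrapper needs is time-consuming, laborious work that demands considerable intelligence. This paper avoids wrappers. Targeting the structural characteristics of news web pages, it divides the page space, from a visual perspective, into noise and information entities and classifies each region accordingly. It discusses a method for extracting the main body of a news item from the hierarchical structure of the news page and the statistical information of the nodes at each level. The traditional DOM model is improved by adding attributes such as level and style as the basis for noise judgment, and by attaching statistical information to its nodes. Using the outwardly visible features of news, such as the headline and the time, a method is proposed and implemented that combines forward direct extraction with reverse noise-reduction extraction to obtain structured data from news web pages. Experimental results show the effectiveness of this method for extracting the main information of news web pages.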

Keywords: information extraction, DOM, LA-DOM, HTML parsing, noise marking

Abstract: A large amount of information on the Web is stored as HTML documents. The traditional method of web page data extraction is to use a wrapper to collect the data of interest. A wrapper needs pattern-recognition knowledge, and acquiring that knowledge is time-consuming, laborious work that requires high intelligence. Without using a wrapper, and based on the structural features of news web pages, the page's space structure is partitioned, from the visual perspective, into noise and information entities. A method of extracting the principal part of a news web page is discussed, which relies on the page's hierarchical structure and on node statistical information. The traditional DOM model is improved by adding hierarchy and style attributes for distinguishing noise from the principal part, and statistical information is added to the DOM nodes. By utilizing the special format of news headlines and time strings, a method that combines positive information extraction with negative noise reduction is proposed and implemented to obtain structured data from news web pages. Experiments show that the method is effective for extracting the main information of news web pages.

Key words: Information extraction, DOM, LA-DOM, HTML parse, Noise mark
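
The abstract gives no implementation details of the improved DOM model, so the following Python sketch is only an illustration of the general idea: a DOM-like tree whose nodes carry a level, a style attribute, and text statistics, followed by forward extraction of the headline and time and reverse removal of noisy blocks. All names (`Node`, `TreeBuilder`, `extract`), the link-density threshold of 0.3, the choice of `<h1>` for the headline, and the date pattern are assumptions made for this example, not the authors' design; the parser also assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}   # tags with no end tag
BLOCKS = {"div", "td", "article", "section"}          # candidate containers
DATE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2})?")

class Node:
    """DOM node extended with a layer (depth), style, and text statistics."""
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.parent = tag, parent
        self.items = []                      # child Nodes and text, in order
        self.layer = 0 if parent is None else parent.layer + 1
        self.style = dict(attrs).get("style") or ""
        self.text_len = 0                    # text characters in the subtree
        self.link_len = 0                    # of those, characters inside <a>

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#root")
        self.in_link = 0                     # number of open <a> ancestors

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, attrs, self.cur)
        self.cur.items.append(node)
        self.cur = node
        self.in_link += tag == "a"

    def handle_endtag(self, tag):
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is None:
            return                           # stray end tag: ignore it
        while True:                          # pop up to and including the match
            popped, self.cur = self.cur, self.cur.parent
            self.in_link -= popped.tag == "a"
            if popped is node:
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or self.cur.tag in ("script", "style"):
            return
        self.cur.items.append(text)
        node = self.cur
        while node is not None:              # update statistics on every ancestor
            node.text_len += len(text)
            if self.in_link:
                node.link_len += len(text)
            node = node.parent

def walk(node):
    yield node
    for child in node.items:
        if isinstance(child, Node):
            yield from walk(child)

def text_of(node):
    parts = [c if isinstance(c, str) else text_of(c) for c in node.items]
    return " ".join(p for p in parts if p)

def is_noise(node, max_link_density):
    """Hypothetical noise rule: hidden, empty, or dominated by link text."""
    hidden = "display:none" in node.style.replace(" ", "")
    density = node.link_len / node.text_len if node.text_len else 1.0
    return hidden or density > max_link_density

def body_parts(node):
    """Text held directly by a block or by its <p> children."""
    return [c if isinstance(c, str) else text_of(c)
            for c in node.items if isinstance(c, str) or c.tag == "p"]

def extract(html, max_link_density=0.3):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(walk(builder.root))
    # Forward extraction: headline and time have recognizable forms.
    title = next((text_of(n) for n in nodes if n.tag == "h1"), None)
    stamp = DATE.search(text_of(builder.root))
    # Reverse denoising: discard noisy blocks, keep the most text-rich one,
    # preferring the deeper layer on ties.
    clean = [n for n in nodes
             if n.tag in BLOCKS and not is_noise(n, max_link_density)]
    best = max(clean, default=None,
               key=lambda n: (sum(map(len, body_parts(n))), n.layer))
    return {"title": title,
            "time": stamp.group(0) if stamp else None,
            "body": " ".join(body_parts(best)) if best is not None else ""}
```

Calling `extract(page_html)` returns a dictionary with `title`, `time`, and `body` keys. Navigation bars and footers made up mostly of links are rejected by the link-density rule, which stands in here for whatever noise criteria the paper actually uses.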

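HU Yu, WANG Li-zhi. Web Page Information Extraction Based on HTML Structural Features[J]. Journal of Liaoning Petrochemical University, 2009, 29(3): 65-69. (in Chinese)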

HU Yu,WANG Li-zhi. Page Information Extraction Based on the Structure of the HTML[J]. Journal of Liaoning Petrochemical University, 2009, 29(3): 65-69.