Design and Implementation of a Web Information Extraction System Based on HTMLParser

Journal of Liaoning Petrochemical University, 2006, Vol. 26, Issue 2: 83-86.

LI Yan-gang, WEI Hai-ping*, HOU Xing-hua

School of Computer and Communication Engineering, Liaoning University of Petroleum & Chemical Technology, Fushun, Liaoning 113001, P. R. China

Received: 2005-12-02 Published: 2006-02-25 Online: 2006-02-25
About the author: LI Yan-gang (1980-), male, from Zhengzhou, Henan; master's student.

Abstract

The rapid growth of information on the Internet creates an urgent need for automated tools that help people quickly find the information they actually need, such as titles, links, email addresses, and images, within massive information sources. Web pages expressed in HTML are, once parsed by a browser, suitable only for browsing; they are not suitable for machine processing as a means of data exchange. This paper introduces the principles of HTMLParser and the relevant background on Java regular expressions. Based on the HTMLParser package and regular expressions, and taking the extraction of email information within a website as an example, it proposes a design for a Web information extraction system, describes the working principle and key techniques of email extraction, presents an email extraction algorithm, and describes in detail the system's URL extraction, email extraction, and storage modules. The extraction results are stored in a database for machine retrieval.

Key words: Information extraction, Regular expression, HTMLParser package, Java
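
The email-extraction idea summarized in the abstract can be illustrated with a minimal Java sketch. It is not the authors' implementation: the paper's algorithm and regular expression are not reproduced on this page, so the class name `EmailExtractorSketch`, both patterns, and the sample page are assumptions made for illustration. To stay self-contained, a plain regular expression stands in for HTMLParser's link extraction, and the database storage module is omitted.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and patterns are assumptions, not taken from the paper.
class EmailExtractorSketch {

    // Simplified email pattern (assumption); it does not cover every valid address form.
    private static final Pattern EMAIL = Pattern.compile(
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}");

    // Regex stand-in for HTMLParser link extraction (assumption): quoted href values only.
    private static final Pattern HREF = Pattern.compile(
        "href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Collects distinct email addresses from raw HTML, in order of first appearance.
    static Set<String> extractEmails(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Collects distinct link targets to crawl next, skipping mailto: links
    // and dropping the fragment part (everything from '#').
    static Set<String> extractLinks(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(1).trim();
            int hash = target.indexOf('#');
            if (hash >= 0) {
                target = target.substring(0, hash);
            }
            if (!target.isEmpty() && !target.toLowerCase().startsWith("mailto:")) {
                links.add(target);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample page used only to exercise the two methods.
        String page = "<a href=\"/staff.html\">Staff</a> "
            + "<a href=\"mailto:admin@example.edu.cn\">Mail</a> "
            + "Contact: info@example.com or admin@example.edu.cn.";
        System.out.println("Emails: " + extractEmails(page));
        System.out.println("Links:  " + extractLinks(page));
    }
}
```

In a crawler built along these lines, the targets returned by `extractLinks` would be resolved against the page URL and queued for fetching, while the addresses returned by `extractEmails` would be passed to the storage module.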

LI Yan-gang, WEI Hai-ping, HOU Xing-hua. Design and Implementation of Web Information Extraction System Based on HTMLParser[J]. Journal of Liaoning Petrochemical University, 2006, 26(2): 83-86.
