Employing Clustering Techniques for Automatic Information Extraction From HTML Documents

Ashraf, Fatima; Özyer, Tansel; Alhajj, Reda

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.11851/6633

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ashraf, Fatima	-
dc.contributor.author	Özyer, Tansel	-
dc.contributor.author	Alhajj, Reda	-
dc.date.accessioned	2021-09-11T15:43:01Z	-
dc.date.available	2021-09-11T15:43:01Z	-
dc.date.issued	2008	-
dc.identifier.issn	1094-6977	-
dc.identifier.issn	1558-2442	-
dc.identifier.uri	https://doi.org/10.1109/TSMCC.2008.923882	-
dc.identifier.uri	https://hdl.handle.net/20.500.11851/6633	-
dc.description.abstract	In the past few years, there has been an exponential increase in the amount of information available on the World Wide Web. This plethora of information can be extremely beneficial for users. However, the amount of human intervention that is currently required for this is inconvenient. Information extraction (IE) systems try to solve this problem by making the task as automatic as possible. Most of the existing approaches, however, require user feedback in one form or another during the extraction. This paper proposes a system that employs clustering techniques for automatic IE from HTML documents containing semistructured data. Using domain-specific information provided by the user, the proposed system parses and tokenizes the data from an HTML document, partitions it into clusters containing similar elements, and estimates an extraction rule based on the pattern of occurrence of data tokens. The extraction rule is then used to refine clusters, and finally, the output is reported. We employed a multiobjective genetic-algorithm-based clustering approach in the process; it is capable of finding the number of clusters and the most natural clustering. The proposed approach is tested by conducting experiments on a number of Web sites from different domains. To demonstrate the effectiveness of this approach, the results of the experiments are tested against those reported in the literature, and prove comparable.	en_US
dc.language.iso	en	en_US
dc.publisher	IEEE-Inst Electrical Electronics Engineers Inc	en_US
dc.relation.ispartof	IEEE Transactions On Systems Man And Cybernetics Part C-Applications And Reviews	en_US
dc.rights	info:eu-repo/semantics/closedAccess	en_US
dc.subject	clustering	en_US
dc.subject	Hypertext Markup Language (HTML) documents	en_US
dc.subject	information extraction (IE)	en_US
dc.subject	Web pages	en_US
dc.title	Employing Clustering Techniques for Automatic Information Extraction From HTML Documents	en_US
dc.type	Article	en_US
dc.department	Faculties, Faculty of Engineering, Department of Computer Engineering	en_US
dc.identifier.volume	38	en_US
dc.identifier.issue	5	en_US
dc.identifier.startpage	660	en_US
dc.identifier.endpage	673	en_US
dc.identifier.wos	WOS:000259192000004	-
dc.identifier.scopus	2-s2.0-50649094223	-
dc.institutionauthor	Özyer, Tansel	-
dc.identifier.doi	10.1109/TSMCC.2008.923882	-
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member	en_US
dc.identifier.scopusquality	N/A	-
item.cerifentitytype	Publications	-
item.languageiso639-1	en	-
item.grantfulltext	none	-
item.openairetype	Article	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.fulltext	No Fulltext	-
crisitem.author.dept	02.1. Department of Artificial Intelligence Engineering	-
Appears in Collections:	Department of Computer Engineering; Scopus Indexed Publications Collection; WoS Indexed Publications Collection


SCOPUS™ Citations: 28 (checked on Sep 6, 2025)

WEB OF SCIENCE™ Citations: 13 (checked on Sep 6, 2025)

Page view(s): 214 (checked on Sep 8, 2025)
