Publication: A tree learning approach to web document sectional hierarchy extraction
Publication Date
2010
Type
Book chapter
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States
Abstract
Documents are increasingly available in electronic form owing to the widespread use of the Internet. Hypertext Markup Language (HTML), which is mostly concerned with the presentation of documents, is still the most commonly used format on the Web, despite the appearance of semantically richer markup languages such as XML. Effective processing of Web documents has several uses, such as summarization and the display of content on small-screen devices. In this paper, we investigate the problem of identifying the sectional hierarchy of a given HTML document together with the headings in the document. We propose and evaluate a learning approach, based on Support Vector Machines, that is suited to tree representations.
Subject
Machine Learning, Document Structure, World Wide Web, Hypertext Markup Language