
Please use this identifier to cite or link to this item: http://hdl.handle.net/10277/278

Authors: Carullo, Moreno
Title: Web content mining with multi-source machine learning for intelligent web agents.
Abstract: The web is recognized as the largest data source in the world. Its data has little or no structure and, worse, there is no standard schema even for the small volume of data that is structured. Web Mining aims to extract useful knowledge from the Web using a variety of techniques that must cope with this heterogeneity and with the lack of a single, fixed way of representing information. An important aspect of Web Mining is automating extraction rules with suitable algorithms. Machine Learning techniques have been applied successfully to Web Mining and Information Extraction tasks thanks to their generalization and adaptation capabilities, which are a key requirement for heterogeneous, general-content web pages. The World Wide Web is a graph, more precisely a directed labeled graph in which the nodes are pages and the edges are the links between them. Recent works propose exploiting this web structure (Link Analysis) for content extraction; for example, one can use the content category of neighboring pages to categorize difficult web pages for which word-frequency-based techniques are not robust enough. In this thesis we propose an automated method, based on Machine Learning and Link Analysis, that is suitable for a wide range of domains. In particular, we propose an inductive model that, after being trained with proper input data, recognizes the content pages where structured information is located. To keep recognition fast enough for real-world applications, an additional algorithm is proposed that improves both the speed and the quality of the approach. The proposed method has been tested on a controlled dataset in a classic train-and-test scenario and in a real-world web crawling system.
Keywords: web mining, machine learning
Issue Date: 2011
Language: en
Doctoral course: Informatica
Academic cycle: 23
Publisher: Università degli Studi dell'Insubria
Citation: Carullo, M. Web content mining with multi-source machine learning for intelligent web agents. (Doctoral Thesis, Università degli Studi dell'Insubria, 2011).
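
The abstract describes the web as a directed graph of pages and links, and mentions using the categories of neighboring pages to categorize a difficult page. A minimal sketch of that general idea follows. It is not the thesis's method (which is an inductive Machine Learning model); it swaps in a plain majority vote over linked neighbors, and the function name `classify_by_neighbors`, the example URLs, and the labels are all hypothetical.

```python
from collections import Counter


def classify_by_neighbors(links, known_labels, page):
    """Guess a page's category by majority vote over its linked neighbors.

    links: dict mapping a page to the pages it links to (directed edges).
    known_labels: dict mapping already-categorized pages to a category.
    Returns None when no labeled neighbor exists.
    """
    # Treat both outgoing and incoming links as neighbors (an assumption).
    neighbors = set(links.get(page, ()))
    neighbors.update(src for src, dsts in links.items() if page in dsts)
    neighbors.discard(page)

    # Count the categories of the neighbors whose label is known.
    votes = Counter(known_labels[n] for n in neighbors if n in known_labels)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy site: "/item/3" has no label of its own.
links = {
    "/catalog": ["/item/1", "/item/2", "/item/3"],
    "/item/1": ["/item/3"],
    "/item/2": ["/item/3"],
    "/about": ["/item/3"],
}
known_labels = {
    "/catalog": "index",
    "/item/1": "product",
    "/item/2": "product",
    "/about": "info",
}

guess = classify_by_neighbors(links, known_labels, "/item/3")
print(guess)
```

In this toy graph two of the four labeled neighbors of `/item/3` are product pages, so the vote favors `product`. A real system would presumably combine such link evidence with features of the page itself rather than rely on the vote alone.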

Files in This Item:

File: Phd_thesis_carullo_completa.pdf
Description: Full text of the thesis
Size: 3.3 MB
Format: Adobe PDF

Items in InsubriaSPACE are protected by copyright, with all rights reserved, unless otherwise indicated.
