Università degli Studi dell'Insubria Insubria Space

Use this identifier to cite or link to this document: http://hdl.handle.net/10277/278

Authors: Carullo, Moreno
Tutor not affiliated with the University: CRESTANI, FABIO
Title: Web content mining with multi-source machine learning for intelligent web agents.
Abstract: The web is recognized as the largest data source in the world. Such data has partial or no structure and, worse, there is no standard data schema even for the small volume of data that is structured. Web Mining aims to extract useful knowledge from the Web by using a variety of techniques that have to cope with heterogeneity and with the lack of a unique, fixed way of representing information. An important role in Web Mining is played by the automation of extraction rules with suitable algorithms. Machine Learning techniques have been successfully applied to Web Mining and Information Extraction tasks thanks to their generalization and adaptation capabilities, which are a key requirement for general-content, heterogeneous web pages. The World Wide Web is a graph, more precisely a directed labeled graph whose nodes are the pages and whose edges are the links between them. Recent work proposes exploiting the web structure (Link Analysis) for content extraction; for example, one can leverage the content category of neighboring pages to categorize the contents of difficult web pages where word-frequency-based techniques are not robust enough. In this thesis we propose an automated method, based on Machine Learning and Link Analysis, that is suitable for a wide range of domains. In particular, we propose an inductive model that, after being trained with proper input data, is able to recognize content pages where structured information is located. To keep recognition fast enough for real-world applications, an additional algorithm is proposed that improves both the speed and the quality of the approach. The proposed method has been tested with a controlled dataset in a classic train-and-test scenario and in a real-world web crawling system.
Keywords: web mining, machine learning
Date: 2011
Language: en
Doctoral programme: Computer Science (Informatica)
Doctoral cycle: 23
Degree-granting university: Università degli Studi dell'Insubria
Citation: Carullo, M. Web content mining with multi-source machine learning for intelligent web agents. (Doctoral Thesis, Università degli Studi dell'Insubria, 2011).

Full text:

File | Description | Size | Format | Access
Phd_thesis_carullo_completa.pdf | Full text of the thesis | 3.3 MB | Adobe PDF | View/open

All documents archived in InsubriaSPACE are protected by copyright. All rights reserved.
