Construction and Analysis of Web-Based Computer Science Information Networks

Speaker:
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign

ABSTRACT
WINACS (Web-based Information Network Analysis for Computer Science) is an ongoing research project that incorporates many recent, exciting developments in data science to construct a Web-based computer science information network and to discover, retrieve, rank, cluster, and analyze the information in that network. With the rapid development of the Web, huge amounts of information have become available in the form of Web documents, structures, and links. It has long been a dream of the database and Web communities to harvest such information and to reconcile the unstructured nature of the Web with the neat, semi-structured schemas of the database paradigm. Even though databases are currently used to generate Web content on some sites, the schemas of these databases are rarely consistent across a domain. However, with recent research in Web structure mining and information network analysis, it has become realistic to discover hidden Web structures, construct heterogeneous information networks by integrating information from structured databases and Web content, and perform in-depth analysis to systematically harvest the rich information on the Web.
Taking computer science as a dedicated domain, WINACS first discovers Web entity structures and then integrates the contents of the DBLP database with those on the Web to construct a heterogeneous computer science information network. With this structure in hand, WINACS is able to rank, cluster, and analyze the network and to support intelligent, analytical queries. In this talk, we will discuss the principles of information network-based Web mining, show multiple salient features of WINACS, and demonstrate how computer science Web pages and DBLP can be integrated to support queries and mining in user-friendly and intelligent ways. We envision that these methodologies can be extended to handle many other exciting information networks extracted from the Web, such as those for general academia, governments, sports, and so on.
This project is being developed by the Data Mining Research Group in the Department of Computer Science, University of Illinois at Urbana-Champaign, and is supported by multiple research funds.