At present, the Internet is in widespread use and many people rely on it to find information. The variety of web pages and the frequent changes to their content make searching for and extracting information very difficult. When Internet users want information, they first visit a search engine such as Yahoo or Google and then visit all the web sites suggested by the search engine.
Many researchers (7), (10), (16), (17) have studied the extraction of information from web pages in different domains (traveling, products, business intelligence), but these studies deal with a limited set of web pages, and the user still needs to use search engines such as Yahoo and Google to collect more information.
Many of the web pages that corporations use to announce their products (Internet shops) consist of attributes, sub attributes and values of sub attributes. The sub attributes and their values represent the relevant information that the user needs. Products in a single group (web pages) in a single store (Internet shop) tend to have the same attributes, while products in different groups (web pages) have different sets of attributes, for instance:
* One Internet shop presents an attribute, another does not
* The same attribute is identified differently
* The same attribute contains different kinds of data (sub attributes)
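The three kinds of mismatch listed above can be illustrated with a minimal sketch. The two product records, the attribute names and the `SYNONYMS` table are hypothetical and chosen only for illustration; a real system would have to learn or curate such a mapping.

```python
# Hypothetical records for the same product taken from two Internet shops.
shop_a = {"Brand": "Acme", "Screen size": "15.6 in", "Weight": "2.1 kg"}
shop_b = {
    "Manufacturer": "Acme",
    "Display": {"Size": "15.6 in", "Resolution": "1920x1080"},
}

# Assumed synonym table mapping shop-specific names to a canonical name.
SYNONYMS = {"manufacturer": "brand", "display": "screen size"}


def canonical(name):
    """Lower-case an attribute name and map known synonyms to one name."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def normalize(record):
    """Re-key a record by canonical attribute names."""
    return {canonical(k): v for k, v in record.items()}


def compare(a, b):
    """Report which attributes are shared, missing, or structured differently."""
    a, b = normalize(a), normalize(b)
    shared = sorted(a.keys() & b.keys())
    return {
        "shared": shared,
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # One shop gives a plain value, the other gives sub attributes.
        "different_structure": [
            k for k in shared if isinstance(a[k], dict) != isinstance(b[k], dict)
        ],
    }


report = compare(shop_a, shop_b)
```

Here `Weight` appears in only one shop, `Brand` and `Manufacturer` name the same attribute differently, and the screen attribute is a plain value in one shop but a set of sub attributes in the other.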
We have proposed a framework for extracting and classifying web pages that consists of three main components: (i) Query Interface (QI), which accepts the user's queries and searches for matching web pages through a search engine; (ii) Information Extraction (IE), which extracts the relevant information from the web pages obtained from QI; and (iii) Relevant Information Analyzer (RIA), which analyses the extracted information and removes repeated information about the same product.
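The way the three components pass data to one another can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the framework: `FAKE_WEB` is a hypothetical in-memory stand-in for a search engine, the pages are assumed to list attributes as two-column table rows under an `h1` product title, and all function names are invented for the example.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the web: URL -> page HTML.
FAKE_WEB = {
    "http://shop-a.example/p1": (
        "<h1>Acme X1</h1><table>"
        "<tr><td>Weight</td><td>2.1 kg</td></tr></table>"
    ),
    "http://shop-b.example/x1": (
        "<h1>Acme X1</h1><table>"
        "<tr><td>Weight</td><td>2.1 kg</td></tr>"
        "<tr><td>Color</td><td>Black</td></tr></table>"
    ),
    "http://shop-b.example/about": "<h1>About us</h1>",
}


def query_interface(query):
    """QI: return pages that mention every query term (toy search)."""
    terms = query.lower().split()
    return [
        html for html in FAKE_WEB.values()
        if all(term in html.lower() for term in terms)
    ]


class _CellCollector(HTMLParser):
    """Collect the h1 title and the text of every table cell."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.cells = []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        self._tag = tag

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "h1":
            self.title = text
        elif self._tag == "td":
            self.cells.append(text)


def information_extraction(html):
    """IE: turn one page into a product name and attribute/value pairs."""
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    # Cells alternate: attribute name, then its value.
    pairs = dict(zip(parser.cells[0::2], parser.cells[1::2]))
    return {"product": parser.title, "attributes": pairs}


def relevant_information_analyzer(records):
    """RIA: merge records of the same product, dropping repeated attributes."""
    merged = {}
    for record in records:
        attributes = merged.setdefault(record["product"], {})
        for name, value in record["attributes"].items():
            attributes.setdefault(name, value)  # keep the first value seen
    return merged


def run(query):
    pages = query_interface(query)
    records = [information_extraction(page) for page in pages]
    return relevant_information_analyzer(records)


result = run("acme x1")
```

In this toy run both shop pages describe the same product, so the analyzer keeps the repeated `Weight` attribute once and adds the `Color` attribute that only the second page provides.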
Related work: Many researchers have proposed approaches for extracting information from HTML web pages, as discussed below.
The Information Systems Universal Data Browser (IS UDB) (7), proposed by Guntis Arnicans and Girts Karnitis, is used for searching, extracting, analyzing, classifying, translating, storing, integrating and browsing HTML data. The IS UDB deals with a limited set of HTML data sources (web pages), so the user needs to use search engines such as Yahoo and Google to get the required information.
Another stream of research works on agent-based information extraction. Jung et al. (17) proposed an Intelligent Traveler Support System (ITSS) that helps travelers find important travel information more easily and effectively. The system deals with a limited set of web pages related to destinations and weather. Thus, travelers still need to search through numerous web pages, using search engines such as Yahoo and Google, to gather all the necessary information.
Tina Eliassi-Rad and Jude Shavlik (18) proposed the Wisconsin Adaptive Web Assistant (WAWA), a system for rapidly and easily building instructable and self-adaptive software agents that retrieve and extract information. WAWA interacts with the user and an online (textual) environment (e.g., the Web) to build an intelligent agent for retrieving and extracting information. The proposed system still needs to be embedded into a major existing Web browser, thereby minimizing the new interface features that users must learn in order to interact with it, and methods still need to be developed whereby WAWA can automatically infer reasonable training examples by observing users' normal use of their browsers.
Lam et al. (14) proposed a system that uses different methodologies to extract information. The extraction task is based on individual pages only; that is, all the fields of a record are assumed to be contained in the same page. In many situations, however, the fields may be located in different relevant pages, such as several linked web pages. The system therefore needs to be extended to handle multi-page extraction.
Fatima Ashraf et al. (14) employed clustering techniques for automatic information extraction from HTML documents containing HTML data. They proposed a system called ClusTex, which extends the work of Fatima Ashraf and Reda Alhajj (3) by testing the proposed system in different domains such as cell phone sales and marathon schedules. However, if tokens of the same kind differ from each other in format, some tokens are clustered incorrectly.
Saggion et al. (10) proposed the MUSING project (Multi-industry, Semantic-based next generation business intelligence). The MUSING project needs to cover many semantic categories, including locations, organizations and specific business events, to help companies that want to take their business overseas and are interested in knowing the best place to exploit.
Jansen et al. (1) proposed a model to improve web search engines by classifying user searches by intention, in terms of the type of content specified, and operationalizing these classifications by defining …
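The format problem noted above can be shown with a small sketch. It does not reproduce the ClusTex algorithm; it assumes a much simpler, hypothetical scheme in which tokens are grouped by a character-class signature, and the sample tokens are invented.

```python
import re
from collections import defaultdict


def format_signature(token):
    """Replace letter runs with 'A' and digit runs with '9' (assumed scheme)."""
    signature = re.sub(r"[A-Za-z]+", "A", token)
    return re.sub(r"\d+", "9", signature)


def cluster_by_format(tokens):
    """Group tokens that share the same format signature."""
    clusters = defaultdict(list)
    for token in tokens:
        clusters[format_signature(token)].append(token)
    return dict(clusters)


# Three prices and two brand names; one price is written in another format.
tokens = ["$199", "$249", "199 USD", "Nokia", "Samsung"]
clusters = cluster_by_format(tokens)
```

Under this scheme `$199` and `$249` share the signature `$9`, while `199 USD` receives the signature `9 A` and ends up in a cluster of its own, even though all three tokens are prices.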