Improved Vision Based Algorithm for Deep Web Data Extraction

Rashmi Chaudhary, Dr. Arun Solanki


Several systems and languages have been proposed for solving web-data management problems, but none of the existing systems addresses all of these problems from a unified perspective. Most existing link analysis algorithms treat a web page as a single node in the web graph. In most cases, however, a web page contains multiple semantics, so it cannot always be considered an atomic node. This paper proposes a new web content structure analysis based on visual representation. The web page is partitioned into blocks using the vision-based page segmentation (VIPS) algorithm. Many web applications, such as information retrieval, information extraction and automatic page adaptation, can benefit from the proposed methodology. This research presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. The proposed system simulates how a user understands web layout structure based on visual perception. Compared with other existing techniques, the proposed approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure.

Keywords: Deep web, cluster, VIPS, DOM, data region extraction, data record extraction


Rashmi Chaudhary, Dr. Arun Solanki. Improved Vision Based Algorithm for Deep Web Data Extraction. Journal of Web Engineering & Technology. 2015; 2(2): 23–32p.

