Improved Vision Based Algorithm for Deep Web Data Extraction
Several systems and languages have been proposed for solving web-data management problems, but none of the existing systems addresses all of these problems from a unified perspective. Most existing link analysis algorithms treat a web page as a single node in the web graph. In most cases, however, a web page covers multiple semantic topics and therefore should not be treated as an atomic node. This paper proposes a new analysis of web content structure based on visual representation. The web page is partitioned into blocks using the vision-based page segmentation (VIPS) algorithm. Many web applications, such as information retrieval, information extraction and automatic page adaptation, can benefit from the proposed methodology. This research presents an automatic, top-down, tag-tree-independent approach to detecting web content structure. The proposed system simulates how a user understands web layout structure based on visual perception. Compared with existing techniques, the proposed approach is independent of the underlying document representation, such as HTML, and works well even when the HTML structure differs greatly from the layout structure.
Keywords: Deep web, cluster, VIPS, DOM, data region extraction, data record extraction
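The top-down, tag-tree-independent partitioning described in the abstract can be illustrated with a minimal sketch. It is not the VIPS algorithm or the authors' implementation; it is a simplified recursive cut over rendered rectangles, assuming each visual element is already available as a bounding box. The names `Box`, `segment` and `min_gap` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Rendered bounding box of one visual element (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def _split(boxes: List[Box], axis: str, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis, starting a new group at each gap >= min_gap."""
    if axis == "y":
        start = lambda b: b.y
        end = lambda b: b.y + b.h
    else:
        start = lambda b: b.x
        end = lambda b: b.x + b.w
    ordered = sorted(boxes, key=start)
    groups = [[ordered[0]]]
    reach = end(ordered[0])  # farthest extent covered so far on this axis
    for b in ordered[1:]:
        if start(b) - reach >= min_gap:
            groups.append([b])       # whitespace separator found: new block
        else:
            groups[-1].append(b)     # overlaps or is close: same block
        reach = max(reach, end(b))
    return groups


def segment(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Partition boxes into visual blocks by top-down recursive cuts.

    Only rendered geometry is used; no HTML tag tree is consulted.
    """
    if not boxes:
        return []
    if len(boxes) == 1:
        return [list(boxes)]
    # Try horizontal separators first (cut along y), then vertical ones.
    for axis in ("y", "x"):
        groups = _split(boxes, axis, min_gap)
        if len(groups) > 1:
            blocks: List[List[Box]] = []
            for group in groups:
                blocks.extend(segment(group, min_gap))
            return blocks
    return [list(boxes)]  # no separator wide enough: treat as one block
```

Calling `segment` on the boxes of a page with a header, a side navigation column and a list of closely spaced result items would, under these assumptions, return the header, the navigation column and the result list as separate blocks, regardless of how the underlying HTML nests them.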
Cite this Article
Rashmi Chaudhary, Dr. Arun Solanki. Improved Vision Based Algorithm for Deep Web Data Extraction. Journal of Web Engineering & Technology. 2015; 2(2): 23–32p.