
Web Content Extraction Based on Tag-Row-Block Web Mining

T. Velumani

Abstract


Information retrieval and Web mining depend closely on the content of Web pages. Because Web technology has developed rapidly, some earlier automatic Web content extraction methods are no longer well suited to current pages. This paper therefore proposes a universal content extraction method based on Tag-Row-Block (CETRB). CETRB combines the visual and functional features of HTML tags with the row-block distribution function. It improves on the precision, recall, and extraction efficiency of existing Web content extraction methods, and addresses the problems of manual threshold setting and of universality across multi-source information. Our empirical study on different types of real-world Web pages shows that the proposed method extracts content effectively and efficiently from single-content and multi-content pages, in both English and Chinese. It also supports the extraction of multimedia information, and the displayed result stays consistent with the original semantics, which gives a comfortable reading experience. We ran the same experiments with the currently popular Chinese extraction method Cx-Extractor and the English extraction method Readability. Our proposal outperforms both in precision and recall, and is superior to Readability in extraction efficiency. It also has clear advantages in universality and user reading experience.
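For readers unfamiliar with the row-block idea the abstract refers to, the following is a minimal sketch of a generic row-block text-density baseline. It is not the CETRB method: it ignores the visual and functional tag features, and it relies on a manually chosen window size, which is exactly the kind of manual setting the paper aims to remove. The function names and the window size `k` are assumptions made for illustration.

```python
import re


def strip_tags(html):
    """Drop script/style blocks and all tags, keeping one text row per source line."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", "", html)
    return [line.strip() for line in text.split("\n")]


def block_lengths(rows, k=3):
    """Total text length of each sliding window of k consecutive rows."""
    return [sum(len(r) for r in rows[i:i + k]) for i in range(len(rows) - k + 1)]


def extract_main_text(html, k=3):
    """Return the text of the densest contiguous run of non-empty row blocks."""
    rows = strip_tags(html)
    lengths = block_lengths(rows, k)
    if not lengths:
        # Fewer than k rows: fall back to whatever text is present.
        return "\n".join(r for r in rows if r)
    # Start from the window holding the most text ...
    peak = max(range(len(lengths)), key=lengths.__getitem__)
    # ... and grow in both directions until a window with no text is reached.
    start = peak
    while start > 0 and lengths[start - 1] > 0:
        start -= 1
    end = peak
    while end + 1 < len(lengths) and lengths[end + 1] > 0:
        end += 1
    return "\n".join(r for r in rows[start:end + k] if r)
```

In this sketch, navigation links and footers are excluded only when at least `k` empty rows separate them from the main text, which illustrates why a fixed window generalizes poorly across page layouts.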

Keywords: Content extraction, information retrieval, Web data mining, knowledge-based systems, Web content.

Cite this Article: T. Velumani. Web Content Extraction Based on Tag-Row-Block Web Mining. Journal of Web Engineering & Technology. 2020; 7(2): 18–31p.



