GenDE: A CRF-Based Data Extractor

Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks of wrapper verification and extractor generation. A wrapper verifier checks whether a new page from a site complies with the detected schema,...

Bibliographic Details
Main Author: Kayed, Mohammed (author)
Other Authors: Shaalan, Khaled (author)
Published: 2020
Subjects: Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction.
Online Access: https://bspace.buid.ac.ae/handle/1234/3044
https://doi.org/10.13052/jwe1540-9589.19342.
author Kayed, Mohammed
author2 Shaalan, Khaled
author2_role author
author_facet Kayed, Mohammed
Shaalan, Khaled
author_role author
dc.creator.none.fl_str_mv Kayed, Mohammed
Shaalan, Khaled
dc.date.none.fl_str_mv 2020
2025-05-14T14:26:00Z
2025-05-14T14:26:00Z
dc.identifier.none.fl_str_mv Kayed, M. and Shaalan, K. (2020) “GenDE: A CRF-Based Data Extractor,” Journal of Web Engineering, 19(3-4), pp. 371–404.
1540-9589, 1544-5976
https://bspace.buid.ac.ae/handle/1234/3044
https://doi.org/10.13052/jwe1540-9589.19342.
dc.language.none.fl_str_mv en
dc.publisher.none.fl_str_mv River Publishers
dc.relation.none.fl_str_mv Journal of Web Engineering v19 n3-4 (2020): 371-404
dc.subject.none.fl_str_mv Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction.
dc.title.none.fl_str_mv GenDE: A CRF-Based Data Extractor
dc.type.none.fl_str_mv Article
description Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks of wrapper verification and extractor generation. A wrapper verifier checks whether a new page from a site complies with the detected schema, so that the extractor can use the wrapper to get instances of the schema types. If the wrapper fails on the new page, a new wrapper/schema is regenerated by calling an unsupervised wrapper induction system. In this paper, a new data extractor called GenDE is proposed. It verifies the site schema and extracts data from Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking down an observation sequence (a Web page) into simpler subsequences that are labeled using a CRF. Moreover, the system addresses automatic data extraction from modern JavaScript sites in which data/schema are attached (on the client side) in JSON format. The experiments show encouraging results: GenDE outperforms the CSP-based extractor algorithm (95% recall and 96% precision). Moreover, it performs well when tested on the SWDE benchmark dataset (84.91%).
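The verify, extract, and regenerate flow described in the abstract can be illustrated with a minimal Python sketch. This is not GenDE's implementation: the `Wrapper` class, `process_page`, and `toy_induce` are hypothetical names introduced here, and regular expressions stand in for the CRF-based labeling that the paper actually uses. The sketch only shows the control flow: check the page against the current wrapper, fall back to re-induction when the check fails, then extract.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Wrapper:
    """Hypothetical wrapper: one regex with a single capture group per schema field."""
    patterns: Dict[str, str]

    def verify(self, page: str) -> bool:
        # Treat the page as schema-compliant only if every field pattern matches.
        return all(re.search(p, page) for p in self.patterns.values())

    def extract(self, page: str) -> Dict[str, str]:
        # Call only after verify() succeeds; returns the captured value per field.
        return {name: re.search(p, page).group(1)
                for name, p in self.patterns.items()}


def process_page(
    page: str,
    wrapper: Wrapper,
    induce: Callable[[List[str]], Wrapper],
) -> Tuple[Dict[str, str], Wrapper, bool]:
    """Verify the page, re-induce the wrapper if verification fails, then extract."""
    regenerated = False
    if not wrapper.verify(page):
        # Stand-in for calling an unsupervised wrapper induction system.
        wrapper = induce([page])
        regenerated = True
        if not wrapper.verify(page):
            raise ValueError("re-induced wrapper still does not match the page")
    return wrapper.extract(page), wrapper, regenerated


def toy_induce(pages: List[str]) -> Wrapper:
    # Placeholder only: ignores its input and returns a hand-written wrapper
    # for the redesigned template used in the demo below.
    return Wrapper({
        "title": r'<h1 class="name">(.*?)</h1>',
        "price": r"<em>\$(.*?)</em>",
    })


if __name__ == "__main__":
    original = Wrapper({
        "title": r"<h1>(.*?)</h1>",
        "price": r'<span class="price">\$(.*?)</span>',
    })
    old_page = '<h1>Widget</h1><span class="price">$9.99</span>'
    new_page = '<h1 class="name">Gadget</h1><em>$19.50</em>'
    for page in (old_page, new_page):
        record, _, regenerated = process_page(page, original, toy_induce)
        print(record, "regenerated:", regenerated)
```

The function returns the wrapper it ended up using, so a caller can keep the regenerated wrapper for later pages from the same site instead of re-inducing each time.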
id budr_59041614dd5f3be26fbfa4fb0a6952be
language_invalid_str_mv en
network_acronym_str budr
network_name_str The British University in Dubai repository
oai_identifier_str oai:bspace.buid.ac.ae:1234/3044
publishDate 2020
publisher.none.fl_str_mv River Publishers