GenDE: A CRF-Based Data Extractor
Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks: wrapper verification or extractor generation. A wrapper verifier would check whether a new page from a site complies with the detected schema,...
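| Main Author: | Kayed, Mohammed |
|---|---|
| Other Authors: | Shaalan, Khaled |
| Published: | 2020 |
| Subjects: | Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction. |
| Online Access: | https://bspace.buid.ac.ae/handle/1234/3044 https://doi.org/10.13052/jwe1540-9589.19342. |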
| _version_ | 1862980614189219840 |
|---|---|
| author | Kayed, Mohammed |
| author2 | Shaalan, Khaled |
| author2_role | author |
| author_facet | Kayed, Mohammed; Shaalan, Khaled |
| author_role | author |
| dc.creator.none.fl_str_mv | Kayed, Mohammed; Shaalan, Khaled |
| dc.date.none.fl_str_mv | 2020 2025-05-14T14:26:00Z 2025-05-14T14:26:00Z |
| dc.identifier.none.fl_str_mv | Kayed, M. and Shaalan, K. (2020) “GenDE: A CRF-Based Data Extractor,” Journal of Web Engineering, 19(3-4), pp. 371–404. 1540-9589, 1544-5976 https://bspace.buid.ac.ae/handle/1234/3044 https://doi.org/10.13052/jwe1540-9589.19342. |
| dc.language.none.fl_str_mv | en |
| dc.publisher.none.fl_str_mv | River Publishers |
| dc.relation.none.fl_str_mv | Journal of Web Engineering v19 n3-4 (2020): 371-404 |
| dc.subject.none.fl_str_mv | Wrapper induction, data extractor, wrapper verifier, sequence labeling, CRFs model, JSON data extraction. |
| dc.title.none.fl_str_mv | GenDE: A CRF-Based Data Extractor |
| dc.type.none.fl_str_mv | Article |
| description | Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks: wrapper verification or extractor generation. A wrapper verifier checks whether a new page from a site complies with the detected schema, so that the extractor can use the wrapper to get instances of the schema types. If the wrapper fails on the new page, a new wrapper/schema is regenerated by calling an unsupervised wrapper induction system. In this paper, a new data extractor called GenDE is proposed. It verifies the site schema and extracts data from Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking down an observation sequence (a Web page) into simpler subsequences that are labeled using CRFs. Moreover, the system addresses automatic data extraction from modern JavaScript sites in which data/schema are attached (on the client side) in JSON format. The experiments show encouraging results: GenDE outperforms the CSP-based extractor algorithm (95% recall and 96% precision). It also performs well on the SWDE benchmark dataset (84.91%). |
| id | budr_59041614dd5f3be26fbfa4fb0a6952be |
| identifier_str_mv | Kayed, M. and Shaalan, K. (2020) “GenDE: A CRF-Based Data Extractor,” Journal of Web Engineering, 19(3-4), pp. 371–404. 1540-9589, 1544-5976 |
| language_invalid_str_mv | en |
| network_acronym_str | budr |
| network_name_str | The British University in Dubai repository |
| oai_identifier_str | oai:bspace.buid.ac.ae:1234/3044 |
| publishDate | 2020 |
| publisher.none.fl_str_mv | River Publishers |
| title | GenDE: A CRF-Based Data Extractor |
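
The description mentions JavaScript sites in which data/schema are attached on the client side in JSON format. As a rough illustration of that setting only (not GenDE's method, which the record does not detail), the sketch below pulls JSON payloads out of `<script type="application/json">` elements using the Python standard library. The class name `EmbeddedJSONParser`, the helper `extract_embedded_json`, the choice of `type="application/json"` as the marker, and the sample page are all assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONParser(HTMLParser):
    """Collects parsed bodies of <script type="application/json"> elements.

    Hypothetical helper for illustration; not part of GenDE.
    """

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for script tags that declare a JSON payload.
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so accumulate it.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip payloads that are not valid JSON


def extract_embedded_json(html):
    """Return a list of decoded JSON payloads found in the page."""
    parser = EmbeddedJSONParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


# Made-up sample page: one JSON payload and one ordinary script.
SAMPLE_PAGE = """
<html><body>
<script type="application/json">{"title": "Example Book", "price": 12.5}</script>
<script>console.log("not data");</script>
</body></html>
"""

records = extract_embedded_json(SAMPLE_PAGE)
print(records)
```

Real sites attach JSON in many other ways (inline variable assignments, separate API calls), so a production extractor would need more than this single pattern.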