XBeGene: Scalable XML Documents Generator by Example Based on Real Data

Bibliographic Details
Main Author: Harazaki, Manami (author)
Other Authors: Tekli, Joe (author), Yokoyama, Shohei (author), Fukuta, Naoki (author), Chbeir, Richard (author), Ishikawa, Hiroshi (author)
Format: conferenceObject
Published: 2012
Online Access: http://hdl.handle.net/10725/5869
http://dx.doi.org/10.1007/978-3-642-28807-4_63
http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php
https://link.springer.com/chapter/10.1007/978-3-642-28807-4_63
Description
Summary: XML datasets of various sizes and properties are needed to evaluate the correctness and efficiency of XML-based algorithms and applications. While several downloadable datasets can be found online, these are predefined by system experts and might not be suitable to evaluate every algorithm. Tools for generating synthetic XML documents offer an alternative solution, promoting flexibility and adaptability in generating synthetic document collections. Nonetheless, the usefulness of existing XML generators remains rather limited due to the restricted levels of expressiveness allowed to users. In this paper, we develop a novel XML By example Generator (XBeGene) for producing synthetic XML data which closely reflect the user's requirements. Inspired by the query-by-example paradigm in information retrieval, our generator system i) allows the user to provide her own sample XML documents as input, ii) analyzes the structure, occurrence frequencies, and content distributions for each XML element in the user input documents, and iii) produces synthetic XML documents which closely match, in both structural and content features, the user's input data. The size of each synthetic document, as well as that of the entire document collection, is also specified by the user. Clustering experiments demonstrate high correlation levels between the specified user requirements and the characteristics of the generated XML data, while timing results confirm our approach's scalability to large-scale document collections.
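The three steps in the summary (accept sample documents, analyze per-element structure and content, generate similar documents) can be illustrated with a minimal Python sketch. This is not the XBeGene implementation, whose details the record does not give. It assumes a deliberately simple model: for each element tag it records the child-tag sequences and text values seen in the samples, then samples uniformly from those observations, so frequent patterns are reproduced proportionally. The names `learn_profile`, `generate`, and `max_depth` are hypothetical, and the user-specified size controls mentioned in the summary are omitted.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def learn_profile(samples):
    """Record, per element tag, the child-tag sequences and text values observed."""
    roots = []                    # root tags seen across the samples
    children = defaultdict(list)  # tag -> list of observed child-tag sequences
    texts = defaultdict(list)     # tag -> list of observed non-empty text values
    for xml_text in samples:
        root = ET.fromstring(xml_text)
        roots.append(root.tag)
        for elem in root.iter():
            children[elem.tag].append([child.tag for child in elem])
            text = (elem.text or "").strip()
            if text:
                texts[elem.tag].append(text)
    return roots, children, texts


def generate(profile, rng, max_depth=8):
    """Build one synthetic document by sampling from the observed patterns.

    Assumption: uniform sampling over observations stands in for the
    frequency and content distributions described in the summary.
    """
    roots, children, texts = profile

    def build(tag, depth):
        elem = ET.Element(tag)
        observed_texts = texts.get(tag)
        if observed_texts:
            elem.text = rng.choice(observed_texts)
        # Stop descending at max_depth so recursive schemas still terminate.
        if depth < max_depth:
            for child_tag in rng.choice(children[tag]):
                elem.append(build(child_tag, depth + 1))
        return elem

    return build(rng.choice(roots), 0)


SAMPLES = [
    "<library><book><title>A</title><year>2001</year></book></library>",
    "<library><book><title>B</title></book>"
    "<book><title>C</title><year>2010</year></book></library>",
]

profile = learn_profile(SAMPLES)
doc = generate(profile, random.Random(0))
print(ET.tostring(doc, encoding="unicode"))
```

Passing a seeded `random.Random` keeps generation reproducible. A collection of a requested size could be produced by calling `generate` repeatedly with the same profile.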