Mining XML Document Based on Structure

Mehdi G. Duaimi, Yasir Abd Alhamed


With the growing number of XML documents on the Web, it has become essential to organize these documents effectively so that useful information can be retrieved from them. One possible solution is to cluster the XML documents in order to discover knowledge that supports effective data management, information retrieval, and query processing. This paper presents a framework for clustering XML documents by structure. Modelling XML documents as rooted, ordered, labelled trees, we study the use of structural distance metrics in hierarchical clustering algorithms to detect groups of structurally similar XML documents. We suggest using structural summaries of the trees to improve the performance of the distance calculation while maintaining, or even improving, its quality.

Keywords: XML, tree similarity measure, structural summary, clustering, DTDs

