Sunday, 26 October 2014

Large Scale Document Clustering: Clustering and Searching 50 Million Web Pages

Document clustering analyses written language in unstructured text to place documents into topically related groups or clusters. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an unsupervised manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All semantic information is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together.

The cluster hypothesis stated by van Rijsbergen in 1979, "asserts that two documents that are similar to each other have a high likelihood of being relevant to the same information need". As document clustering places similar documents in the same cluster, the cluster hypothesis supports the notion that only a small fraction of document clusters need to be searched to fulfil a users information need when submitting a query to a search engine.

A section on exploiting large scale document clustering and the cluster hypothesis to produce a more efficient search engine is available in my PhD thesis, and it is titled "Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering". The work brings together of two lines of related research. It builds upon the evaluation of document clustering started at the INEX XML Mining track. It also extends work with the K-tree data structure and algorithm, presenting for the first time the TopSig K-tree that works with binary document signatures produced by TopSig. It allows the 50 million web page ClueWeb09 Category B document collection used at the TREC Web Track to be clustered in 10 hours into approximately 140,000 clusters using a single thread of Java code. To the best of my knowledge such approaches clustering 50 million documents are not described in the literature without using sampling. There are many opportunities to reduce the time required to cluster via parallel and distributed processing, low level optimisation in a native programming language and using shorter document signatures.

The clusters produced by TopSig K-tree have been used with a new cluster ranking approach based upon Okapi BM25 that combines document weights to represent clusters. I have called it CBM625 as it squares BM25 term-document weights and combines them to rank clusters. The final result is that this approach is able to search 13 fold less documents than the previous best reported approach on the ClueWeb09 Category B 50 million document collection. The theoretical clustering evaluation at INEX suggested that fine grained clusters allow better ranking of clusters for collection selection.  It used relevance judgements to place documents relevant to a query in an optimal order with respect to a document clustering, which represents an upper bound for any collection selection approach given the same clustering of documents. These new experimental results demonstrate the effectiveness of fine grain document clustering using a large scale clustering algorithm that is able to produce approximately 140,000 clusters, a collection selection approach and a final ranking of documents by a search engine. The results were evaluated using queries 1-50 from the TREC 2009 Web Track. Only the first 8 most highly ranked of the total 140,000 document clusters need to be searched to ensure there is no statistically significant difference in retrieval quality.

I have made the software and some of the data available from this paper at

All of the software required to replicate the experiments is available is contained is docclust_ir.tar.gz. This includes the ATIRE search engine, TopSig search engine, the version of TopSig K-tree as described in the paper, and the collection selection approach that ranks documents using the clusterings produced by TopSig K-tree. It also includes the scripts to run everything. This code is undocumented, messy, rushed, research code but I am happy to help you with any problems you run into using it. You will also need to obtain the INEX 2009 XML Wikipedia and the ClueWeb09 Category B document collections.

Furthermore, the clusters of the INEX 2009 Wikipedia and ClueWeb09 Category B are available and the document signatures used to create the clusterings.

This brings together 4 years worth of research on document clustering evaluation and algorithms. I wish to thank everyone that has supported me along the way!


  1. What approach, framework or system did you use in order to distribute such a workload?

    Thank you

    1. The approaches outlined in this paper do not need a distributed solution to cluster and search the 50 million web pages in the ClueWeb09 Category B collection. The experiments were performed on a single machine. This simulates a distributed system where sub-collections of clusters have to be selected to be searched.

      The document clustering approach has not been parallelised at all. It is still single threaded but is able to cluster all the documents in 10 hours. There is some low hanging fruit in terms of performance gains to implement a parallel and distributed version of TopSig K-tree, or similar clustering algorithm.

      Batch oriented distributed processing frameworks such as MapReduce implemented in Hadoop may not be appropriate for the online and incremental nature of these approaches. I could definitely see the collection selection approach that recommends which clusters of which servers to search being written as a custom piece of software that co-ordinates the machines involved in the search process.

      So, to answer your question directly, I did not use any framework or system to distribute the workload. The results imply that this is possible using document clustering, and it is possible to reduce the number of documents that need to be searched.