Sunday, 26 October 2014

Large Scale Document Clustering: Clustering and Searching 50 Million Web Pages

Document clustering analyses written language in unstructured text to place documents into topically related groups or clusters. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an unsupervised manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All semantic information is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together.
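As a concrete illustration of a similarity measure (a toy sketch, not the signature-based approach used in my thesis), documents are often compared by the cosine of the angle between their term-frequency vectors:

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents over their word counts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # dot product over the terms the two documents share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

A clustering algorithm optimises an objective built from pairwise similarities like this, so that documents sharing vocabulary end up in the same cluster.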

The cluster hypothesis, stated by van Rijsbergen in 1979, "asserts that two documents that are similar to each other have a high likelihood of being relevant to the same information need". As document clustering places similar documents in the same cluster, the cluster hypothesis supports the notion that only a small fraction of document clusters need to be searched to fulfil a user's information need when submitting a query to a search engine.

A section on exploiting large scale document clustering and the cluster hypothesis to produce a more efficient search engine is available in my PhD thesis, titled "Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering". The work brings together two lines of related research. It builds upon the evaluation of document clustering started at the INEX XML Mining track. It also extends work with the K-tree data structure and algorithm, presenting for the first time the TopSig K-tree that works with binary document signatures produced by TopSig. This allows the 50 million web page ClueWeb09 Category B document collection used at the TREC Web Track to be clustered in 10 hours into approximately 140,000 clusters using a single thread of Java code. To the best of my knowledge, no approach described in the literature clusters 50 million documents without resorting to sampling. There are many opportunities to reduce the time required to cluster via parallel and distributed processing, low level optimisation in a native programming language, and shorter document signatures.
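Because TopSig represents each document as a fixed-width binary signature, comparing two documents during clustering reduces to a Hamming distance between bit strings. Here is a hedged sketch of the idea (not the TopSig K-tree code itself), using Python integers as bit vectors:

```python
def hamming_distance(sig_a, sig_b):
    """Number of differing bits between two equal-width binary signatures."""
    return bin(sig_a ^ sig_b).count('1')

def nearest_centroid(signature, centroids):
    """Index of the centroid signature closest in Hamming distance,
    roughly how a document is assigned to a node when descending a K-tree."""
    return min(range(len(centroids)),
               key=lambda i: hamming_distance(signature, centroids[i]))
```

The XOR-and-popcount comparison is what makes signature-based clustering cheap enough to run over tens of millions of documents in a single thread.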

The clusters produced by TopSig K-tree have been used with a new cluster ranking approach based upon Okapi BM25 that combines document weights to represent clusters. I have called it CBM625 as it squares BM25 term-document weights and combines them to rank clusters. The final result is that this approach is able to search 13-fold fewer documents than the previous best reported approach on the ClueWeb09 Category B 50 million document collection. The theoretical clustering evaluation at INEX suggested that fine-grained clusters allow better ranking of clusters for collection selection. It used relevance judgements to place documents relevant to a query in an optimal order with respect to a document clustering, which represents an upper bound for any collection selection approach given the same clustering of documents. These new experimental results demonstrate the effectiveness of fine-grained document clustering using a large scale clustering algorithm that is able to produce approximately 140,000 clusters, a collection selection approach, and a final ranking of documents by a search engine. The results were evaluated using queries 1-50 from the TREC 2009 Web Track. Only the 8 most highly ranked of the 140,000 document clusters need to be searched to ensure there is no statistically significant difference in retrieval quality.
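To make the CBM625 idea concrete, here is a toy sketch of ranking clusters by summing squared BM25 term-document weights, as described above. The data layout and helper names are my own for illustration; the actual details are in the thesis:

```python
def score_clusters(query_terms, clusters):
    """Rank clusters by summing squared BM25 term-document weights.
    `clusters` maps a cluster id to a list of per-document dicts of
    BM25 weights, e.g. {'c1': [{'web': 1.2, 'page': 0.4}, ...]}."""
    scores = {}
    for cid, docs in clusters.items():
        # square each term-document weight before combining, so strongly
        # matching documents dominate the cluster's score
        scores[cid] = sum(doc.get(t, 0.0) ** 2
                          for doc in docs for t in query_terms)
    # highest-scoring clusters first
    return sorted(scores, key=scores.get, reverse=True)
```

Only the top-ranked clusters are then handed to the search engine, which is how searching 8 of 140,000 clusters becomes possible.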

I have made the software and some of the data available from this paper at

All of the software required to replicate the experiments is contained in docclust_ir.tar.gz. This includes the ATIRE search engine, the TopSig search engine, the version of TopSig K-tree described in the paper, and the collection selection approach that ranks documents using the clusterings produced by TopSig K-tree. It also includes the scripts to run everything. This is undocumented, messy, rushed research code, but I am happy to help you with any problems you run into using it. You will also need to obtain the INEX 2009 XML Wikipedia and the ClueWeb09 Category B document collections.

Furthermore, the clusterings of the INEX 2009 Wikipedia and ClueWeb09 Category B collections are available, along with the document signatures used to create them.

This brings together four years' worth of research on document clustering evaluation and algorithms. I wish to thank everyone that has supported me along the way!

Saturday, 17 May 2014

Bash oneliner - say random quotes from the internet at random intervals

Here is some weekend fun: speaking random quotes from the internet at random intervals. Only tested on my MacBook. Not sure if the say utility exists for other platforms.

while true; do say `curl -s | head -n 1 | sed s/document.write\(\'//g | sed s/\<br\>\'\)\;//g`; sleep $((RANDOM%60)); done

Sunday, 23 June 2013

Minimal Test Collection (MTC) Evaluation Utility

I have been using mtc-eval from the TREC 2009 Web Track homepage and had trouble getting it to run; it kept crashing with segmentation faults. I found a newer version on the author's web page which fixed the problems I was experiencing. Also, the GNU Scientific Library that this software depends on will install without the LAPACK and BLAS dependencies, so remember to install the lapack, lapack-devel, atlas, atlas-devel, blas and blas-devel packages found on most Linux distributions.

Saturday, 15 June 2013

ClusterEval 1.0 Released

Today I have released ClusterEval 1.0. This program compares a clustering to a ground truth set of categories according to multiple different measures. It also includes a novel approach called 'Divergence from a Random Baseline' that augments existing measures to correct for ineffective clusterings. It has been used in the evaluation of clustering at the INEX XML Mining track at INEX in 2009 and 2010, and the upcoming Social Event Detection task at MediaEval in 2013. It implements cluster quality metrics based on ground truths such as Purity, Entropy, Negentropy, F1 and NMI.
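Purity, for instance, credits each cluster with its most frequent ground-truth category. A minimal sketch of that measure (my own illustration, not ClusterEval's implementation):

```python
from collections import Counter

def purity(clusters, labels):
    """Purity of a clustering against ground-truth categories.
    `clusters` is a list of lists of item ids; `labels` maps id -> category.
    Each cluster contributes the size of its majority category."""
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters)
    return majority / total
```

Note that purity alone rewards degenerate clusterings of many tiny clusters, which is exactly the kind of ineffectiveness the 'Divergence from a Random Baseline' correction is designed to expose.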

Further details describing the use and functionality of this software are available in the manual.

Complete details of the quality measures can be found in the paper 'Document Clustering Evaluation: Divergence from a Random Baseline'.

The Social Event Detection task at MediaEval involves automated detection of social events from real life social networks. If this sounds of interest to you, head over to the task description page and register.

Tuesday, 2 April 2013

The 2013 Social Event Detection Task

The task description for the Social Event Detection Task at MediaEval 2013 has been released. The task involves supervised clustering of events from real social media networks.

Previous work on clustering evaluation, which came from my involvement in the INEX XML Mining track in 2009 and 2010 and which I described in the paper "Document Clustering Evaluation: Divergence from a Random Baseline", is being used during the evaluation.

If this sounds of interest to you, head over to the task description page, and register!

Friday, 8 June 2012

RIP Spunky

I have known Spunky the dog since my family received him as a puppy 15 years ago. He was always happy and full of energy. I would often catch him smiling :)

You can see him chasing things in an earlier post.

He passed away today :(

Tuesday, 29 May 2012

Opening a TAR file in Python containing millions of files

I was recently opening a tarball containing millions of XML files. I got about half way through parsing the files when my VM ran out of memory and came to a grinding halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the suspects: the tarfile and lxml modules I was using to process the data.

I initially thought this may be because I was not closing the buffer created by the TarFile.extractfile() function. This was not the case.

It turns out that the TarFile class keeps cached copies of TarInfo metadata in a variable called members. This caching behaviour is undocumented; a workaround has been described by Alexander Dutton. The workaround is to set the members variable to an empty list after processing each file.

import tarfile

tar = tarfile.open('large.tar.gz', 'r:gz')
for tarinfo in tar:
  # open the file from the archive as an in-memory buffer
  buf = tar.extractfile(tarinfo)
  if buf is None:
    continue  # directories and special entries have no data
  for line in buf:
    pass  # do something with the line
  # free the cached TarInfo objects held by the TarFile object
  tar.members = []
tar.close()