Sunday 31 May 2015

Web Scale Document Clustering: Clustering 733 Million Web Pages

Document clustering analyses written language in unstructured text to place documents into topically related groups, clusters, or topics. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and pages talking about different concepts are in different clusters. This is performed in an unsupervised manner: there is no manual labelling of the documents with concepts, topics or other semantic information. All semantic information is derived from the documents themselves. The core concept that makes this possible is a similarity measure between two documents. A clustering algorithm optimises this measure so that the most similar documents are placed together. Documents are often represented in the vector space model, and similarity is compared using geometric measures such as Euclidean distance or cosine similarity.
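As a toy illustration of the vector space model, here is a minimal sketch of cosine similarity between term-frequency vectors. The three-word vocabulary and the documents are made up for the example:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors over the made-up vocabulary ["web", "page", "cluster"].
doc1 = [2, 1, 0]
doc2 = [1, 1, 0]
doc3 = [0, 0, 3]

print(cosine_similarity(doc1, doc2))  # close to 1: topically similar
print(cosine_similarity(doc1, doc3))  # 0.0: no terms in common
```

Real document vectors are of course far higher dimensional and very sparse, which is one motivation for the compressed signatures discussed below.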

I introduced an approach to document clustering using TopSig document signatures and K-tree in a previous post on large scale document clustering. This post highlights the progress made since then with the Parallel Streaming Signature EM-tree algorithm, implemented in the LMW-tree C++ template library. It is now possible to cluster nearly 1 billion documents into nearly 1 million clusters on a single mid-range machine with 16 cores and 64GB of memory in under a day. I am not aware of any other approaches reporting document clustering at this scale, let alone on a single machine. Other approaches, which clustered lower dimensional, dense vectors, used 2,000 to 16,000 cores on large compute clusters, and many produced a much smaller number of clusters.

TopSig is a model that allows efficient representation of natural language documents and efficient comparison of their similarity. It extends random indexing to produce bit vectors representing documents. These bit vectors are then compared using Hamming distance to measure the similarity between documents. Random indexing is an incremental construction of a random projection. Every document can be indexed in isolation, leading to linear scalability for parallel and distributed implementations.
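A minimal sketch of the random indexing idea, not the actual TopSig implementation: each term gets a sparse random ternary vector seeded by the term itself, the vectors of a document's terms are summed, and each dimension is thresholded to a bit. The dimensionality, density, and seeding scheme here are illustrative assumptions:

```python
import random

DIM = 64  # signature width in bits; TopSig typically uses 1024 to 4096

def term_vector(term, dim=DIM, density=1 / 8):
    """Sparse random ternary vector for a term. Seeding the generator
    with the term itself makes the vector reproducible, so every
    document can be indexed in isolation on any machine."""
    rng = random.Random(term)
    return [rng.choice((-1, 1)) if rng.random() < density else 0
            for _ in range(dim)]

def signature(tokens, dim=DIM):
    """Sum the term vectors of a document, then threshold to bits."""
    acc = [0] * dim
    for tok in tokens:
        for i, v in enumerate(term_vector(tok, dim)):
            acc[i] += v
    return [1 if x > 0 else 0 for x in acc]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

s1 = signature("clustering web pages by topic".split())
s2 = signature("topic clustering for web pages".split())
s3 = signature("a recipe for apple pie".split())
# Overlapping vocabularies tend to yield a smaller Hamming distance.
print(hamming(s1, s2), hamming(s1, s3))
```

Because each term's vector depends only on the term, no global dictionary needs to be coordinated across indexers, which is what makes the construction incremental.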

In the original TopSig paper I introduced an algorithm to cluster bit vectors directly. All cluster centers and points are bit vectors, which can be compared 64 bits at a time on modern processors. While document signatures have often been used in similarity search for tasks like near duplicate detection, few clustering algorithms work directly with these computationally efficient compressed representations. The approach led to efficiency gains of 10 to 100 times over the traditional k-means algorithm implemented in C in CLUTO. When using 4096-bit vectors there was no reduction in cluster quality with respect to human-generated categorizations. Furthermore, this was an unoptimized, single-threaded implementation in Java. There were certainly gains to be made via parallelization, low-level optimization in a native language, streaming implementations, and tree-based algorithms like K-tree or EM-tree. The result is the LMW-tree template library written in C++ and the Parallel Streaming Signature EM-tree algorithm recently presented at the 24th International World Wide Web Conference.
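To see why clustering bit vectors directly is cheap, here is a sketch of the two core operations of a k-means-style loop over signatures. This is an illustration, not the algorithm from the paper: distance is XOR plus popcount (which a C++ implementation performs 64 bits at a time, one instruction per machine word), and the centroid update takes the bitwise majority vote of cluster members, since the per-bit majority minimises the summed Hamming distance:

```python
# Signatures represented as Python ints used as arbitrary-width bit vectors.

def hamming(a, b):
    """XOR then popcount. In C++ this runs over 64-bit words,
    one __builtin_popcountll per word."""
    return bin(a ^ b).count("1")

def majority_centroid(members, bits):
    """Bitwise majority vote over cluster members. Setting each bit to
    its majority value minimises the total Hamming distance to the
    members, playing the role of the mean in ordinary k-means."""
    counts = [0] * bits
    for m in members:
        for i in range(bits):
            counts[i] += (m >> i) & 1
    centroid = 0
    for i in range(bits):
        if 2 * counts[i] > len(members):
            centroid |= 1 << i
    return centroid

def assign(points, centroids):
    """Nearest-centroid assignment step."""
    return [min(range(len(centroids)), key=lambda j: hamming(p, centroids[j]))
            for p in points]

print(majority_centroid([0b111, 0b101, 0b001], bits=3))  # -> 5 (0b101)
```

Compared with floating-point centroids, both the distance and the update work on packed words, which is where the 10 to 100 times speedup over dense k-means comes from.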

The Parallel Streaming Signature EM-tree algorithm can scale in a distributed setting because the tree is immutable while inserts are performed. Updates to the tree happen afterwards and are 5000 times faster than the insert step. The update operation is currently serial but could be parallelized if required. However, given Amdahl's law, there can be up to 5000 parallel inserts before the update operation becomes a bottleneck. Implementing a distributed algorithm is one direction for future work related to this research. I think it would be exciting to demonstrate the clustering of the entire searchable web of 50 billion web pages into millions of clusters or topics.
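The 5000x figure can be plugged into Amdahl's law directly. Assuming an illustrative split of 1 unit of serial update work per 5000 units of parallelisable insert work, the achievable speedup is bounded as follows:

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Upper bound on speedup when serial_fraction of the total work
    cannot be parallelised (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# Illustrative split suggested by the 5000x figure: 1 unit of serial
# update work for every 5000 units of parallelisable insert work.
f = 1 / 5001
print(amdahl_speedup(f, 16))    # 16 cores: the serial update is negligible
print(amdahl_speedup(f, 5000))  # updates now cost as much as the inserts
```

At 16 cores the serial update barely registers, while at 5000 workers the serial and parallel portions take equal time, which is the point where further parallel inserts stop paying off.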

The EM-tree algorithm can cluster the 733 million documents of the ClueWeb12 web crawl into 600,000 clusters on a single mid-range machine. A vast diversity of topics exists as natural language on the web, creating the need for such a large number of clusters. We also evaluated the quality of these clusterings using human relevance judgements for search queries and spam classifications. The evaluation highlighted that having a large number of clusters produces higher quality clusterings.

Such a large number of clusters can be used for many different tasks, including improving search quality and efficiency, automated link detection, document classification, and representation learning. Furthermore, such approaches can be applied to domains outside natural language and information retrieval. The computer vision community has demonstrated the utility of highly scalable unsupervised learning algorithms in several pieces of research.

In summary, we were able to scale up document clustering to large data sets by using representations and algorithms that work well on modern computer architectures. Using a much larger model than previous approaches also improved cluster quality. For more details please refer to the Parallel Streaming Signature EM-tree paper and the implementation in the LMW-tree library.


  1. Not sure if my previous comment was submitted. But do you have any suggestions for a document clustering algorithm that might be close to the scale of TopSig but where the number of clusters is not specified?

    For example, affinity propagation.

    1. The Geomblog has an interesting post on estimating the number of clusters in a given dataset at

      You might also like to look at a previous publication of mine where I show how an exact optimum can be found for the elbow method when plotting the number of clusters against root mean squared error.

      I also think it would be quite feasible to incorporate an approach like X-means into these algorithms.