Wednesday 27 December 2023

Cleanly terminating threads nested in processes in Python

The following code cleanly terminates threads nested inside processes in Python. The processes are started using a ProcessPoolExecutor. To shut the pool down cleanly, the threads must be signalled to terminate via a threading.Event. The SIGINT signal is captured in the parent and all child processes so that pressing ctrl + c does not raise a KeyboardInterrupt, which would leave threads or processes in an unclean state. The futures of the processes containing threads are then waited on with wait. Finally, now that all tasks submitted to the pool are complete, the pool can be shut down cleanly with shutdown.

import concurrent.futures
import os
import signal
import threading
import time

terminate = threading.Event()

def sigint(sig, frame):
    print(f"SIGINT received by {os.getpid()} -> setting terminate")
    terminate.set()

def thread_worker():
    tid = threading.get_ident()
    pid = os.getpid()
    while not terminate.is_set():
        print(f"sleeping in thread {tid} in process {pid}")
        time.sleep(1)
    print(f"{thread_worker} in thread {tid} in process {pid} finished cleanly")

def process_worker():
    # Each child process installs its own SIGINT handler and has its own
    # copy of the terminate event.
    signal.signal(signal.SIGINT, sigint)
    t = threading.Thread(target=thread_worker, daemon=True)
    t.start()
    t.join()
    print(f"{process_worker} in process {os.getpid()} finished cleanly")

if __name__ == "__main__":
    # Install the handler in the parent before the pool starts workers.
    signal.signal(signal.SIGINT, sigint)
    pool = concurrent.futures.ProcessPoolExecutor(max_workers=2)
    futures = [pool.submit(process_worker) for _ in range(3)]
    # Wait for every submitted task to finish cleanly.
    concurrent.futures.wait(futures, return_when=concurrent.futures.ALL_COMPLETED)
    print(futures)
    # All tasks are complete, so the pool can now be shut down cleanly.
    pool.shutdown()
    print("parent exited cleanly")
    print("parent exited cleanly")

This produces the following output when run, with ctrl + c pressed after the threads have printed to the console a second time.

$ python3 pool.py
sleeping in thread 6155104256 in process 1788
sleeping in thread 6185136128 in process 1789
sleeping in thread 6155104256 in process 1788
sleeping in thread 6185136128 in process 1789
^CSIGINT received by 1789 -> setting terminate
SIGINT received by 1788 -> setting terminate
SIGINT received by 1786 -> setting terminate
<function thread_worker at 0x1025d0790> in thread 6155104256 in process 1788 finished cleanly
<function thread_worker at 0x10092c790> in thread 6185136128 in process 1789 finished cleanly
<function process_worker at 0x1025d0820> in process 1788 finished cleanly
<function process_worker at 0x10092c820> in process 1789 finished cleanly
<function thread_worker at 0x1025d0790> in thread 6155104256 in process 1788 finished cleanly
<function process_worker at 0x1025d0820> in process 1788 finished cleanly
[<Future at 0x10078ea60 state=finished returned NoneType>, <Future at 0x1007a85b0 state=finished returned NoneType>, <Future at 0x1007a8a00 state=finished returned NoneType>]
parent exited cleanly
$

Sunday 31 May 2015

Web Scale Document Clustering: Clustering 733 Million Web Pages

Document clustering analyses written language in unstructured text to place documents into topically related groups, clusters, or topics. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an unsupervised manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All semantic information is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together. Documents are often represented in the vector space model and similarity is compared using geometric measures such as Euclidean distance or cosine similarity.
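As an illustration of the geometric comparison mentioned above (my own minimal sketch, not code from the work described here), cosine similarity between two bag-of-words term-frequency vectors can be computed like this:

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two documents as term-frequency vectors."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)  # dot product over shared terms only
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("web page about cats", "a page about cats"))
```

Documents sharing many terms score near 1, documents sharing none score 0, which is exactly the notion of similarity a clustering algorithm optimises.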

I introduced an approach to document clustering using TopSig document signatures and K-tree in a previous post on large scale document clustering. This post highlights the progress made since then with the Parallel Streaming Signature EM-tree algorithm, implemented in the LMW-tree C++ template library. It is now possible to cluster nearly 1 billion documents into nearly 1 million clusters on a single mid-range machine with 16 cores and 64GB of memory in under 1 day. I am not aware of any other approaches reporting document clustering at this scale, let alone on a single machine. Other approaches, clustering lower dimensional dense vectors, used 2,000 to 16,000 cores on large compute clusters, and many produced far fewer clusters.

TopSig is a particular model that allows efficient representation and comparison of similarity of natural language documents. TopSig extends random indexing to produce bit vectors representing documents. These bit vectors are then compared using Hamming distance to measure the similarity between documents. Random indexing is an incremental construction of a random projection. Every document can be indexed in isolation, leading to linear scalability for parallel and distributed implementations.
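As a rough illustration of the idea (my own simplified sketch, not the actual TopSig implementation; real signatures are much wider and the indexing details differ), random indexing with sign binarization might look like:

```python
import hashlib
import random

DIM = 64  # signature width in bits; real TopSig signatures are much wider

def term_vector(term: str, dim: int = DIM) -> list[int]:
    # Deterministic sparse random ternary vector for a term: a few +1/-1
    # entries, zero elsewhere. Seeding from a stable hash of the term means
    # any document can be indexed in isolation, with no global vocabulary.
    seed = int.from_bytes(hashlib.md5(term.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [0] * dim
    for _ in range(max(1, dim // 8)):
        vec[rng.randrange(dim)] += rng.choice((-1, 1))
    return vec

def signature(text: str, dim: int = DIM) -> int:
    # Sum the term vectors, then keep only the sign of each dimension,
    # packing the result into an integer bit vector.
    acc = [0] * dim
    for term in text.lower().split():
        for i, v in enumerate(term_vector(term, dim)):
            acc[i] += v
    return sum(1 << i for i, v in enumerate(acc) if v > 0)
```

Because each term's random vector is generated independently from the term itself, documents can be signed in parallel with no shared state, which is the source of the linear scalability mentioned above.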

In the original TopSig paper I introduced an algorithm to cluster bit vectors directly. All cluster centers and points are bit vectors, which can be compared 64 bits at a time on modern processors. While document signatures have often been used in similarity search for tasks like near-duplicate detection, few clustering algorithms work directly with these computationally efficient compressed representations. The approach led to efficiency gains of 10 to 100 times over the traditional k-means algorithm implemented in C in CLUTO. When using 4096-bit vectors there was no reduction in cluster quality with respect to human generated categorizations. Furthermore, this was an unoptimized, single-threaded version in Java. There were certainly gains to be made via parallelization, low level optimization in a native language, streaming implementations, and tree based algorithms like K-tree or EM-tree. The result is the LMW-tree template library written in C++ and the Parallel Streaming Signature EM-tree algorithm, recently presented at the 24th International World Wide Web Conference.
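The bitwise comparison described above can be sketched in Python (the C++ code in LMW-tree works on 64-bit words with a hardware popcount; this is only an illustration):

```python
def hamming(sig_a: int, sig_b: int) -> int:
    # XOR leaves a 1 bit wherever the two signatures differ; counting those
    # bits gives the Hamming distance. On Python 3.10+ the same count is
    # (sig_a ^ sig_b).bit_count(), which maps to a popcount instruction;
    # bin(...).count("1") is the portable form used here.
    return bin(sig_a ^ sig_b).count("1")

print(hamming(0b1010, 0b0110))  # 2 differing bits
```

A 4096-bit signature is just 64 such words, so one distance computation is a few dozen XORs and popcounts rather than thousands of floating point operations.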

The Parallel Streaming Signature EM-tree algorithm can scale in a distributed setting because the tree is immutable while inserts are performed. Updates to the tree happen afterwards and are 5000 times faster than the insert step. The update operation is currently serial but could be parallelized if required. However, given Amdahl's law, there can be up to 5000 parallel inserts before the update operation becomes a bottleneck. Implementing a distributed algorithm is one direction for future work related to this research. I think it would be exciting to demonstrate the clustering of the entire searchable web of 50 billion web pages into millions of clusters or topics.
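The Amdahl's law argument above can be made concrete with a back-of-the-envelope model (my own sketch, normalizing one full insert pass to a cost of 1, with the serial update costing 1/5000 of that):

```python
def relative_time(workers: int, update_speedup: int = 5000) -> float:
    # Insert work is spread over the workers; the update step stays serial
    # at 1/update_speedup of the insert cost.
    return 1 / workers + 1 / update_speedup

# Adding workers keeps helping until the serial update dominates:
for p in (1, 100, 1000, 5000, 10000):
    print(p, relative_time(p))
```

At 5000 workers the parallel insert time equals the serial update time, so beyond that point additional workers yield rapidly diminishing returns.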

The EM-tree algorithm can cluster the 733 million documents of the ClueWeb12 web crawl into 600,000 clusters on a single mid-range machine. There is a vast diversity of topics expressed as natural language on the web, which motivates such a large number of clusters. We also evaluated the quality of these clusterings using human relevance judgements for search queries and spam classifications; the evaluation highlighted that a large number of clusters produces higher quality clusterings.

Such a large number of clusters can be used for many different tasks, including improving search quality and efficiency, automated link detection, document classification, and representation learning. Furthermore, such approaches can be applied to domains outside of natural language and information retrieval. The computer vision community has demonstrated the utility of highly scalable unsupervised learning algorithms in several pieces of research.

In summary, we were able to scale document clustering to large data sets by using representations and algorithms that work well on modern computer architectures, and the resulting much larger model improved cluster quality over previous approaches. For more details please refer to the Parallel Streaming Signature EM-tree paper and the implementation in the LMW-tree library.

Sunday 26 October 2014

Large Scale Document Clustering: Clustering and Searching 50 Million Web Pages

Document clustering analyses written language in unstructured text to place documents into topically related groups or clusters. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an unsupervised manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All semantic information is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together.

The cluster hypothesis stated by van Rijsbergen in 1979 "asserts that two documents that are similar to each other have a high likelihood of being relevant to the same information need". As document clustering places similar documents in the same cluster, the cluster hypothesis supports the notion that only a small fraction of document clusters need to be searched to fulfil a user's information need when a query is submitted to a search engine.

A section on exploiting large scale document clustering and the cluster hypothesis to produce a more efficient search engine is available in my PhD thesis, titled "Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering". The work brings together two lines of related research. It builds upon the evaluation of document clustering started at the INEX XML Mining track, and it extends work on the K-tree data structure and algorithm, presenting for the first time the TopSig K-tree, which works with the binary document signatures produced by TopSig. This allows the 50 million web page ClueWeb09 Category B document collection used at the TREC Web Track to be clustered into approximately 140,000 clusters in 10 hours using a single thread of Java code. To the best of my knowledge, no other approaches clustering 50 million documents without sampling have been described in the literature. There are many opportunities to reduce the time required to cluster via parallel and distributed processing, low level optimisation in a native programming language, and shorter document signatures.

The clusters produced by TopSig K-tree have been used with a new cluster ranking approach based upon Okapi BM25 that combines document weights to represent clusters. I have called it CBM625 because it squares BM25 term-document weights and combines them to rank clusters. The final result is that this approach searches 13-fold fewer documents than the previous best reported approach on the ClueWeb09 Category B 50 million document collection. The theoretical clustering evaluation at INEX suggested that fine grained clusters allow better ranking of clusters for collection selection. It used relevance judgements to place documents relevant to a query in an optimal order with respect to a document clustering, which represents an upper bound for any collection selection approach given the same clustering of documents. These new experimental results demonstrate the effectiveness of fine grained document clustering using a large scale clustering algorithm able to produce approximately 140,000 clusters, a collection selection approach, and a final ranking of documents by a search engine. The results were evaluated using queries 1-50 from the TREC 2009 Web Track. Only the 8 most highly ranked of the 140,000 document clusters need to be searched to ensure there is no statistically significant difference in retrieval quality.
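The core idea of CBM625 can be sketched as follows (my own illustrative sketch; the exact formula is in the thesis, and the weights and identifiers below are made up for the example):

```python
from collections import defaultdict

def rank_clusters(bm25_weights: dict, cluster_of: dict) -> list:
    # bm25_weights: {(term, doc): BM25 weight} for the query terms.
    # cluster_of:   {doc: cluster id} from the document clustering.
    # Each document contributes the square of its BM25 term weight to its
    # cluster's score; clusters are returned best-first.
    scores = defaultdict(float)
    for (term, doc), w in bm25_weights.items():
        scores[cluster_of[doc]] += w * w
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical query weights and clustering:
weights = {("cat", "d1"): 2.0, ("cat", "d2"): 1.0, ("dog", "d3"): 3.0}
clusters = {"d1": "c1", "d2": "c1", "d3": "c2"}
print(rank_clusters(weights, clusters))  # c2 (score 9.0) before c1 (score 5.0)
```

Only the documents in the few top-ranked clusters then need to be searched by the full retrieval pipeline.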

I have made the software and some of the data available from this paper at http://sourceforge.net/projects/ktree/files/docclust_ir/.

All of the software required to replicate the experiments is contained in docclust_ir.tar.gz. This includes the ATIRE search engine, the TopSig search engine, the version of TopSig K-tree described in the paper, and the collection selection approach that ranks documents using the clusterings produced by TopSig K-tree. It also includes the scripts to run everything. This is undocumented, messy, rushed research code, but I am happy to help you with any problems you run into using it. You will also need to obtain the INEX 2009 XML Wikipedia and ClueWeb09 Category B document collections.

Furthermore, the clusters of the INEX 2009 Wikipedia and ClueWeb09 Category B collections are available, along with the document signatures used to create them.

This brings together 4 years' worth of research on document clustering evaluation and algorithms. I wish to thank everyone who has supported me along the way!

Saturday 17 May 2014

Bash oneliner - say random quotes from the internet at random intervals

Here is some weekend fun: speaking random quotes from the internet at random intervals. This has only been tested on my Macbook, and I am not sure whether the say utility exists on other platforms.


while true; do say "$(curl -s "http://www.quotedb.com/quote/quote.php?action=random_quote" | head -n 1 | sed "s/document.write('//" | sed "s/<br>');//")"; sleep $((RANDOM%60)); done

Sunday 23 June 2013

Minimal Test Collection (MTC) Evaluation Utility

I have been using mtc-eval from the TREC 2009 Web Track homepage and I had trouble getting it to run without it crashing with segmentation faults. I found a newer version on the author's web page and it fixed the problems I was experiencing. Also, the GNU Scientific Library that this software depends on will install without the LAPACK and BLAS dependencies, so remember to install the lapack, lapack-devel, atlas, atlas-devel, blas and blas-devel packages found on most Linux distributions.

Saturday 15 June 2013

ClusterEval 1.0 Released

Today I have released ClusterEval 1.0. This program compares a clustering to a ground truth set of categories according to multiple measures. It also includes a novel approach called 'Divergence from a Random Baseline' that augments existing measures to correct for ineffective clusterings. It has been used in the evaluation of clustering at the INEX XML Mining track in 2009 and 2010, and will be used in the upcoming Social Event Detection task at MediaEval in 2013. It implements cluster quality metrics based on ground truths, such as Purity, Entropy, Negentropy, F1 and NMI.
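As an example of the simplest of these metrics (my own sketch, not ClusterEval's implementation), Purity is the fraction of documents that fall in the majority ground-truth category of their cluster:

```python
from collections import Counter

def purity(clusters: dict, labels: dict) -> float:
    # clusters: {cluster id: [doc ids]}; labels: {doc id: category}.
    total = sum(len(docs) for docs in clusters.values())
    majority = sum(
        Counter(labels[d] for d in docs).most_common(1)[0][1]
        for docs in clusters.values()
    )
    return majority / total

# Hypothetical clustering of five documents over categories x and y:
clusters = {"c1": ["a", "b", "c"], "c2": ["d", "e"]}
labels = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
print(purity(clusters, labels))  # (2 + 2) / 5 = 0.8
```

Purity rewards homogeneous clusters but is trivially maximized by putting every document in its own cluster, which is exactly the kind of ineffective clustering the Divergence from a Random Baseline correction is designed to penalize.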

Further details describing the use and functionality of this software are available in the manual.

Complete details of the quality measures can be found in the paper 'Document Clustering Evaluation: Divergence from a Random Baseline'.

The Social Event Detection task at MediaEval involves the automated detection of social events in real life social networks. If this sounds of interest to you, head over to the task description page and register.

Tuesday 2 April 2013

The 2013 Social Event Detection Task

The task description for the Social Event Detection Task at MediaEval 2013 has been released. The task involves supervised clustering of events from real social media networks.

Some previous work on clustering evaluation, which came from my involvement in the INEX XML Mining track in 2009 and 2010 and which I described in the paper "Document Clustering Evaluation: Divergence from a Random Baseline", is being used during the evaluation.

If this sounds of interest to you, head over to the task description page and register!