<h1>Perplexing Permutations</h1><p><i>the musings of Chris de Vries</i></p>
<h2>Cleanly terminating threads nested in processes in Python (2023-12-27)</h2>
<p>The following code cleanly terminates threads nested inside processes in Python. Processes are started using a <a href="https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor">ProcessPoolExecutor</a>. To shut down the pool cleanly, threads must be signaled to terminate via a threading <a href="https://docs.python.org/3/library/threading.html#event-objects">Event</a>. The SIGINT signal is captured in the parent and all child processes so that a <a href="https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt">KeyboardInterrupt</a> exception is not thrown, which would lead to threads or processes terminating in an unclean state after the user presses ctrl + c, raising the SIGINT <a href="https://en.wikipedia.org/wiki/Signal_(IPC)">signal</a>. All <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future">futures</a> of the processes containing threads are then waited on to complete using <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.wait">wait</a>. Finally, the process pool can be shut down cleanly using <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.shutdown">shutdown</a>, now that all tasks <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.submit">submitted</a> to the pool are complete.<br /></p><p><span style="font-family: courier;">import concurrent.futures<br />import os<br />import signal<br />import threading<br />import time<br /><br /># set by the SIGINT handler to tell worker threads to stop<br />terminate = threading.Event()<br /><br />def sigint(sig, frame):<br />    print(f"SIGINT received by {os.getpid()} -> setting terminate")<br />    terminate.set()<br /><br />def thread_worker():<br />    tid = threading.get_ident()<br />    pid = os.getpid()<br />    while not terminate.is_set():<br />        print(f"sleeping in thread {tid} in process {pid}")<br />        time.sleep(1)<br />    print(f"{thread_worker} in thread {tid} in process {pid} finished cleanly")<br /><br />def process_worker():<br />    # each child process installs its own SIGINT handler and runs one thread<br />    signal.signal(signal.SIGINT, sigint)<br />    t = threading.Thread(target=thread_worker, daemon=True)<br />    t.start()<br />    t.join()<br />    print(f"{process_worker} in process {os.getpid()} finished cleanly")<br /><br />if __name__ == "__main__":<br />    signal.signal(signal.SIGINT, sigint)<br />    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:<br />        futures = [pool.submit(process_worker) for _ in range(3)]<br />        print(futures)<br />        concurrent.futures.wait(futures, return_when=concurrent.futures.ALL_COMPLETED)<br />        pool.shutdown()<br />    print("parent exited cleanly")</span></p><p>This produces the following output when run, with ctrl + c pressed after the threads have printed to the console for the second time.</p><p><span style="font-family: courier;">$ python3 pool.py<br />sleeping in thread 6155104256 in process 1788<br />sleeping in thread 6185136128 in process 1789<br />sleeping in thread 6155104256 in process 1788<br />sleeping in thread 6185136128 in process 1789<br />^CSIGINT received by 1789 -> setting terminate<br />SIGINT received by 1788 -> setting
terminate<br />SIGINT received by 1786 -> setting terminate<br /><function thread_worker at 0x1025d0790> in thread 6155104256 in process 1788 finished cleanly<br /><function thread_worker at 0x10092c790> in thread 6185136128 in process 1789 finished cleanly<br /><function process_worker at 0x1025d0820> in process 1788 finished cleanly<br /><function process_worker at 0x10092c820> in process 1789 finished cleanly<br /><function thread_worker at 0x1025d0790> in thread 6155104256 in process 1788 finished cleanly<br /><function process_worker at 0x1025d0820> in process 1788 finished cleanly<br />[<Future at 0x10078ea60 state=finished returned NoneType>, <Future at 0x1007a85b0 state=finished returned NoneType>, <Future at 0x1007a8a00 state=finished returned NoneType>]<br />parent exited cleanly<br />$</span><br /></p>
<h2>Web Scale Document Clustering: Clustering 733 Million Web Pages (2015-05-31)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://en.wikipedia.org/wiki/Document_clustering" target="_blank">Document clustering</a> analyses written language in unstructured text to place documents into topically related groups, clusters, or topics. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an <a href="http://en.wikipedia.org/wiki/Unsupervised_learning" target="_blank">unsupervised</a> manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All <a href="http://en.wikipedia.org/wiki/Semantic_memory" target="_blank">semantic information</a> is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together. Documents are often represented in the <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">vector space model</a> and similarity is compared using geometric measures such as <a href="http://en.wikipedia.org/wiki/Euclidean_distance" target="_blank">Euclidean distance</a> or <a href="http://en.wikipedia.org/wiki/Cosine_similarity" target="_blank">cosine similarity</a>.<br />
<br />
I introduced an approach to document clustering using <a href="http://eprints.qut.edu.au/43451/" target="_blank">TopSig</a> document signatures and K-tree in a previous post on <a href="http://chris.de-vries.id.au/2013/07/large-scale-document-clustering.html" target="_blank">large scale document clustering</a>. This post highlights the progress made since then in the <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parallel Streaming Signature EM-tree</a> algorithm implemented in the <a href="http://lmwtree.devries.ninja/" target="_blank">LMW-tree</a> C++ template library. It is now possible to cluster nearly 1 billion documents into nearly 1 million clusters on a single mid-range machine with 16 cores and 64GB of memory in under 1 day. I am not aware of any other approaches reporting this scale of document clustering, never mind on a single machine. Other approaches, which cluster lower-dimensional dense vectors, used 2,000 to 16,000 cores on large compute clusters. Many also produced a much smaller number of clusters.<br />
<br />
<a href="http://eprints.qut.edu.au/43451/" target="_blank">TopSig</a> is a particular model that allows efficient representation and comparison of similarity of natural language documents. TopSig extends <a href="http://en.wikipedia.org/wiki/Random_indexing" target="_blank">random indexing</a> to produce bit vectors representing documents. These bit vectors are then compared using <a href="http://en.wikipedia.org/wiki/Hamming_distance" target="_blank">Hamming distance</a> to measure the similarity between documents. Random indexing is an incremental construction of a <a href="http://en.wikipedia.org/wiki/Random_projection" target="_blank">random projection</a>. Every document can be indexed in isolation, leading to linear scalability for parallel and distributed implementations.<br />
<div>
<br /></div>
In the original TopSig <a href="http://eprints.qut.edu.au/43451/" target="_blank">paper</a> I introduced an algorithm to cluster bit vectors directly. All cluster centers and points are bit vectors, which can be compared 64 bits at a time on modern processors. While document signatures have often been used for similarity search in tasks like near duplicate detection, few clustering algorithms work directly with these computationally efficient compressed representations. The approach led to efficiency gains of 10 to 100 times over the traditional k-means algorithm implemented in C in <a href="http://glaros.dtc.umn.edu/gkhome/views/cluto" target="_blank">CLUTO</a>. When using 4096 bit vectors there was no reduction in cluster quality with respect to human generated categorizations. Furthermore, this was an unoptimized single-threaded version in Java. There were certainly gains to be made via parallelization, low level optimization in a native language, streaming implementations, and tree based algorithms like K-tree or EM-tree. The result is the <a href="http://lmwtree.devries.ninja/" target="_blank">LMW-tree</a> template library written in C++ and the <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parallel Streaming Signature EM-tree</a> algorithm recently presented at the <a href="http://www.www2015.it/accepted-papers/" target="_blank">24th International World Wide Web Conference</a>.<br />
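<p>A sketch of what bulk comparison of packed signatures might look like, using numpy rather than the Java or C++ of the actual implementations; the signature contents here are random placeholders.</p>
<span style="font-family: courier;">import numpy as np<br /><br />rng = np.random.default_rng(0)<br /># 1000 signatures of 4096 bits each, packed 8 bits per byte (512 bytes per row)<br />signatures = rng.integers(0, 256, size=(1000, 512), dtype=np.uint8)<br />query = signatures[0]<br /><br /># XOR the query against every signature, then count differing bits per row<br />xor = np.bitwise_xor(signatures, query)<br />distances = np.unpackbits(xor, axis=1).sum(axis=1)<br />print(distances[:5])  # Hamming distances of the first 5 signatures to the query</span><br />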
<br />
The <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parrallel Streaming Signature EM-tree</a> algorithm can scale in a distributed setting because the tree is immutable when performing inserts. Updates to the tree happen afterwards and are 5000 times faster than the insert step. The update operation is currently serial but could be parallelized if required. However, given <a href="http://en.wikipedia.org/wiki/Amdahl's_law" target="_blank">Amdahl's law</a>, there can be upto 5000 parallel inserts before the update operations becomes a bottleneck. Implementing a <a href="https://github.com/cmdevries/LMW-tree/issues/5" target="_blank">distributed algorithm</a> is one direction for future work related to this research. I think it would be exciting to demonstrate the clustering of the entire searchable web of 50 billion web pages into millions of clusters or topics.<br />
<br />
The EM-tree algorithm can cluster the 733 million documents of the <a href="http://www.lemurproject.org/clueweb12.php/" target="_blank">ClueWeb12</a> web crawl into 600,000 clusters on a single mid-range machine. There is a vast diversity of topics expressed as natural language on the web, leading to the need for such a large number of clusters. We also evaluated the quality of these clusterings using human relevance judgements for search queries and spam classifications. This evaluation highlighted that a large number of clusters produces higher quality clusterings.<br />
<br />
Such a large number of clusters can be used for many different tasks, including improving search quality and efficiency, automated link detection, document classification, and representation learning. Furthermore, such approaches can be applied to domains outside of natural language and information retrieval. The computer vision community has demonstrated the utility of highly scalable unsupervised learning algorithms in <a href="http://arxiv.org/pdf/1112.6209.pdf" target="_blank">several</a> pieces of <a href="http://papers.nips.cc/paper/4497-emergence-of-object-selective-features-in-unsupervised-feature-learning" target="_blank">research</a>.<br />
<br />
In summary, we were able to scale up document clustering to large data sets by using representations and algorithms that work well on modern computer architectures. Using a much larger model than previous approaches also improved cluster quality. For more details please refer to the <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parallel Streaming Signature EM-tree paper</a> and the <a href="https://github.com/cmdevries/LMW-tree/blob/master/src/lmw/StreamingEMTree.h" target="_blank">implementation in the LMW-tree library</a>.</div>
<h2>Large Scale Document Clustering: Clustering and Searching 50 Million Web Pages (2014-10-26)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://en.wikipedia.org/wiki/Document_clustering" target="_blank">Document clustering</a> analyses written language in unstructured text to place documents into topically related groups or clusters. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an <a href="http://en.wikipedia.org/wiki/Unsupervised_learning" target="_blank">unsupervised</a> manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All <a href="http://en.wikipedia.org/wiki/Semantic_memory" target="_blank">semantic information</a> is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together.<br />
<br />
The <a href="http://en.wikipedia.org/wiki/Relevance_(information_retrieval)#Clustering_and_relevance" target="_blank">cluster hypothesis</a> stated by van Rijsbergen in 1979, "asserts that two documents that are similar to each other have a high likelihood of being relevant to the same information need". As document clustering places similar documents in the same cluster, the cluster hypothesis supports the notion that only a small fraction of document clusters need to be searched to fulfil a users information need when submitting a query to a search engine.<br />
<div>
<br /></div>
<div>
A section on exploiting large scale document clustering and the cluster hypothesis to produce a more efficient search engine is available in my PhD thesis, titled "<a href="http://eprints.qut.edu.au/75862/" target="_blank">Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering</a>". The work brings together two lines of related research. It builds upon the evaluation of document clustering started at the <a href="http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp" target="_blank">INEX XML Mining track</a>. It also extends work with the <a href="http://ktree.sf.net/" target="_blank">K-tree</a> data structure and algorithm, presenting for the first time the TopSig K-tree, which works with binary document signatures produced by TopSig. It allows the 50 million web page ClueWeb09 Category B document collection used at the <a href="http://trec.nist.gov/data/webmain.html" target="_blank">TREC Web Track</a> to be clustered in 10 hours into approximately 140,000 clusters using a single thread of Java code. To the best of my knowledge, no approach that clusters 50 million documents without sampling has been described in the literature. There are many opportunities to reduce the time required to cluster via parallel and distributed processing, low level optimisation in a native programming language and shorter document signatures.<br />
<br />
The clusters produced by TopSig K-tree have been used with a new cluster ranking approach based upon <a href="http://en.wikipedia.org/wiki/Okapi_BM25" target="_blank">Okapi BM25</a> that combines document weights to represent clusters. I have called it CBM625 as it squares BM25 term-document weights and combines them to rank clusters; a sketch of the idea follows this paragraph. The final result is that this approach is able to search 13-fold fewer documents than the previous best reported approach on the <a href="http://lemurproject.org/clueweb09/" target="_blank">ClueWeb09 Category B</a> 50 million document collection. The theoretical clustering evaluation at INEX suggested that fine-grained clusters allow better ranking of clusters for collection selection. It used <a href="http://en.wikipedia.org/wiki/Relevance_(information_retrieval)" target="_blank">relevance judgements</a> to place documents relevant to a query in an optimal order with respect to a document clustering, which represents an upper bound for any collection selection approach given the same clustering of documents. These new experimental results demonstrate the effectiveness of fine-grained document clustering using a large scale clustering algorithm able to produce approximately 140,000 clusters, a collection selection approach and a final ranking of documents by a search engine. The results were evaluated using queries 1-50 from the <a href="http://trec.nist.gov/data/web09.html" target="_blank">TREC 2009 Web Track</a>. Only the 8 most highly ranked of the 140,000 document clusters need to be searched to ensure there is no statistically significant difference in retrieval quality.<br />
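<p>A minimal sketch of that cluster ranking idea, assuming BM25 term-document weights have already been computed; the data structures and names here are mine for illustration and are not taken from the thesis code:</p>
<span style="font-family: courier;">from collections import defaultdict<br /><br />def rank_clusters(query_terms, bm25, doc_cluster):<br />    # bm25 maps (term, doc) to a BM25 weight; doc_cluster maps doc to cluster id<br />    scores = defaultdict(float)<br />    for (term, doc), weight in bm25.items():<br />        if term in query_terms:<br />            # square the term-document weight and accumulate it per cluster<br />            scores[doc_cluster[doc]] += weight * weight<br />    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)<br /><br />bm25 = {("ktree", "d1"): 2.1, ("ktree", "d2"): 0.4, ("bm25", "d3"): 1.7}<br />doc_cluster = {"d1": "c1", "d2": "c2", "d3": "c1"}<br />print(rank_clusters({"ktree", "bm25"}, bm25, doc_cluster))</span><br />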
<br />
I have made the software and some of the data from this paper available at <a href="http://sourceforge.net/projects/ktree/files/docclust_ir/" target="_blank">http://sourceforge.net/projects/ktree/files/docclust_ir/</a>.<br />
<br />
All of the software required to replicate the experiments is contained in docclust_ir.tar.gz. This includes the <a href="http://www.atire.org/" target="_blank">ATIRE</a> search engine, the <a href="http://topsig.googlecode.com/" target="_blank">TopSig</a> search engine, the version of TopSig K-tree described in the paper, and the collection selection approach that ranks documents using the clusterings produced by TopSig K-tree. It also includes the scripts to run everything. This is undocumented, messy, rushed research code, but I am happy to help you with any problems you run into using it. You will also need to obtain the INEX 2009 XML Wikipedia and the ClueWeb09 Category B document collections.<br />
<br />
Furthermore, the clusters of the INEX 2009 Wikipedia and ClueWeb09 Category B collections are available, along with the document signatures used to create them.<br />
<br />
This brings together 4 years' worth of research on document clustering evaluation and algorithms. I wish to thank everyone who has supported me along the way!</div>
</div>
<h2>Bash one-liner - say random quotes from the internet at random intervals (2014-05-17)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
Here is some weekend fun: speaking random quotes from the internet at random intervals. I have only tested this on my MacBook, and I am not sure whether the say utility exists on other platforms.<br />
<br />
<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">while true; do say `curl -s http://www.quotedb.com/quote/quote.php?action=random_quote | head -n 1 | sed s/document.write\(\'//g | sed s/\<br\>\'\)\;//g`; sleep $((RANDOM%60)); done</span></div>
</div>
<h2>Minimal Test Collection (MTC) Evaluation Utility (2013-06-23)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
I have been using mtc-eval from the <a href="http://trec.nist.gov/data/web09.html" target="_blank">TREC 2009 Web Track homepage</a> and had trouble getting it to run without it crashing with segmentation faults. I found a newer version on the <a href="http://ir.cis.udel.edu/~carteret/downloads.html" target="_blank">author's web page</a> which fixed the problems I was experiencing. Also, the <a href="http://en.wikipedia.org/wiki/GNU_Scientific_Library" target="_blank">GNU Scientific Library</a> that this software depends on will install without the <a href="http://en.wikipedia.org/wiki/LAPACK" target="_blank">LAPACK</a> and <a href="http://en.wikipedia.org/wiki/BLAS" target="_blank">BLAS</a> dependencies, so remember to install the lapack, lapack-devel, atlas, atlas-devel, blas and blas-devel packages found in most Linux distributions.</div>
<h2>ClusterEval 1.0 Released (2013-06-15)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
Today I have released <a href="https://mloss.org/revision/view/1331/">ClusterEval 1.0</a>. This program compares a clustering to a ground truth set of categories according to multiple different measures. It also includes a novel approach called 'Divergence from a Random Baseline' that augments existing measures to correct for ineffective clusterings. It has been used in the evaluation of clustering at the <a href="http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp">INEX XML Mining track</a> in 2009 and 2010, and will be used in the upcoming <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html">Social Event Detection task at MediaEval in 2013</a>. It implements cluster quality metrics based on ground truths such as <a href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html" target="_blank">Purity</a>, <a href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html" target="_blank">Entropy</a>, <a href="http://eprints.qut.edu.au/27756/" target="_blank">Negentropy</a>, <a href="http://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation" target="_blank">F1</a> and <a href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html" target="_blank">NMI</a>.<br />
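<p>For a flavour of what these measures compute, here is a minimal sketch of purity. This is the textbook definition, not the ClusterEval implementation:</p>
<span style="font-family: courier;">from collections import Counter, defaultdict<br /><br />def purity(cluster_ids, class_labels):<br />    # group the ground truth labels by the cluster each document landed in<br />    by_cluster = defaultdict(list)<br />    for cluster, label in zip(cluster_ids, class_labels):<br />        by_cluster[cluster].append(label)<br />    # each cluster contributes the size of its majority class<br />    majority = sum(max(Counter(labels).values()) for labels in by_cluster.values())<br />    return majority / len(class_labels)<br /><br />print(purity([1, 1, 2, 2, 2], ["a", "a", "a", "b", "b"]))  # 0.8</span><br />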
<div>
<br /></div>
<div>
Further details describing the use and functionality of this software are available in the <a href="http://eprints.qut.edu.au/60711/" target="_blank">manual</a>.</div>
<div>
<br /></div>
<div>
Complete details of the quality measures can be found in the paper '<a href="http://eprints.qut.edu.au/53371/" target="_blank">Document Clustering Evaluation: Divergence from a Random Baseline</a>'.</div>
<div>
<br /></div>
<div>
The Social Event Detection task at MediaEval involves automated detection of social events from real life social networks. If this sounds of interest to you, head over to the <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html" target="_blank">task description page</a> and register.</div>
</div>
<h2>The 2013 Social Event Detection Task (2013-04-02)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
The task description for the <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html" target="_blank">Social Event Detection Task at MediaEval 2013</a> has been released. The task involves supervised clustering of events from real social media networks.<br />
<br />
Previous work on clustering evaluation, which came from my involvement in the <a href="http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp" target="_blank">INEX XML Mining track</a> in 2009 and 2010 and which I described in the paper "<a href="http://eprints.qut.edu.au/53371/" target="_blank">Document Clustering Evaluation: Divergence from a Random Baseline</a>", is being used during the evaluation.<br />
<br />
If this sounds of interest to you, head over to the <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html" target="_blank">task description page</a> and register!</div>
<h2>RIP Spunky (2012-06-08)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
I have known Spunky the dog since my family received him as a puppy 15 years ago. He was always happy and full of energy. I would often catch him smiling :)<br />
<br />
You can <a href="http://chris.de-vries.id.au/2011/02/spunky-chasing-things.html" target="_blank">see him chasing things in an earlier post</a>.<br />
<br />
He passed away today :(<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuwOtdVUr6XkdcNxRM-PtOUNr_tWVnasICdgI2xSRYpe992xQMCXsnqMkGmaULQJjLsJAzRdA9vpeBCojX3MIL0N49qGMuEqI3ndPLiYfPC5pOkrZAud3Z-7AwNi98X-d6gsvEV56gSiDZ/s1600/spunk.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuwOtdVUr6XkdcNxRM-PtOUNr_tWVnasICdgI2xSRYpe992xQMCXsnqMkGmaULQJjLsJAzRdA9vpeBCojX3MIL0N49qGMuEqI3ndPLiYfPC5pOkrZAud3Z-7AwNi98X-d6gsvEV56gSiDZ/s320/spunk.JPG" width="240" /></a></div>
<br /></div>
<h2>Opening a TAR file in Python containing millions of files (2012-05-29)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
I was recently opening a <a href="http://en.wikipedia.org/wiki/Tar_(file_format)" target="_blank">tarball</a> containing millions of XML files. I got about half way through parsing the files when my VM ran out of memory and came to a grinding halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the suspects: the tarfile and <a href="http://lxml.de/" target="_blank">lxml</a> modules I was using to process the data.<br />
<br />
I initially thought this may be because I was not closing the buffer created by the <a href="http://docs.python.org/library/tarfile.html#tarfile.TarFile.extractfile" target="_blank">TarFile.extractfile()</a> function. This was not the case.<br />
<br />
It turns out that the TarFile class keeps cached copies of member information in an attribute called members. This behaviour is undocumented, but a workaround has been <a href="http://blogs.oucs.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/" target="_blank">described by Alexander Dutton</a>. The workaround is to set the members attribute to an empty list after processing each file.<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">import tarfile</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">tar = tarfile.open('large.tar.gz', 'r:gz')</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">for tarinfo in tar:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> <span style="color: #6aa84f;"># open the file from the archive as an in memory buffer</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> buf = tar.extractfile(tarinfo)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> for line in buf:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> <span style="color: #6aa84f;"># do something with the line</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> process(line)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> <span style="color: #6aa84f;"># free the cached data structures held by the TarFile object</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> tar.members = []</span></div>chrishttp://www.blogger.com/profile/14662093233141360874noreply@blogger.com0tag:blogger.com,1999:blog-766066527791534700.post-16337099543628798652012-05-14T23:45:00.003-07:002012-05-15T03:48:21.612-07:00Alt key not working in Mac OS X terminal<div dir="ltr" style="text-align: left;" trbidi="on">
I have had trouble switching windows in <a href="http://irssi.org/" target="_blank">irssi</a> (alt + 1, alt + 2, ..., alt + n) since I have been using my MacBook more often. It turns out that the <a href="http://en.wikipedia.org/wiki/OS_X" target="_blank">OS X</a> terminal application rebinds the alt key for <a href="http://superuser.com/questions/124336/mac-os-x-keyboard-shortcuts-for-terminal" target="_blank">shortcuts</a>.<br />
<br />
It is possible to <a href="http://remibergsma.wordpress.com/2012/01/30/alt-key-aan-de-praat-in-osx-terminal/" target="_blank">rebind the alt key</a> so it will work as expected.<br />
<div>
<br /></div>
To do this, go to<br />
Terminal > Preferences > Settings > Keyboard<br />
<div>
and select 'Use option as meta key' and then the alt key will be passed through to the terminal.</div>
</div>
<h2>Barebones .vimrc (2011-09-28)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
My barebones .vimrc. Nothing fancy. Just syntax highlighting, auto-indenting and tabbing, mouse support and an 80 column marker.<br />
<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">syn on</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set softtabstop=4</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set shiftwidth=4</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set tabstop=4</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set expandtab</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set smarttab</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set autoindent</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set mouse=a</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set textwidth=80</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set colorcolumn=+1</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">hi ColorColumn ctermbg=darkgrey guibg=</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">darkgrey</span></div>chrishttp://www.blogger.com/profile/14662093233141360874noreply@blogger.com0tag:blogger.com,1999:blog-766066527791534700.post-31695143236527008322011-02-19T09:23:00.000-08:002011-08-18T23:15:57.979-07:00Spunky Chasing ThingsSpunky the dog chasing things.<br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/uouTacl0XsU" frameborder="0" allowfullscreen></iframe><br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/4_1klU_wjEE" frameborder="0" allowfullscreen></iframe><br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/RjvDBB6u1rc" frameborder="0" allowfullscreen></iframe><br />
<h2>Bridge to Bridge (2010-07-04)</h2>I rode along the Brisbane River from the bridge at Indooroopilly to the Gateway Bridge and back.<br />
<br />
Date: 26/06/2010 7:22 am<br />
Distance: 65.0 kilometers<br />
Elapsed Time: 3:20:22<br />
Avg. Speed: 19.5 km/h<br />
Max. Speed: 56.3 km/h<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.000489e45ab27c50c24a5&ll=-27.481315,153.03515&spn=0.08721,0.1233&output=embed"></iframe> <br />
<br />
<br /><small><a href="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.000489e45ab27c50c24a5&ll=-27.481315,153.03515&spn=0.08721,0.1233&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<h2>Hyperref and Cleveref LaTeX Conflict (2010-06-25)</h2>I ran into a conflict between two LaTeX packages when writing my thesis. Both hyperref and cleveref are useful packages for automatically dealing with references in a document.<br/><br/><a href="http://www.tug.org/applications/hyperref/">Hyperref</a> is a TeX package for making documents with live links in PDF and HTML output formats. It places automatic links for \ref{} and \cite{} commands into the final PDF document, allowing readers to simply click on a reference to a Table, Figure, Equation or Citation and be taken to its location in the document.<br/><br/>The <a href="http://www.ctan.org/tex-archive/help/Catalogue/entries/cleveref.html">cleveref</a> package enhances LaTeX's cross-referencing features, allowing the format of references to be determined automatically according to the type of reference. This is almost like the type inference found in modern programming languages. When using the \ref{} command in LaTeX, one typically mentions the type of reference in the text and then \ref{} provides the number for the reference. For example, Table \ref{table:x2} shows the values of x^2 for integers 1 through 10. Using cleveref one can omit the leading "Table" and the package automatically infers it from the reference. For example, \Cref{table:x2} shows the values of x^2 for integers 1 through 10.<br/><br/>I used cleveref throughout my entire thesis and I included the hyperref package at the end to add PDF links. Unfortunately this somehow manages to conflict with cleveref and it no longer works as expected. The following demonstrates what happens: the caption of the Figure is inserted into the reference.<br/><br/><img src="http://de-vries.ws/cleveref.png" alt="cleveref and hyperref conflicting" width="500" /><br/><br/>I ended up not using the cleveref package for this reason. I am not a LaTeX expert when it comes to writing packages, so until I can dedicate some time to learn what is going on I have no working solution.<br/><br/>I used the TeXLive packages from the Ubuntu repository, which contain the hyperref package. I downloaded <a href="http://www.ctan.org/tex-archive/help/Catalogue/entries/cleveref.html">cleveref from the CTAN repository</a>.
<h2>A Crash Course on Modern Hardware (2010-01-18)</h2>Cliff Click from Azul Systems <a href="http://www.infoq.com/presentations/click-crash-course-modern-hardware">gives a talk</a> on one of my favourite subjects, Computer Architecture. He looks at the difficulty of predicting performance on modern x86 CPUs.
<h2>Ride to Portside (2010-01-12)</h2>I rode to Portside this morning to try out MotionX GPS on the iPhone. I found a couple of weird signs along the way.<br />
<br />
Date: 12/01/2010 8:49 am<br />
Distance: 22.0 kilometers<br />
Elapsed Time: 1:12:46<br />
Avg. Speed: 18.1 km/h<br />
Max. Speed: 44.5 km/h<br />
<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com/maps/ms?ie=UTF8&t=p&msa=0&msid=108787632614172197584.00047ceceecf59cc72900&ll=-27.456874,153.052168&spn=0.045698,0.042915&z=14&output=embed"></iframe> <br />
<br />
<small><a href="http://maps.google.com/maps/ms?ie=UTF8&t=p&msa=0&msid=108787632614172197584.00047ceceecf59cc72900&ll=-27.456874,153.052168&spn=0.045698,0.042915&z=14&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<h2>Turing's Cathedral (2010-01-10)</h2><a href="http://www.edge.org/3rd_culture/dyson05/dyson05_index.html">Turing's Cathedral</a> is an interesting look at computing at the 60th anniversary of John von Neumann's proposal for the digital computer. It covers aspects of computational models, biology, AI and the growing wealth of knowledge on the internet.<br/><br/>The quotes below show how we have far exceeded the original expectations of computation. Even though we are still programming the von Neumann architecture / Turing machines, I have always wondered how much the languages we use would change given an entirely different computational model.<br/><br/>"When the machine finally became operational in 1951, it had 5 kilobytes of random-access memory: a 32 x 32 x 40 matrix of binary digits, stored as a flickering pattern of electrical charge, shifting from millisecond to millisecond on the surface of 40 cathode-ray tubes."<br/><br/>"By breaking the distinction between numbers that mean things and numbers that do things, von Neumann unleashed the power of the stored-program computer, and our universe would never be the same."<br/><br/>"In the early 1950s, when mean time between memory failure was measured in minutes, no one imagined that a system depending on every bit being in exactly the right place at exactly the right time could be scaled up by a factor of 10^13 in size, and down by a factor of 10^6 in time. Von Neumann, who died prematurely in 1957, became increasingly interested in understanding how biology has managed (and how technology might manage) to construct reliable organisms out of unreliable parts. He believed the von Neumann architecture would soon be replaced by something else. Even if codes could be completely debugged, million-cell memories could never be counted upon, digitally, to behave consistently from one kilocycle to the next."<br/><br/>"As organisms, we possess two outstanding repositories of information: the information conveyed by our genes, and the information stored in our brains. Both of these are based upon non-von-Neumann architectures, and it is no surprise that Von Neumann became fascinated with these examples as he left his chairmanship of the AEC (where he had succeeded Lewis Strauss) and began to lay out the research agenda that cancer prevented him from following up."<br/><br/>"We can divide the computational universe into three sectors: computable problems; non-computable problems (that can be given a finite, exact description but have no effective procedure to deliver a definite result); and, finally, questions whose answers are, in principle, computable, but that, in practice, we are unable to ask in unambiguous language that computers can understand."
<h2>USA (2009-12-16)</h2>Earlier this year I travelled to the USA for <a href="http://www.sigir2009.org">SIGIR 2009</a>. I visited New York City, Boston, Santa Fe, Los Alamos and Taos. I have made another travel map!<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&source=embed&msa=0&msid=108787632614172197584.000479ae0abcefa83220b&ll=13.923404,-148.710937&spn=113.090606,176.132813&z=2&output=embed"></iframe> <br />
<br />
<br />
<small><a href="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&source=embed&msa=0&msid=108787632614172197584.000479ae0abcefa83220b&ll=13.923404,-148.710937&spn=113.090606,176.132813&z=2&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<br />
<h2>Long Distance (2009-10-26)</h2>Q: Why did the programmer call his mother long distance?<br/>A: Because that was her name.
<h2>Computing Then (2009-06-21)</h2><a href="http://computingnow.computer.org/ct">Computing Then</a> is an interesting look into the past of computing. With much focus on the "now" of computing and electronics, this is a nice break. I always enjoy the 16 & 32 years ago column in the IEEE Computer publication. I remember 16 years ago quite vividly, but alas I am not 32 yet. The older I get, the fewer powers of 2 I get to experience. I doubt I will make 2^7.
<h2>Adaptive Compression of the Dynamic Range (2009-05-26)</h2><a href="http://en.wikipedia.org/wiki/Dynamic_range_compression">Dynamic range compression</a> is a technique used by audio engineers to optimise the distribution of frequencies in a mix. In popular music it is often abused, resulting in a flat and overly loud sound. An audio engineer applies static parameters of threshold, knee, compression ratio, attack, release and gain based on the typical listening environment of their anticipated user. This is why you can hear your favourite top 40 track in a noisy environment but you can barely hear classical music at the same volume level on your audio device. Therefore, I propose that these parameters should adapt to the environment of the user. If it is noisy, the compression ratio is pushed up, and if it is quiet it can be relaxed. Intuitively, I assume this would be a relatively easy problem to solve with machine learning. My explanation using a single parameter is deliberately simple; I am sure more complex relationships exist between the parameters and user satisfaction. If someone could embed this in a popular music player with the correct audio source, it could be a winning combination. However, it requires an unmastered, or minimally mastered, audio source.
<h2>Dusty (2009-02-14)</h2>This dust storm in Broken Hill just blew my mind. Looks so awesome!<br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/95tmYmeHf84#" frameborder="0" allowfullscreen></iframe><br />
<h2>Europe (2009-01-24)</h2>I recently travelled to Europe to attend <a href="http://www.inex.otago.ac.nz/">INEX</a>.<br />
<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.00046131f8bc6925a392f&ll=13.581921,68.554688&spn=113.167466,175.429688&z=2&output=embed"></iframe> <br />
<br />
<br />
<small><a href="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.00046131f8bc6925a392f&ll=13.581921,68.554688&spn=113.167466,175.429688&z=2&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<h2>500 days (2009-01-23)</h2># uptime<br/>06:01:30 up 509 days, 6:12, 1 user, load average: 0.00, 0.00, 0.00<br/><br/>Now that I have passed the 500 day mark I can bring myself to install a new distro. Time to say goodbye to Slackware after 14 years and replace it with noobuntu server. Package management just seems easier. This is my last box with Slackware still on it.
<h2>K-tree, NMF and INEX 2008 (2008-10-28)</h2>Today I gave a <a href="http://de-vries.ws/presentation.pdf">presentation</a> within my research group at <a href="http://www.qut.edu.au">QUT</a>. It discusses the submissions I made for the <a href="http://xmlmining.lip6.fr/">XML Mining track</a> at <a href="http://www.inex.otago.ac.nz/">INEX 2008</a>. One task required classifying documents based on previously known examples (<a href="http://en.wikipedia.org/wiki/Statistical_classification">classification</a>). Another task required grouping similar documents together with no prior information other than the documents themselves (<a href="http://en.wikipedia.org/wiki/Cluster_analysis">clustering</a>). I also looked at different ways to measure cluster quality using negentropy and document link graphs. The <a href="http://sourceforge.net/projects/k-tree">K-tree</a> algorithm is part of my research; this is the first time it has been applied to document clustering. The results for the entire track should be out soon. I will also be giving the presentation at the QLD IEEE Computational Intelligence Symposium.