<h1>Perplexing Permutations</h1><p><i>the musings of Chris de Vries</i></p>
<h2>Cleanly terminating threads nested in processes in Python (2023-12-27)</h2>
<p>The following code cleanly terminates threads nested inside processes in Python. Processes are started using a <a href="https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor">ProcessPoolExecutor</a>. To shut down the pool cleanly, threads must be signaled to terminate via a threading <a href="https://docs.python.org/3/library/threading.html#event-objects">Event</a>. The SIGINT signal is captured in the parent and all child processes so that a <a href="https://docs.python.org/3/library/exceptions.html#KeyboardInterrupt">KeyboardInterrupt</a> exception is not thrown, which would lead to threads or processes terminating in an unclean state after the user presses ctrl + c, raising the SIGINT <a href="https://en.wikipedia.org/wiki/Signal_(IPC)">signal</a>. All <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Future">futures</a> of the processes containing threads are then waited on to complete using <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.wait">wait</a>. Finally, the process pool can be shut down cleanly using <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.shutdown">shutdown</a>, now that all tasks <a href="https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.submit">submitted</a> to the pool are complete.<br /></p><p><span style="font-family: courier;">import concurrent.futures<br />import os<br />import signal<br />import threading<br />import time<br /><br /># set by the SIGINT handler to tell worker threads to stop<br />terminate = threading.Event()<br /><br />def sigint(sig, frame):<br />    print(f"SIGINT received by {os.getpid()} -> setting terminate")<br />    terminate.set()<br /><br />def thread_worker():<br />    tid = threading.get_ident()<br />    pid = os.getpid()<br />    while not terminate.is_set():<br />        print(f"sleeping in thread {tid} in process {pid}")<br />        time.sleep(1)<br />    print(f"{thread_worker} in thread {tid} in process {pid} finished cleanly")<br /><br />def process_worker():<br />    # each child process installs its own SIGINT handler and runs one thread<br />    signal.signal(signal.SIGINT, sigint)<br />    t = threading.Thread(target=thread_worker, daemon=True)<br />    t.start()<br />    t.join()<br />    print(f"{process_worker} in process {os.getpid()} finished cleanly")<br /><br />if __name__ == "__main__":<br />    signal.signal(signal.SIGINT, sigint)<br />    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as pool:<br />        futures = [pool.submit(process_worker) for _ in range(3)]<br />        print(futures)<br />        concurrent.futures.wait(futures, return_when=concurrent.futures.ALL_COMPLETED)<br />        pool.shutdown()<br />    print("parent exited cleanly")</span></p><p>This produces the following output when run, with ctrl + c pressed after the threads have printed to the console for the second time.</p><p><span style="font-family: courier;">$ python3 pool.py<br />sleeping in thread 6155104256 in process 1788<br />sleeping in thread 6185136128 in process 1789<br />sleeping in thread 6155104256 in process 1788<br />sleeping in thread 6185136128 in process 1789<br />^CSIGINT received by 1789 -> setting terminate<br />SIGINT received by 1788 -> setting
terminate<br />SIGINT received by 1786 -> setting terminate<br /><function thread_worker at 0x1025d0790> in thread 6155104256 in process 1788 finished cleanly<br /><function thread_worker at 0x10092c790> in thread 6185136128 in process 1789 finished cleanly<br /><function process_worker at 0x1025d0820> in process 1788 finished cleanly<br /><function process_worker at 0x10092c820> in process 1789 finished cleanly<br /><function thread_worker at 0x1025d0790> in thread 6155104256 in process 1788 finished cleanly<br /><function process_worker at 0x1025d0820> in process 1788 finished cleanly<br />[<Future at 0x10078ea60 state=finished returned NoneType>, <Future at 0x1007a85b0 state=finished returned NoneType>, <Future at 0x1007a8a00 state=finished returned NoneType>]<br />parent exited cleanly<br />$</span><br /></p>
<h2>Web Scale Document Clustering: Clustering 733 Million Web Pages (2015-05-31)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://en.wikipedia.org/wiki/Document_clustering" target="_blank">Document clustering</a> analyses written language in unstructured text to place documents into topically related groups, clusters, or topics. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an <a href="http://en.wikipedia.org/wiki/Unsupervised_learning" target="_blank">unsupervised</a> manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All <a href="http://en.wikipedia.org/wiki/Semantic_memory" target="_blank">semantic information</a> is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together. Documents are often represented in the <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">vector space model</a> and similarity is compared using geometric measures such as <a href="http://en.wikipedia.org/wiki/Euclidean_distance" target="_blank">Euclidean distance</a> or <a href="http://en.wikipedia.org/wiki/Cosine_similarity" target="_blank">cosine similarity</a>.<br />
<br />
I introduced an approach to document clustering using <a href="http://eprints.qut.edu.au/43451/" target="_blank">TopSig</a> document signatures and K-tree in a previous post on <a href="http://chris.de-vries.id.au/2013/07/large-scale-document-clustering.html" target="_blank">large scale document clustering</a>. This post highlights the progress made since then in the <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parallel Streaming Signature EM-tree</a> algorithm implemented in the <a href="http://lmwtree.devries.ninja/" target="_blank">LMW-tree</a> C++ template library. It is now possible to cluster nearly 1 billion documents into nearly 1 million clusters on a single mid-range machine with 16 cores and 64GB of memory in under 1 day. I am not aware of any other approaches reporting this scale of document clustering, never mind on a single machine. Other approaches, which cluster lower-dimensional dense vectors, used 2,000 to 16,000 cores on large compute clusters. Many also produced a much smaller number of clusters.<br />
<br />
<a href="http://eprints.qut.edu.au/43451/" target="_blank">TopSig</a> is a particular model that allows efficient representation and comparison of similarity of natural language documents. TopSig extends <a href="http://en.wikipedia.org/wiki/Random_indexing" target="_blank">random indexing</a> to produce bit vectors representing documents. These bit vectors are then compared using <a href="http://en.wikipedia.org/wiki/Hamming_distance" target="_blank">Hamming distance</a> to measure the similarity between documents. Random indexing is an incremental construction of a <a href="http://en.wikipedia.org/wiki/Random_projection" target="_blank">random projection</a>. Every document can be indexed in isolation, leading to linear scalability for parallel and distributed implementations.<br />
<div>
<br /></div>
In the original TopSig <a href="http://eprints.qut.edu.au/43451/" target="_blank">paper</a> I introduced an algorithm to cluster bit vectors directly. All cluster centers and points are bit vectors, which can be compared 64 bits at a time on modern processors. While document signatures have often been used for similarity search in tasks like near duplicate detection, few clustering algorithms work directly with these computationally efficient compressed representations. The approach led to efficiency gains of 10 to 100 times over the traditional k-means algorithm implemented in C in <a href="http://glaros.dtc.umn.edu/gkhome/views/cluto" target="_blank">CLUTO</a>. When using 4096 bit vectors there was no reduction in cluster quality with respect to human generated categorizations. Furthermore, this was an unoptimized single-threaded version in Java. There were certainly gains to be made via parallelization, low level optimization in a native language, streaming implementations, and tree based algorithms like K-tree or EM-tree. The result is the <a href="http://lmwtree.devries.ninja/" target="_blank">LMW-tree</a> template library written in C++ and the <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parallel Streaming Signature EM-tree</a> algorithm recently presented at the <a href="http://www.www2015.it/accepted-papers/" target="_blank">24th International World Wide Web Conference</a>.<br />
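<p>A sketch of what bulk comparison of packed signatures might look like, using numpy rather than the Java or C++ of the actual implementations; the signature contents here are random placeholders.</p>
<span style="font-family: courier;">import numpy as np<br /><br />rng = np.random.default_rng(0)<br /># 1000 signatures of 4096 bits each, packed 8 bits per byte (512 bytes per row)<br />signatures = rng.integers(0, 256, size=(1000, 512), dtype=np.uint8)<br />query = signatures[0]<br /><br /># XOR the query against every signature, then count differing bits per row<br />xor = np.bitwise_xor(signatures, query)<br />distances = np.unpackbits(xor, axis=1).sum(axis=1)<br />print(distances[:5])  # Hamming distances of the first 5 signatures to the query</span><br />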
<br />
The <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parrallel Streaming Signature EM-tree</a> algorithm can scale in a distributed setting because the tree is immutable when performing inserts. Updates to the tree happen afterwards and are 5000 times faster than the insert step. The update operation is currently serial but could be parallelized if required. However, given <a href="http://en.wikipedia.org/wiki/Amdahl's_law" target="_blank">Amdahl's law</a>, there can be upto 5000 parallel inserts before the update operations becomes a bottleneck. Implementing a <a href="https://github.com/cmdevries/LMW-tree/issues/5" target="_blank">distributed algorithm</a> is one direction for future work related to this research. I think it would be exciting to demonstrate the clustering of the entire searchable web of 50 billion web pages into millions of clusters or topics.<br />
<br />
The EM-tree algorithm can cluster the 733 million documents of the <a href="http://www.lemurproject.org/clueweb12.php/" target="_blank">ClueWeb12</a> web crawl into 600,000 clusters on a single mid-range machine. There is a vast diversity of topics expressed as natural language on the web, leading to the need for such a large number of clusters. We also evaluated the quality of these clusterings using human relevance judgements for search queries and spam classifications. This evaluation highlighted that a large number of clusters produces higher quality clusterings.<br />
<br />
Such a large number of clusters can be used for many different tasks, including improving search quality and efficiency, automated link detection, document classification, and representation learning. Furthermore, such approaches can be applied to domains outside of natural language and information retrieval. The computer vision community has demonstrated the utility of highly scalable unsupervised learning algorithms in <a href="http://arxiv.org/pdf/1112.6209.pdf" target="_blank">several</a> pieces of <a href="http://papers.nips.cc/paper/4497-emergence-of-object-selective-features-in-unsupervised-feature-learning" target="_blank">research</a>.<br />
<br />
In summary, we were able to scale up document clustering to large data sets by using representations and algorithms that work well on modern computer architectures. Using a much larger model than previous approaches also improved cluster quality. For more details please refer to the <a href="http://eprints.qut.edu.au/84386/" target="_blank">Parallel Streaming Signature EM-tree paper</a> and the <a href="https://github.com/cmdevries/LMW-tree/blob/master/src/lmw/StreamingEMTree.h" target="_blank">implementation in the LMW-tree library</a>.</div>
<h2>Large Scale Document Clustering: Clustering and Searching 50 Million Web Pages (2014-10-26)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://en.wikipedia.org/wiki/Document_clustering" target="_blank">Document clustering</a> analyses written language in unstructured text to place documents into topically related groups or clusters. Documents such as web pages are automatically grouped together by a computer program so that pages talking about the same concepts are in the same cluster and those talking about different concepts are in different clusters. This is performed in an <a href="http://en.wikipedia.org/wiki/Unsupervised_learning" target="_blank">unsupervised</a> manner where there is no manual labelling of the documents for these concepts, topics or other semantic information. All <a href="http://en.wikipedia.org/wiki/Semantic_memory" target="_blank">semantic information</a> is derived from the documents themselves. The core concept that allows this to happen is the definition of a similarity between two documents. An algorithm uses this similarity measure and optimises it so that the most similar documents are placed together.<br />
<br />
The <a href="http://en.wikipedia.org/wiki/Relevance_(information_retrieval)#Clustering_and_relevance" target="_blank">cluster hypothesis</a> stated by van Rijsbergen in 1979, "asserts that two documents that are similar to each other have a high likelihood of being relevant to the same information need". As document clustering places similar documents in the same cluster, the cluster hypothesis supports the notion that only a small fraction of document clusters need to be searched to fulfil a users information need when submitting a query to a search engine.<br />
<div>
<br /></div>
<div>
A section on exploiting large scale document clustering and the cluster hypothesis to produce a more efficient search engine is available in my PhD thesis, titled "<a href="http://eprints.qut.edu.au/75862/" target="_blank">Distributed information retrieval: Collection distribution, selection and the cluster hypothesis for evaluation of document clustering</a>". The work brings together two lines of related research. It builds upon the evaluation of document clustering started at the <a href="http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp" target="_blank">INEX XML Mining track</a>. It also extends work with the <a href="http://ktree.sf.net/" target="_blank">K-tree</a> data structure and algorithm, presenting for the first time the TopSig K-tree, which works with binary document signatures produced by TopSig. It allows the 50 million web page ClueWeb09 Category B document collection used at the <a href="http://trec.nist.gov/data/webmain.html" target="_blank">TREC Web Track</a> to be clustered in 10 hours into approximately 140,000 clusters using a single thread of Java code. To the best of my knowledge, no approach that clusters 50 million documents without sampling has been described in the literature. There are many opportunities to reduce the time required to cluster via parallel and distributed processing, low level optimisation in a native programming language and shorter document signatures.<br />
<br />
The clusters produced by TopSig K-tree have been used with a new cluster ranking approach based upon <a href="http://en.wikipedia.org/wiki/Okapi_BM25" target="_blank">Okapi BM25</a> that combines document weights to represent clusters. I have called it CBM625 as it squares BM25 term-document weights and combines them to rank clusters; a sketch of the idea follows this paragraph. The final result is that this approach is able to search 13-fold fewer documents than the previous best reported approach on the <a href="http://lemurproject.org/clueweb09/" target="_blank">ClueWeb09 Category B</a> 50 million document collection. The theoretical clustering evaluation at INEX suggested that fine-grained clusters allow better ranking of clusters for collection selection. It used <a href="http://en.wikipedia.org/wiki/Relevance_(information_retrieval)" target="_blank">relevance judgements</a> to place documents relevant to a query in an optimal order with respect to a document clustering, which represents an upper bound for any collection selection approach given the same clustering of documents. These new experimental results demonstrate the effectiveness of fine-grained document clustering using a large scale clustering algorithm able to produce approximately 140,000 clusters, a collection selection approach and a final ranking of documents by a search engine. The results were evaluated using queries 1-50 from the <a href="http://trec.nist.gov/data/web09.html" target="_blank">TREC 2009 Web Track</a>. Only the 8 most highly ranked of the 140,000 document clusters need to be searched to ensure there is no statistically significant difference in retrieval quality.<br />
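<p>A minimal sketch of that cluster ranking idea, assuming BM25 term-document weights have already been computed; the data structures and names here are mine for illustration and are not taken from the thesis code:</p>
<span style="font-family: courier;">from collections import defaultdict<br /><br />def rank_clusters(query_terms, bm25, doc_cluster):<br />    # bm25 maps (term, doc) to a BM25 weight; doc_cluster maps doc to cluster id<br />    scores = defaultdict(float)<br />    for (term, doc), weight in bm25.items():<br />        if term in query_terms:<br />            # square the term-document weight and accumulate it per cluster<br />            scores[doc_cluster[doc]] += weight * weight<br />    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)<br /><br />bm25 = {("ktree", "d1"): 2.1, ("ktree", "d2"): 0.4, ("bm25", "d3"): 1.7}<br />doc_cluster = {"d1": "c1", "d2": "c2", "d3": "c1"}<br />print(rank_clusters({"ktree", "bm25"}, bm25, doc_cluster))</span><br />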
<br />
I have made the software and some of the data from this paper available at <a href="http://sourceforge.net/projects/ktree/files/docclust_ir/" target="_blank">http://sourceforge.net/projects/ktree/files/docclust_ir/</a>.<br />
<br />
All of the software required to replicate the experiments is contained in docclust_ir.tar.gz. This includes the <a href="http://www.atire.org/" target="_blank">ATIRE</a> search engine, the <a href="http://topsig.googlecode.com/" target="_blank">TopSig</a> search engine, the version of TopSig K-tree described in the paper, and the collection selection approach that ranks documents using the clusterings produced by TopSig K-tree. It also includes the scripts to run everything. This is undocumented, messy, rushed research code, but I am happy to help you with any problems you run into using it. You will also need to obtain the INEX 2009 XML Wikipedia and the ClueWeb09 Category B document collections.<br />
<br />
Furthermore, the clusters of the INEX 2009 Wikipedia and ClueWeb09 Category B collections are available, along with the document signatures used to create them.<br />
<br />
This brings together 4 years' worth of research on document clustering evaluation and algorithms. I wish to thank everyone who has supported me along the way!</div>
</div>
<h2>Bash one-liner - say random quotes from the internet at random intervals (2014-05-17)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
Here is some weekend fun: speaking random quotes from the internet at random intervals. I have only tested this on my MacBook, and I am not sure whether the say utility exists on other platforms.<br />
<br />
<br />
<div class="p1">
<span style="font-family: Courier New, Courier, monospace;">while true; do say `curl -s http://www.quotedb.com/quote/quote.php?action=random_quote | head -n 1 | sed s/document.write\(\'//g | sed s/\<br\>\'\)\;//g`; sleep $((RANDOM%60)); done</span></div>
</div>
<h2>Minimal Test Collection (MTC) Evaluation Utility (2013-06-23)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
I have been using mtc-eval from the <a href="http://trec.nist.gov/data/web09.html" target="_blank">TREC 2009 Web Track homepage</a> and had trouble getting it to run without it crashing with segmentation faults. I found a newer version on the <a href="http://ir.cis.udel.edu/~carteret/downloads.html" target="_blank">author's web page</a> which fixed the problems I was experiencing. Also, the <a href="http://en.wikipedia.org/wiki/GNU_Scientific_Library" target="_blank">GNU Scientific Library</a> that this software depends on will install without the <a href="http://en.wikipedia.org/wiki/LAPACK" target="_blank">LAPACK</a> and <a href="http://en.wikipedia.org/wiki/BLAS" target="_blank">BLAS</a> dependencies, so remember to install the lapack, lapack-devel, atlas, atlas-devel, blas and blas-devel packages found in most Linux distributions.</div>
<h2>ClusterEval 1.0 Released (2013-06-15)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
Today I have released <a href="https://mloss.org/revision/view/1331/">ClusterEval 1.0</a>. This program compares a clustering to a ground truth set of categories according to multiple different measures. It also includes a novel approach called 'Divergence from a Random Baseline' that augments existing measures to correct for ineffective clusterings. It has been used in the evaluation of clustering at the <a href="http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp">INEX XML Mining track</a> in 2009 and 2010, and will be used in the upcoming <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html">Social Event Detection task at MediaEval in 2013</a>. It implements cluster quality metrics based on ground truths such as <a href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html" target="_blank">Purity</a>, <a href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html" target="_blank">Entropy</a>, <a href="http://eprints.qut.edu.au/27756/" target="_blank">Negentropy</a>, <a href="http://en.wikipedia.org/wiki/Cluster_analysis#External_evaluation" target="_blank">F1</a> and <a href="http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html" target="_blank">NMI</a>.<br />
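<p>For a flavour of what these measures compute, here is a minimal sketch of purity. This is the textbook definition, not the ClusterEval implementation:</p>
<span style="font-family: courier;">from collections import Counter, defaultdict<br /><br />def purity(cluster_ids, class_labels):<br />    # group the ground truth labels by the cluster each document landed in<br />    by_cluster = defaultdict(list)<br />    for cluster, label in zip(cluster_ids, class_labels):<br />        by_cluster[cluster].append(label)<br />    # each cluster contributes the size of its majority class<br />    majority = sum(max(Counter(labels).values()) for labels in by_cluster.values())<br />    return majority / len(class_labels)<br /><br />print(purity([1, 1, 2, 2, 2], ["a", "a", "a", "b", "b"]))  # 0.8</span><br />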
<div>
<br /></div>
<div>
Further details describing the use and functionality of this software are available in the <a href="http://eprints.qut.edu.au/60711/" target="_blank">manual</a>.</div>
<div>
<br /></div>
<div>
Complete details of the quality measures can be found in the paper '<a href="http://eprints.qut.edu.au/53371/" target="_blank">Document Clustering Evaluation: Divergence from a Random Baseline</a>'.</div>
<div>
<br /></div>
<div>
The Social Event Detection task at MediaEval involves automated detection of social events from real life social networks. If this sounds of interest to you, head over to the <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html" target="_blank">task description page</a> and register.</div>
</div>
<h2>The 2013 Social Event Detection Task (2013-04-02)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
The task description for the <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html" target="_blank">Social Event Detection Task at MediaEval 2013</a> has been released. The task involves supervised clustering of events from real social media networks.<br />
<br />
Previous work on clustering evaluation, which came from my involvement in the <a href="http://www.inex.otago.ac.nz/tracks/wiki-mine/wiki-mine.asp" target="_blank">INEX XML Mining track</a> in 2009 and 2010 and which I described in the paper "<a href="http://eprints.qut.edu.au/53371/" target="_blank">Document Clustering Evaluation: Divergence from a Random Baseline</a>", is being used during the evaluation.<br />
<br />
If this sounds of interest to you, head over to the <a href="http://www.multimediaeval.org/mediaeval2013/sed2013/index.html" target="_blank">task description page</a> and register!</div>
<h2>RIP Spunky (2012-06-08)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
I have known Spunky the dog since my family received him as a puppy 15 years ago. He was always happy and full of energy. I would often catch him smiling :)<br />
<br />
You can <a href="http://chris.de-vries.id.au/2011/02/spunky-chasing-things.html" target="_blank">see him chasing things in an earlier post</a>.<br />
<br />
He passed away today :(<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuwOtdVUr6XkdcNxRM-PtOUNr_tWVnasICdgI2xSRYpe992xQMCXsnqMkGmaULQJjLsJAzRdA9vpeBCojX3MIL0N49qGMuEqI3ndPLiYfPC5pOkrZAud3Z-7AwNi98X-d6gsvEV56gSiDZ/s1600/spunk.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhuwOtdVUr6XkdcNxRM-PtOUNr_tWVnasICdgI2xSRYpe992xQMCXsnqMkGmaULQJjLsJAzRdA9vpeBCojX3MIL0N49qGMuEqI3ndPLiYfPC5pOkrZAud3Z-7AwNi98X-d6gsvEV56gSiDZ/s320/spunk.JPG" width="240" /></a></div>
<br /></div>
<h2>Opening a TAR file in Python containing millions of files (2012-05-29)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
I was recently opening a <a href="http://en.wikipedia.org/wiki/Tar_(file_format)" target="_blank">tarball</a> containing millions of XML files. I got about half way through parsing the files when my VM ran out of memory and came to a grinding halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the suspects: the tarfile and <a href="http://lxml.de/" target="_blank">lxml</a> modules I was using to process the data.<br />
<br />
I initially thought this may be because I was not closing the buffer created by the <a href="http://docs.python.org/library/tarfile.html#tarfile.TarFile.extractfile" target="_blank">TarFile.extractfile()</a> function. This was not the case.<br />
<br />
It turns out that the TarFile class keeps cached copies of member information in an attribute called members. This behaviour is undocumented, but a workaround has been <a href="http://blogs.oucs.ox.ac.uk/inapickle/2011/06/20/high-memory-usage-when-using-pythons-tarfile-module/" target="_blank">described by Alexander Dutton</a>. The workaround is to set the members attribute to an empty list after processing each file.<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">import tarfile</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;">tar = tarfile.open('large.tar.gz', 'r:gz')</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">for tarinfo in tar:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> <span style="color: #6aa84f;"># open the file from the archive as an in memory buffer</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> buf = tar.extractfile(tarinfo)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> for line in buf:</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> <span style="color: #6aa84f;"># do something with the line</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> process(line)</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> <span style="color: #6aa84f;"># free the cached data structures held by the TarFile object</span></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> tar.members = []</span></div>chrishttp://www.blogger.com/profile/14662093233141360874noreply@blogger.com0tag:blogger.com,1999:blog-766066527791534700.post-16337099543628798652012-05-14T23:45:00.003-07:002012-05-15T03:48:21.612-07:00Alt key not working in Mac OS X terminal<div dir="ltr" style="text-align: left;" trbidi="on">
I have had trouble switching windows in <a href="http://irssi.org/" target="_blank">irssi</a> (alt + 1, alt + 2, ..., alt + n) since I have been using my MacBook more often. It turns out that the <a href="http://en.wikipedia.org/wiki/OS_X" target="_blank">OS X</a> terminal application rebinds the alt key for <a href="http://superuser.com/questions/124336/mac-os-x-keyboard-shortcuts-for-terminal" target="_blank">shortcuts</a>.<br />
<br />
It is possible to <a href="http://remibergsma.wordpress.com/2012/01/30/alt-key-aan-de-praat-in-osx-terminal/" target="_blank">rebind the alt key</a> so it will work as expected.<br />
<div>
<br /></div>
To do this, go to<br />
Terminal > Preferences > Settings > Keyboard<br />
<div>
and select 'Use option as meta key' and then the alt key will be passed through to the terminal.</div>
</div>
<h2>Barebones .vimrc (2011-09-28)</h2><div dir="ltr" style="text-align: left;" trbidi="on">
My barebones .vimrc. Nothing fancy. Just syntax highlighting, auto-indenting and tabbing, mouse support and an 80 column marker.<br />
<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;">syn on</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set softtabstop=4</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set shiftwidth=4</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set tabstop=4</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set expandtab</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set smarttab</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set autoindent</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set mouse=a</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set textwidth=80</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">set colorcolumn=+1</span><br />
<span style="font-family: 'Courier New', Courier, monospace;">hi ColorColumn ctermbg=darkgrey guibg=</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">darkgrey</span></div>chrishttp://www.blogger.com/profile/14662093233141360874noreply@blogger.com0tag:blogger.com,1999:blog-766066527791534700.post-31695143236527008322011-02-19T09:23:00.000-08:002011-08-18T23:15:57.979-07:00Spunky Chasing ThingsSpunky the dog chasing things.<br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/uouTacl0XsU" frameborder="0" allowfullscreen></iframe><br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/4_1klU_wjEE" frameborder="0" allowfullscreen></iframe><br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/RjvDBB6u1rc" frameborder="0" allowfullscreen></iframe><br />
<h2>Bridge to Bridge (2010-07-04)</h2>I rode along the Brisbane River from the bridge at Indooroopilly to the Gateway Bridge and back.<br />
<br />
Date: 26/06/2010 7:22 am<br />
Distance: 65.0 kilometers<br />
Elapsed Time: 3:20:22<br />
Avg. Speed: 19.5 km/h<br />
Max. Speed: 56.3 km/h<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.000489e45ab27c50c24a5&ll=-27.481315,153.03515&spn=0.08721,0.1233&output=embed"></iframe> <br />
<br />
<br /><small><a href="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.000489e45ab27c50c24a5&ll=-27.481315,153.03515&spn=0.08721,0.1233&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<h2>Hyperref and Cleveref LaTeX Conflict (2010-06-25)</h2>I ran into a conflict between two LaTeX packages when writing my thesis. Both hyperref and cleveref are useful packages for automatically dealing with references in a document.<br/><br/><a href="http://www.tug.org/applications/hyperref/">Hyperref</a> is a TeX package for making documents with live links in PDF and HTML output formats. It places automatic links for \ref{} and \cite{} commands into the final PDF document, allowing readers to simply click on a reference to a Table, Figure, Equation or Citation and be taken to its location in the document.<br/><br/>The <a href="http://www.ctan.org/tex-archive/help/Catalogue/entries/cleveref.html">cleveref</a> package enhances LaTeX's cross-referencing features, allowing the format of references to be determined automatically according to the type of reference. This is almost like the type inference found in modern programming languages. When using the \ref{} command in LaTeX, one typically mentions the type of reference in the text and then \ref{} provides the number for the reference. For example, Table \ref{table:x2} shows the values of x^2 for integers 1 through 10. Using cleveref one can omit the leading "Table" and the package automatically infers it from the reference. For example, \Cref{table:x2} shows the values of x^2 for integers 1 through 10.<br/><br/>I used cleveref throughout my entire thesis and I included the hyperref package at the end to add PDF links. Unfortunately this somehow manages to conflict with cleveref and it no longer works as expected. The following demonstrates what happens: the caption of the Figure is inserted into the reference.<br/><br/><img src="http://de-vries.ws/cleveref.png" alt="cleveref and hyperref conflicting" width="500" /><br/><br/>I ended up not using the cleveref package for this reason. I am not a LaTeX expert when it comes to writing packages, so until I can dedicate some time to learn what is going on I have no working solution.<br/><br/>I used the TeXLive packages from the Ubuntu repository, which contain the hyperref package. I downloaded <a href="http://www.ctan.org/tex-archive/help/Catalogue/entries/cleveref.html">cleveref from the CTAN repository</a>.
<h2>A Crash Course on Modern Hardware (2010-01-18)</h2>Cliff Click from Azul Systems <a href="http://www.infoq.com/presentations/click-crash-course-modern-hardware">gives a talk</a> on one of my favourite subjects, Computer Architecture. He looks at the difficulty of predicting performance on modern x86 CPUs.
<h2>Ride to Portside (2010-01-12)</h2>I rode to Portside this morning to try out MotionX GPS on the iPhone. I found a couple of weird signs along the way.<br />
<br />
Date: 12/01/2010 8:49 am<br />
Distance: 22.0 kilometers<br />
Elapsed Time: 1:12:46<br />
Avg. Speed: 18.1 km/h<br />
Max. Speed: 44.5 km/h<br />
<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com/maps/ms?ie=UTF8&t=p&msa=0&msid=108787632614172197584.00047ceceecf59cc72900&ll=-27.456874,153.052168&spn=0.045698,0.042915&z=14&output=embed"></iframe> <br />
<br />
<small><a href="http://maps.google.com/maps/ms?ie=UTF8&t=p&msa=0&msid=108787632614172197584.00047ceceecf59cc72900&ll=-27.456874,153.052168&spn=0.045698,0.042915&z=14&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<h2>Turing's Cathedral (2010-01-10)</h2><a href="http://www.edge.org/3rd_culture/dyson05/dyson05_index.html">Turing's Cathedral</a> is an interesting look at computing at the 60th anniversary of John von Neumann's proposal for the digital computer. It covers aspects of computational models, biology, AI and the growing wealth of knowledge on the internet.<br/><br/>The quotes below show how we have far exceeded the original expectations of computation. Even though we are still programming the von Neumann architecture / Turing machines, I have always wondered how much the languages we use would change given an entirely different computational model.<br/><br/>"When the machine finally became operational in 1951, it had 5 kilobytes of random-access memory: a 32 x 32 x 40 matrix of binary digits, stored as a flickering pattern of electrical charge, shifting from millisecond to millisecond on the surface of 40 cathode-ray tubes."<br/><br/>"By breaking the distinction between numbers that mean things and numbers that do things, von Neumann unleashed the power of the stored-program computer, and our universe would never be the same."<br/><br/>"In the early 1950s, when mean time between memory failure was measured in minutes, no one imagined that a system depending on every bit being in exactly the right place at exactly the right time could be scaled up by a factor of 10^13 in size, and down by a factor of 10^6 in time. Von Neumann, who died prematurely in 1957, became increasingly interested in understanding how biology has managed (and how technology might manage) to construct reliable organisms out of unreliable parts. He believed the von Neumann architecture would soon be replaced by something else. Even if codes could be completely debugged, million-cell memories could never be counted upon, digitally, to behave consistently from one kilocycle to the next."<br/><br/>"As organisms, we possess two outstanding repositories of information: the information conveyed by our genes, and the information stored in our brains. Both of these are based upon non-von-Neumann architectures, and it is no surprise that Von Neumann became fascinated with these examples as he left his chairmanship of the AEC (where he had succeeded Lewis Strauss) and began to lay out the research agenda that cancer prevented him from following up."<br/><br/>"We can divide the computational universe into three sectors: computable problems; non-computable problems (that can be given a finite, exact description but have no effective procedure to deliver a definite result); and, finally, questions whose answers are, in principle, computable, but that, in practice, we are unable to ask in unambiguous language that computers can understand."
<h2>USA (2009-12-16)</h2>Earlier this year I travelled to the USA for <a href="http://www.sigir2009.org">SIGIR 2009</a>. I visited New York City, Boston, Santa Fe, Los Alamos and Taos. I have made another travel map!<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&source=embed&msa=0&msid=108787632614172197584.000479ae0abcefa83220b&ll=13.923404,-148.710937&spn=113.090606,176.132813&z=2&output=embed"></iframe> <br />
<br />
<br />
<small><a href="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&source=embed&msa=0&msid=108787632614172197584.000479ae0abcefa83220b&ll=13.923404,-148.710937&spn=113.090606,176.132813&z=2&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<br />
<h2>Long Distance (2009-10-26)</h2>Q: Why did the programmer call his mother long distance?<br/>A: Because that was her name.
<h2>Computing Then (2009-06-21)</h2><a href="http://computingnow.computer.org/ct">Computing Then</a> is an interesting look into the past of computing. With much focus on the "now" of computing and electronics, this is a nice break. I always enjoy the 16 & 32 years ago column in the IEEE Computer publication. I remember 16 years ago quite vividly, but alas I am not 32 yet. The older I get, the fewer powers of 2 I get to experience. I doubt I will make 2^7.
<h2>Adaptive Compression of the Dynamic Range (2009-05-26)</h2><a href="http://en.wikipedia.org/wiki/Dynamic_range_compression">Dynamic range compression</a> is a technique used by audio engineers to optimise the distribution of frequencies in a mix. In popular music it is often abused, resulting in a flat and overly loud sound. An audio engineer applies static parameters of threshold, knee, compression ratio, attack, release and gain based on the typical listening environment of their anticipated user. This is why you can hear your favourite top 40 track in a noisy environment but you can barely hear classical music at the same volume level on your audio device. Therefore, I propose that these parameters should adapt to the environment of the user. If it is noisy, the compression ratio is pushed up, and if it is quiet it can be relaxed. Intuitively, I assume this would be a relatively easy problem to solve with machine learning. My explanation using a single parameter is deliberately simple; I am sure more complex relationships exist between the parameters and user satisfaction. If someone could embed this in a popular music player with the correct audio source, it could be a winning combination. However, it requires an unmastered, or minimally mastered, audio source.
<h2>Dusty (2009-02-14)</h2>This dust storm in Broken Hill just blew my mind. Looks so awesome!<br />
<br />
<iframe width="100%" height="345" src="http://www.youtube.com/embed/95tmYmeHf84#" frameborder="0" allowfullscreen></iframe><br />
<h2>Europe (2009-01-24)</h2>I recently travelled to Europe to attend <a href="http://www.inex.otago.ac.nz/">INEX</a>.<br />
<br />
<br />
<iframe width="100%" height="350" frameborder="0" scrolling="no" marginheight="0" marginwidth="0" src="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.00046131f8bc6925a392f&ll=13.581921,68.554688&spn=113.167466,175.429688&z=2&output=embed"></iframe> <br />
<br />
<br />
<small><a href="http://maps.google.com.au/maps/ms?ie=UTF8&hl=en&msa=0&msid=108787632614172197584.00046131f8bc6925a392f&ll=13.581921,68.554688&spn=113.167466,175.429688&z=2&" style="color:#0000FF;text-align:left">View Larger Map</a></small><br />
<h2>500 days (2009-01-23)</h2># uptime<br/>06:01:30 up 509 days, 6:12, 1 user, load average: 0.00, 0.00, 0.00<br/><br/>Now that I have passed the 500 day mark I can bring myself to install a new distro. Time to say goodbye to Slackware after 14 years and replace it with noobuntu server. Package management just seems easier. This is my last box with Slackware still on it.
<h2>K-tree, NMF and INEX 2008 (2008-10-28)</h2>Today I gave a <a href="http://de-vries.ws/presentation.pdf">presentation</a> within my research group at <a href="http://www.qut.edu.au">QUT</a>. It discusses the submissions I made for the <a href="http://xmlmining.lip6.fr/">XML Mining track</a> at <a href="http://www.inex.otago.ac.nz/">INEX 2008</a>. One task required classifying documents based on previously known examples (<a href="http://en.wikipedia.org/wiki/Statistical_classification">classification</a>). Another task required grouping similar documents together with no prior information other than the documents themselves (<a href="http://en.wikipedia.org/wiki/Cluster_analysis">clustering</a>). I also looked at different ways to measure cluster quality using negentropy and document link graphs. The <a href="http://sourceforge.net/projects/k-tree">K-tree</a> algorithm is part of my research; this is the first time it has been applied to document clustering. The results for the entire track should be out soon. I will also be giving the presentation at the QLD IEEE Computational Intelligence Symposium.