Saturday, 17 May 2014

Bash oneliner - say random quotes from the internet at random intervals

Here is some weekend fun: speaking random quotes from the internet at random intervals. I have only tested this on my MacBook, and I am not sure whether the say utility exists on other platforms.


while true; do say "$(curl -s 'http://www.quotedb.com/quote/quote.php?action=random_quote' | head -n 1 | sed -e "s/document.write('//g" -e "s/<br>');//g")"; sleep $((RANDOM % 60)); done

Sunday, 23 June 2013

Minimal Test Collection (MTC) Evaluation Utility

I have been using mtc-eval from the TREC 2009 Web Track homepage, and I had trouble running it without it crashing with segmentation faults. I found a newer version on the author's web page, and it fixed the problems I was experiencing. Also note that the GNU Scientific Library, which this software depends on, will install without its LAPACK and BLAS dependencies. So remember to install the lapack, lapack-devel, atlas, atlas-devel, blas and blas-devel packages found on most Linux distributions.

Saturday, 15 June 2013

ClusterEval 1.0 Released

Today I have released ClusterEval 1.0. This program compares a clustering to a ground truth set of categories according to several measures. It also includes a novel approach called 'Divergence from a Random Baseline' that augments existing measures to correct for ineffective clusterings. It has been used to evaluate clustering in the INEX XML Mining track in 2009 and 2010, and will be used in the upcoming Social Event Detection task at MediaEval in 2013. It implements cluster quality metrics based on ground truths, such as Purity, Entropy, Negentropy, F1 and NMI.
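To give a feel for what a ground-truth measure looks like, here is a minimal sketch of Purity in Python. This is not ClusterEval's code; the function name purity and its two list arguments are my own assumptions for illustration. Each cluster is credited with the number of items in its most common category, and the total is divided by the number of items.

```python
from collections import Counter, defaultdict


def purity(clusters, categories):
    """Fraction of items that belong to the majority category of their cluster.

    clusters[i] is the cluster label of item i and categories[i] is its
    ground truth category (hypothetical input format, for illustration only).
    """
    if len(clusters) != len(categories):
        raise ValueError('clusters and categories must have the same length')
    if not clusters:
        raise ValueError('at least one item is required')

    # count how often each category occurs inside each cluster
    counts_by_cluster = defaultdict(Counter)
    for cluster, category in zip(clusters, categories):
        counts_by_cluster[cluster][category] += 1

    # each cluster contributes the size of its largest category
    majority_total = sum(max(counts.values())
                         for counts in counts_by_cluster.values())
    return majority_total / len(clusters)


if __name__ == '__main__':
    clusters = [0, 0, 0, 1, 1, 1]
    categories = ['a', 'a', 'b', 'b', 'b', 'a']
    print(purity(clusters, categories))
```

In the example, each cluster has two items from its majority category, so four of the six items count towards the score.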

Further details describing the use and functionality of this software are available in the manual.

Complete details of the quality measures can be found in the paper 'Document Clustering Evaluation: Divergence from a Random Baseline'.

The Social Event Detection task at MediaEval involves automated detection of social events from real life social networks. If this sounds of interest to you, head over to the task description page and register.

Tuesday, 2 April 2013

The 2013 Social Event Detection Task

The task description for the Social Event Detection Task at MediaEval 2013 has been released. The task involves supervised clustering of events from real social media networks.

The evaluation uses some of my previous work on clustering evaluation, which came from my involvement in the INEX XML Mining track in 2009 and 2010 and is described in the paper "Document Clustering Evaluation: Divergence from a Random Baseline".

If this sounds of interest to you, head over to the task description page and register!

Friday, 8 June 2012

RIP Spunky

I have known Spunky the dog since my family received him as a puppy 15 years ago. He was always happy and full of energy. I would often catch him smiling :)

You can see him chasing things in an earlier post.

He passed away today :(


Tuesday, 29 May 2012

Opening a TAR file in Python containing millions of files

I was recently opening a tarball containing millions of XML files. I got about halfway through parsing the files when my VM ran out of memory and came to a grinding halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the suspects: the tarfile and lxml modules I was using to process the data.

I initially thought this may be because I was not closing the buffer created by the TarFile.extractfile() function. This was not the case.

It turns out that the TarFile class keeps cached copies of information about each file in a variable called members. This behaviour is undocumented, but a workaround has been described by Alexander Dutton: set the members variable to an empty list after processing each file.

import tarfile

tar = tarfile.open('large.tar.gz', 'r:gz')
for tarinfo in tar:
  # open the file from the archive as an in-memory buffer
  buf = tar.extractfile(tarinfo)

  # extractfile() returns None for members that are not regular files
  if buf is not None:
    for line in buf:
      # do something with the line; process() is a placeholder for your own code
      process(line)

  # free the cached data structures held by the TarFile object
  tar.members = []
tar.close()
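To see the members cache from the post above in action without a huge archive, here is a small self-contained sketch. The helpers build_archive and cached_members_after_read are hypothetical names of my own. The first builds a tiny tarball in memory. The second reads every file and reports how many entries are left in the cache, with and without the workaround.

```python
import io
import tarfile


def build_archive(n_files):
    """Build a small gzip-compressed tar archive in memory (test data only)."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode='w:gz') as tar:
        for i in range(n_files):
            data = ('<doc id="%d"/>\n' % i).encode('utf-8')
            info = tarfile.TarInfo(name='doc%d.xml' % i)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    raw.seek(0)
    return raw


def cached_members_after_read(fileobj, clear_cache):
    """Read every file in the archive and return the size of the members cache."""
    with tarfile.open(fileobj=fileobj, mode='r:gz') as tar:
        for tarinfo in tar:
            buf = tar.extractfile(tarinfo)
            if buf is not None:
                buf.read()
            if clear_cache:
                # the workaround: drop the cached entries after each file
                tar.members = []
        return len(tar.members)


if __name__ == '__main__':
    print(cached_members_after_read(build_archive(5), clear_cache=False))
    print(cached_members_after_read(build_archive(5), clear_cache=True))
```

Without the workaround the cache ends up with one entry per file in the archive, which is what adds up over millions of files.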

Monday, 14 May 2012

Alt key not working in Mac OS X terminal

I have had trouble switching windows in irssi (alt + 1, alt + 2, ..., alt + n) since I started using my MacBook more often. It turns out that the OS X Terminal application rebinds the alt key for shortcuts.

It is possible to rebind the alt key so it will work as expected.

To do this, go to
Terminal > Preferences > Settings > Keyboard
and select 'Use option as meta key'. The alt key will then be passed through to the terminal.