Tuesday 29 May 2012

Opening a TAR file in Python containing millions of files

I was recently opening a tarball containing millions of XML files. I got about halfway through parsing the files when my VM ran out of memory and came to a grinding halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the suspects: the tarfile and lxml modules I was using to process the data.

I initially thought this might be because I was not closing the buffer returned by the TarFile.extractfile() function. This was not the case.

It turns out that the TarFile class keeps a cached TarInfo object for every archive member it has read, in a list attribute called members. With millions of files in the archive, that list keeps growing until memory runs out. The workaround, which is not in the official documentation but has been described by Alexander Dutton, is to set members to an empty list after processing each file.

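import tarfile


tar = tarfile.open('large.tar.gz', 'r:gz')
for tarinfo in tar:
  # open the file from the archive as an in-memory buffer;
  # extractfile() returns None for members that are not regular files
  # (directories, for example), so guard against that
  buf = tar.extractfile(tarinfo)

  if buf is not None:
    for line in buf:
      # process() stands in for your own per-line handling
      process(line)
    buf.close()

  # free the cached data structures held by the TarFile object
  tar.members = []

tar.close()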
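
To see the caching for yourself, here is a small sketch that builds a throwaway archive in memory and records how long the members list gets while iterating, with and without the reset. The helper names build_archive and peak_cached_members are my own and purely illustrative, and the fifty tiny XML files stand in for the real data.

```python
import io
import tarfile


def build_archive(n_files):
  """Build a small gzip-compressed tar archive in memory (illustrative fixture)."""
  raw = io.BytesIO()
  with tarfile.open(fileobj=raw, mode='w:gz') as tar:
    for i in range(n_files):
      data = ('<doc id="%d"/>\n' % i).encode('utf-8')
      info = tarfile.TarInfo(name='doc%d.xml' % i)
      info.size = len(data)
      tar.addfile(info, io.BytesIO(data))
  raw.seek(0)
  return raw


def peak_cached_members(archive, reset):
  """Iterate over the archive and return the largest len(tar.members) seen."""
  peak = 0
  with tarfile.open(fileobj=archive, mode='r:gz') as tar:
    for tarinfo in tar:
      peak = max(peak, len(tar.members))
      if reset:
        # same workaround as above: drop the cached TarInfo objects
        tar.members = []
  return peak


if __name__ == '__main__':
  print('without reset:', peak_cached_members(build_archive(50), reset=False))
  print('with reset:   ', peak_cached_members(build_archive(50), reset=True))
```

Without the reset the list grows by one entry per file read; with the reset it never holds more than the entry currently being processed. Fifty entries are harmless, but the same growth applies to an archive with millions of members.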

Monday 14 May 2012

Alt key not working in Mac OS X terminal

I have had trouble switching windows in irssi (alt + 1, alt + 2, ..., alt + n) since I started using my MacBook more often. It turns out that the OS X Terminal application rebinds the alt key for shortcuts.

It is possible to rebind the alt key so it will work as expected.

To do this, go to
Terminal > Preferences > Settings > Keyboard
and select 'Use option as meta key'. The alt key will then be passed through to the terminal.