Tuesday, 29 May 2012

Opening a TAR file in Python containing millions of files

I was recently opening a tarball containing millions of XML files. About halfway through parsing the files, my VM ran out of memory and ground to a halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the likely suspects: the tarfile and lxml modules I was using to process the data.

I initially thought this might be because I was not closing the buffer returned by the TarFile.extractfile() function, but that was not the case.

It turns out that the TarFile class caches a TarInfo object for every member it reads in an attribute called members, so with millions of files the cache grows without bound. This behaviour is not mentioned in the documentation, but Alexander Dutton has written up a workaround: set the members attribute to an empty list after processing each file.

import tarfile


tar = tarfile.open('large.tar.gz', 'r:gz')
for tarinfo in tar:
    # open the file from the archive as an in-memory buffer;
    # extractfile() returns None for members that are not regular files
    buf = tar.extractfile(tarinfo)
    if buf is None:
        continue

    for line in buf:
        # do something with the line
        process(line)

    # free the cached TarInfo objects held by the TarFile object
    tar.members = []

tar.close()
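
If you need this in more than one script, the same trick can be wrapped in a small generator so the cache-clearing is impossible to forget. This is only a sketch of that idea; iter_tar_lines() is a name I have made up for illustration, not part of the tarfile API.

import tarfile


def iter_tar_lines(path):
    """Yield (member name, line) pairs from a gzipped tarball,
    clearing the TarFile member cache after each file."""
    with tarfile.open(path, 'r:gz') as tar:
        for tarinfo in tar:
            buf = tar.extractfile(tarinfo)
            if buf is None:
                continue
            for line in buf:
                yield tarinfo.name, line
            # drop the cached TarInfo objects to keep memory usage flat
            tar.members = []


for name, line in iter_tar_lines('large.tar.gz'):
    process(line)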
