I was recently opening a tarball containing millions of XML files. About halfway through parsing them, my VM ran out of memory and came to a grinding halt. Something was causing high memory usage. Knowing that Python has automatic memory management, I considered the suspects: the tarfile and
lxml modules I was using to process the data.
I initially thought this might be because I was not closing the buffer returned by the
TarFile.extractfile() function. That turned out not to be the case.
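For concreteness, here is a minimal sketch of what explicitly closing each buffer looks like; in my case it made no difference to memory usage. process() is the same placeholder for the real per-line work used in the example further down.

import tarfile

tar = tarfile.open('large.tar.gz', 'r:gz')
for tarinfo in tar:
    # extractfile() returns None for directories, so skip non-file members
    if not tarinfo.isfile():
        continue
    buf = tar.extractfile(tarinfo)
    try:
        for line in buf:
            process(line)  # placeholder for the real per-line work
    finally:
        # close the in-memory buffer explicitly
        buf.close()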
It turns out that the TarFile class keeps a cached TarInfo object for every member it has read in an attribute called members. The workaround, an undocumented one described by Alexander Dutton, is to set the members attribute to an empty list after processing each file.
import tarfile

tar = tarfile.open('large.tar.gz', 'r:gz')
for tarinfo in tar:
    # extractfile() returns None for directories and other non-file
    # members, so skip them
    if not tarinfo.isfile():
        continue
    # open the file from the archive as an in-memory buffer
    buf = tar.extractfile(tarinfo)
    for line in buf:
        # do something with the line
        process(line)
    # free the cached TarInfo objects held by the TarFile object
    tar.members = []
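Since the archive in question held XML documents being parsed with lxml, here is a hedged sketch of how the workaround fits that setup. handle_record() is a hypothetical stand-in for whatever processing each parsed document actually needs.

import tarfile
from lxml import etree

def handle_record(tree):
    # hypothetical stand-in for the real per-document processing
    pass

with tarfile.open('large.tar.gz', 'r:gz') as tar:
    for tarinfo in tar:
        # extractfile() returns None for directories, so skip them
        if not tarinfo.isfile():
            continue
        buf = tar.extractfile(tarinfo)
        try:
            # parse the member directly from the in-memory buffer
            handle_record(etree.parse(buf))
        finally:
            buf.close()
        # clear the cached TarInfo objects so memory use stays flat
        tar.members = []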