Reading a .tar.gz file in Python

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/37474767/

Date: 2020-08-19 19:26:25  Source: igfitidea

Read .tar.gz file in Python

Tags: python, file, tar, gzip

Asked by KrunalParmar

I have a 25 GB text file. I compressed it to tar.gz, which brought it down to 450 MB. Now I want to read that file from Python and process the text data. For this I referred to another question, but in my case the code doesn't work. The code is as follows:

import tarfile
import numpy as np 

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
     f=tar.extractfile(member)
     content = f.read()
     Data = np.loadtxt(content)

The error is as follows:

Traceback (most recent call last):
  File "dataExtPlot.py", line 21, in <module>
    content = f.read()
AttributeError: 'NoneType' object has no attribute 'read'

Also, is there any other method to do this task?

Answered by Raymond Hettinger

The docs tell us that None is returned by extractfile() if the member is not a regular file or link.

One possible solution is to skip over the None results:

import tarfile

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
     f = tar.extractfile(member)
     # extractfile() returns None for members that are not regular files or links
     if f is not None:
         content = f.read()
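
The skip-on-None idea above can also be wrapped in a small generator, so the caller only ever sees members that can actually be read. This is a minimal sketch, not part of the original answer: the function name iter_readable_members is hypothetical, and it assumes a gzip-compressed archive on disk.

```python
import tarfile


def iter_readable_members(path):
    """Yield (name, file_object) for members that extractfile() can open.

    Hypothetical helper for illustration; assumes `path` is a .tar.gz file.
    """
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is None:
                # Directories, devices, FIFOs, etc. have no readable content.
                continue
            # The file object is only valid while the archive is still open,
            # so consume it before advancing the generator.
            yield member.name, f
```

A caller might then loop with `for name, f in iter_readable_members("filename.tar.gz"):` and read from `f` inside the loop body.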

Answered by mhawke

tarfile.extractfile() can return None if the member is neither a file nor a link. For example, your tar archive might contain directories or device files. To fix:

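import io
import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
     f = tar.extractfile(member)
     # Skip directories, device files, etc., for which extractfile() returns None
     if f:
         # np.loadtxt expects a file name or file-like object, not raw bytes
         Data = np.loadtxt(io.TextIOWrapper(f, encoding="utf-8"))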

Answered by VICTOR

You may try this one:

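import tarfile

t = tarfile.open("filename.gz", "r")
for filename in t.getnames():
    try:
        f = t.extractfile(filename)
        Data = f.read()
        print(filename, ':', Data)
    except (AttributeError, KeyError):
        # extractfile() returned None (not a regular file) or the name was not found
        print('ERROR: Did not find %s in tar archive' % filename)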

Answered by Philippe Ombredanne

You cannot "read" the content of some special files such as links, yet tar supports them and tarfile will extract them all right. When tarfile extracts them, it does not return a file-like object but None. You get an error because your tarball contains such a special file.

One approach is to determine the type of each entry in the tarball before extracting it: with this information at hand you can decide whether or not you can "read" the file. You can achieve this by calling tarfile.getmembers(), which returns tarfile.TarInfo objects that contain detailed information about the type of each file in the tarball.

The tarfile.TarInfo class has all the attributes and methods you need to determine the type of a tar member, such as isfile(), isdir(), tinfo.islnk(), or tinfo.issym(), so you can decide accordingly what to do with each member (extract it or not, etc.).

For instance, I use these to test the type of file in this patched tarfile, to skip extracting special files and to process links in a special way:

for tinfo in tar.getmembers():
    is_special = not (tinfo.isfile() or tinfo.isdir()
                      or tinfo.islnk() or tinfo.issym())
...
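
To make the type checks above concrete, here is a rough sketch that labels each member using only the TarInfo predicates mentioned in this answer. The helper names classify_member and summarize are hypothetical and are not taken from the patched tarfile referred to above; what to do with each label is left to the caller.

```python
import tarfile


def classify_member(tinfo):
    """Return a short label for a TarInfo based on its type predicates."""
    if tinfo.isfile():
        return "file"
    if tinfo.isdir():
        return "dir"
    if tinfo.issym():
        return "symlink"
    if tinfo.islnk():
        return "hardlink"
    # Character/block devices, FIFOs, and anything else
    return "special"


def summarize(path):
    """Map each member name in the archive at `path` to its label."""
    # "r:*" lets tarfile detect the compression transparently.
    with tarfile.open(path, "r:*") as tar:
        return {tinfo.name: classify_member(tinfo) for tinfo in tar}
```

With such a mapping in hand, only the members labelled "file" would normally be passed to extractfile() and read.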

Answered by Jadli

In a Jupyter notebook you can do the following:

!wget -c http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -O - | tar -xz
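
Outside a notebook, roughly the same download-and-unpack pipeline could be sketched in pure Python. This is an illustration under assumptions, not an equivalent of the command above: the function names are hypothetical, there is no resume support like wget -c, and unpacking archives from untrusted sources should be done with care.

```python
import tarfile
import urllib.request


def extract_tar_gz_stream(fileobj, dest_dir="."):
    """Unpack a gzip-compressed tar stream into dest_dir without seeking."""
    # Mode "r|gz" reads the archive as a forward-only stream,
    # similar to `tar -xz` reading from a pipe.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        # Use the safer "data" extraction filter where this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(path=dest_dir, **kwargs)


def download_and_extract(url, dest_dir="."):
    """Fetch `url` and unpack it on the fly (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        extract_tar_gz_stream(response, dest_dir)
```

Calling download_and_extract() with the archive URL would then unpack the files into the current directory.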

Answered by MonkandMonkey

My needs:

  1. Python 3.
  2. My tar.gz file consists of multiple UTF-8 text files and directories.
  3. I need to read text lines from all files.

Problems:

  1. Not every member returned by tar.getmembers() is a readable file, so extractfile() may return None.
  2. The content that extractfile(fname) returns is a bytes string (e.g. b'Hello\t\xe4\xbd\xa0\xe5\xa5\xbd'), so Unicode characters don't display correctly.

Solutions:

  1. Check the type of each tar member first. I followed the example in the docs of the tarfile library. (Search for "How to read a gzip compressed tar archive and display some member information".)
  2. Decode from bytes to a normal str. (Ref: the most-voted answer.)

Code:

import logging
import tarfile

# Logger setup assumed here; the original snippet used `logger` without defining it
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

with tarfile.open("sample.tar.gz", "r:gz") as tar:
    for tarinfo in tar:
        logger.info(f"{tarinfo.name} is {tarinfo.size} bytes in size and is: ")
        if tarinfo.isreg():
            logger.info(f"Is regular file: {tarinfo.name}")
            f = tar.extractfile(tarinfo.name)
            # To get the str instead of bytes str
            # Decode with proper coding, e.g. utf-8
            content = f.read().decode('utf-8', errors='ignore')
            # Split the long str into lines
            # Specify your line-sep: e.g. \n
            lines = content.split('\n')
            for i, line in enumerate(lines):
                print(f"[{i}]: {line}\n")
        elif tarinfo.isdir():
            logger.info(f"Is dir: {tarinfo.name}")
        else:
            logger.info(f"Is something else: {tarinfo.name}.")
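
For very large members, such as the 25 GB text file in the question, reading and decoding the whole content at once may not fit in memory. A possible variation, sketched here under the assumption that the files are UTF-8 text, wraps the extracted file object in io.TextIOWrapper so lines are decoded as they are read. The function name iter_text_lines is hypothetical.

```python
import io
import tarfile


def iter_text_lines(path, encoding="utf-8"):
    """Yield (member_name, line_number, line) for every regular file in the archive."""
    with tarfile.open(path, "r:gz") as tar:
        for tarinfo in tar:
            if not tarinfo.isreg():
                continue  # skip directories, links, devices
            raw = tar.extractfile(tarinfo)
            # TextIOWrapper decodes incrementally instead of loading
            # the whole member into memory first.
            with io.TextIOWrapper(raw, encoding=encoding, errors="replace") as text:
                for number, line in enumerate(text):
                    yield tarinfo.name, number, line.rstrip("\n")
```

Each yielded line is already a normal str, so it can be processed or printed directly without a separate decode step.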