Original source: http://stackoverflow.com/questions/37474767/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Read .tar.gz file in Python
Asked by KrunalParmar
I have a 25 GB text file. I compressed it to tar.gz, which brought it down to 450 MB. Now I want to read that file from Python and process the text data. For this I referred to another question, but in my case the code doesn't work. The code is as follows:
import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    content = f.read()
    Data = np.loadtxt(content)
The error is as follows:
Traceback (most recent call last):
  File "dataExtPlot.py", line 21, in <module>
    content = f.read()
AttributeError: 'NoneType' object has no attribute 'read'
Also, is there any other method to do this task?
Answered by Raymond Hettinger
The docs tell us that None is returned by extractfile() if the member is not a regular file or link.
One possible solution is to skip over the None results:
import tarfile

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is not None:  # directories and other non-file members yield None
        content = f.read()
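The behaviour can be reproduced without any external file. The following is a minimal sketch, assuming an in-memory archive that holds one directory and one regular file; the member names `data` and `data/values.txt` and the helpers `build_sample_archive` and `readable_members` are made up for illustration and are not part of the original answer.

```python
import io
import tarfile


def build_sample_archive():
    """Return the bytes of a small .tar.gz with one directory and one file (hypothetical names)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        # A directory entry: it has no data stream of its own.
        dir_info = tarfile.TarInfo("data")
        dir_info.type = tarfile.DIRTYPE
        tar.addfile(dir_info)

        # A regular file entry with some text in it.
        payload = b"1 2 3\n4 5 6\n"
        file_info = tarfile.TarInfo("data/values.txt")
        file_info.size = len(payload)
        tar.addfile(file_info, io.BytesIO(payload))
    return buf.getvalue()


def readable_members(archive_bytes):
    """Map each member name to True if extractfile() gave a file object, False if it gave None."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            result[member.name] = tar.extractfile(member) is not None
    return result


if __name__ == "__main__":
    print(readable_members(build_sample_archive()))
```

The directory entry is the one that comes back as None, which is the situation that triggers the AttributeError in the question.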
Answered by mhawke
tarfile.extractfile() can return None if the member is neither a file nor a link. For example, your tar archive might contain directories or device files. To fix:
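import tarfile
import numpy as np

tar = tarfile.open("filename.tar.gz", "r:gz")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f:
        # Pass the file object itself: np.loadtxt() expects a file name,
        # a file-like object, or an iterable of lines, not raw bytes.
        Data = np.loadtxt(f)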
Answered by VICTOR
You may try this one:
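import tarfile

t = tarfile.open("filename.gz", "r")
for filename in t.getnames():
    try:
        f = t.extractfile(filename)
        Data = f.read()
        print(filename, ':', Data)
    except (KeyError, AttributeError):
        # KeyError: name not in the archive; AttributeError: extractfile() returned None
        print('ERROR: Did not find %s in tar archive' % filename)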
Answered by Philippe Ombredanne
You cannot "read" the content of some special files such as links, yet tar supports them and tarfile will extract them all right. When tarfile extracts them, it does not return a file-like object but None. You get an error because your tarball contains such a special file.
One approach is to determine the type of an entry in the tarball you are processing before extracting it: with this information at hand you can decide whether or not you can "read" the file. You can achieve this by calling getmembers() on the open archive, which returns tarfile.TarInfo objects that contain detailed information about the type of each file contained in the tarball.
The tarfile.TarInfo class has all the attributes and methods you need to determine the type of a tar member, such as isfile(), isdir(), islnk() or issym(), so that you can decide what to do with each member (extract it or not, etc.).
For instance, I use these to test the type of file in this patched tarfile, to skip extracting special files and process links in a special way:
for tinfo in tar.getmembers():
    is_special = not (tinfo.isfile() or tinfo.isdir()
                      or tinfo.islnk() or tinfo.issym())
    ...
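To make the type checks concrete, here is a minimal sketch that labels every entry of an archive before anything is extracted. It assumes a small in-memory archive; the entry names (`notes.txt`, `folder`, `notes-link`, `pipe`) and the helpers `classify`, `build_demo_archive` and `classify_archive` are hypothetical and do not come from the patched tarfile mentioned above.

```python
import io
import tarfile


def classify(tinfo):
    """Return a short label for a TarInfo entry using its type-check methods."""
    if tinfo.isfile():
        return "file"
    if tinfo.isdir():
        return "directory"
    if tinfo.issym():
        return "symbolic link"
    if tinfo.islnk():
        return "hard link"
    return "special"  # FIFO, character or block device, etc.


def build_demo_archive():
    """Build an in-memory .tar.gz with one entry of several types (hypothetical names)."""
    buf = io.BytesIO()
    payload = b"hello\n"
    entries = [
        ("notes.txt", tarfile.REGTYPE),
        ("folder", tarfile.DIRTYPE),
        ("notes-link", tarfile.SYMTYPE),
        ("pipe", tarfile.FIFOTYPE),
    ]
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, entry_type in entries:
            info = tarfile.TarInfo(name)
            info.type = entry_type
            if entry_type == tarfile.SYMTYPE:
                info.linkname = "notes.txt"  # target of the symbolic link
            if entry_type == tarfile.REGTYPE:
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
            else:
                tar.addfile(info)  # no data stream for non-regular entries
    return buf.getvalue()


def classify_archive(archive_bytes):
    """Map each member name to its label without extracting anything."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return {tinfo.name: classify(tinfo) for tinfo in tar.getmembers()}


if __name__ == "__main__":
    for name, label in classify_archive(build_demo_archive()).items():
        print(f"{name}: {label}")
```

With the labels in hand, a caller can read only the entries marked "file" and handle links or special entries separately.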
Answered by Jadli
In a Jupyter notebook you can download and unpack an archive in one step, like this:
!wget -c http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz -O - | tar -xz
Answered by MonkandMonkey
My needs:
- Python 3.
- My tar.gz file consists of multiple utf-8 text files and directories.
- Need to read text lines from all files.
Problems:
- The file object that extractfile() returns for a member listed by tar.getmembers() may be None.
- The content read from extractfile(fname) is a bytes string (e.g. b'Hello\t\xe4\xbd\xa0\xe5\xa5\xbd'), so Unicode characters don't display correctly.
Solutions:
- Check the type of the tar member first. I referenced the example in the docs of the tarfile library (search for "How to read a gzip compressed tar archive and display some member information").
- Decode from bytes to a normal str (ref: most voted answer).
Code:
import logging
import tarfile

logger = logging.getLogger(__name__)

with tarfile.open("sample.tar.gz", "r:gz") as tar:
    for tarinfo in tar:
        logger.info(f"{tarinfo.name} is {tarinfo.size} bytes in size and is: ")
        if tarinfo.isreg():
            logger.info(f"Is regular file: {tarinfo.name}")
            f = tar.extractfile(tarinfo.name)
            # To get a str instead of bytes,
            # decode with the proper encoding, e.g. utf-8
            content = f.read().decode('utf-8', errors='ignore')
            # Split the long str into lines
            # Specify your line separator, e.g. \n
            lines = content.split('\n')
            for i, line in enumerate(lines):
                print(f"[{i}]: {line}\n")
        elif tarinfo.isdir():
            logger.info(f"Is dir: {tarinfo.name}")
        else:
            logger.info(f"Is something else: {tarinfo.name}.")
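For a member as large as the 25 GB file in the question, reading everything with f.read() and splitting afterwards may not fit in memory. A possible variation, sketched here under the assumption that the members are utf-8 text, wraps the extracted file object in io.TextIOWrapper so that lines are decoded one at a time. The function name `iter_text_lines` and its return shape are illustrative choices, not part of the original answer.

```python
import io
import tarfile


def iter_text_lines(archive_path, encoding="utf-8"):
    """Yield (member_name, line_number, line) for every regular file in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for tarinfo in tar:
            if not tarinfo.isreg():
                continue  # skip directories, links and special files
            raw = tar.extractfile(tarinfo)
            # TextIOWrapper decodes incrementally, so the whole member
            # is never held in memory at once.
            with io.TextIOWrapper(raw, encoding=encoding, errors="replace") as text:
                for number, line in enumerate(text):
                    yield tarinfo.name, number, line.rstrip("\n")


if __name__ == "__main__":
    # "sample.tar.gz" is the file name used in the answer above.
    for name, number, line in iter_text_lines("sample.tar.gz"):
        print(f"{name} [{number}]: {line}")
```

Because the function is a generator, the caller decides how much to keep: it can aggregate, filter, or stop early without loading the rest of the archive.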