.gz file to pandas DataFrame with hive delimiter

Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/25063920/

Time: 2020-09-13 22:18:53  Source: igfitidea

Tags: python, pandas, hive, paramiko, tsv

Asked by Keith

I am getting a very odd result when I try to load my .gz data file.

My code is pretty simple:

dt = pd.read_table(gzip.open("file.gz"))

but I get a very odd delimiter. I had expected a tab ('\t'), but IPython displays it as a WHITE LEFT-POINTING TRIANGLE. Most other programs do not display it at all.

The data originally comes from Hive through paramiko; if that matters, I can give more details. Does anybody have a suggestion for how to split on such a delimiter?

EDIT:

print(gzip.open("file.gz").read()[-5])

This prints exactly that character.

And

In [28]: gzip.open("file.gz").read()[-5]
Out[28]: '\x01'
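
For anyone trying to reproduce this kind of check, here is a minimal sketch of one way to find out which non-printable byte a gzipped file uses as its separator. The sample data and the temporary file are made up for illustration; they are not the data from the question.

```python
import gzip
import os
import tempfile
import unicodedata

# Hypothetical sample: two rows whose fields are separated by the byte 0x01
sample = b"a\x01b\x01c\n1\x012\x013\n"

# Write the sample to a temporary .gz file (a stand-in for the real file)
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wb") as f:
    f.write(sample)

# Read back only the first line instead of the whole file
with gzip.open(path, "rb") as f:
    first_line = f.readline()

# repr() shows non-printable bytes as escape sequences such as \x01
print(repr(first_line))

# Collect the control bytes on that line, ignoring the line terminators
delims = sorted({b for b in first_line if b < 0x20 and b not in (0x0A, 0x0D)})
for b in delims:
    # Category "Cc" means the code point is a control character
    print(hex(b), unicodedata.category(chr(b)))
```

Looking at the repr() of a line avoids depending on how a particular terminal or notebook chooses to draw a control character.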

Answered by Keith

pd.read_table("file.gz",compression='gzip',sep='\x01')

or

pd.read_table(gzip.open('file.gz'),sep='\x01')

Both will work: '\x01' (Ctrl-A) is Hive's default field delimiter.
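
A self-contained way to try this out, as a rough sketch: the file below is a small made-up sample written with '\x01' between fields, and the column names are hypothetical. It uses pd.read_csv, which accepts the same sep and compression arguments as pd.read_table.

```python
import gzip
import os
import tempfile

import pandas as pd

# Build a small gzipped sample with \x01 between fields
# (hypothetical data, standing in for a real Hive export)
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("id\x01name\x01score\n")
    f.write("1\x01alice\x013.5\n")
    f.write("2\x01bob\x014.0\n")

# Let pandas decompress the file and split each line on the \x01 byte
df = pd.read_csv(path, compression="gzip", sep="\x01")
print(df)
```

This sample includes a header row for convenience. If the real export has none, passing header=None together with names=[...] is probably what you want.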
