Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/39924518/

How to read a compressed (gz) CSV file into a dask Dataframe?

Tags: python, csv, pandas, dask

Asked by Magellan88

Is there a way to read a .csv file that is compressed via gz into a dask dataframe?

I've tried it directly with

import dask.dataframe as dd
df = dd.read_csv("Data.gz")

but I get a Unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression = "gz" doesn't work, and I can't find any documentation so far.
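
The Unicode error is consistent with compressed bytes being decoded as text. A minimal sketch using only the standard library, with made-up sample data standing in for the contents of Data.gz: every gzip stream starts with the magic bytes 0x1f 0x8b, and 0x8b cannot start a UTF-8 sequence, so decoding fails right away.

```python
import gzip

# Hypothetical sample data standing in for the contents of Data.gz.
raw_csv = b"a,b\n1,2\n3,4\n"
compressed = gzip.compress(raw_csv)

# Every gzip stream begins with the two magic bytes 0x1f 0x8b.
magic = compressed[:2]

# Decoding the compressed bytes as UTF-8 fails, because 0x8b
# is not a valid first byte of a UTF-8 sequence.
try:
    compressed.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decompressing first gives back ordinary text.
text = gzip.decompress(compressed).decode("utf-8")
```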

With pandas I can read the file directly without a problem other than the result blowing up my memory ;-) but if I restrict the number of lines it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)  # nrows limits the number of rows read
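
When the whole file does not fit in memory, pandas can also read a gzip file piece by piece. A minimal sketch, assuming a small made-up file written to a temporary directory in place of Data.gz; nrows and chunksize are standard pandas.read_csv parameters.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small made-up gzip CSV to stand in for Data.gz.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "Data.gz")
with gzip.open(path, "wt", newline="") as f:
    f.write("x,y\n")
    for i in range(10):
        f.write(f"{i},{i * i}\n")

# Read only the first 3 data rows; compression is inferred from ".gz".
head = pd.read_csv(path, nrows=3)

# Or iterate over the file in chunks of 4 rows to bound memory use.
total = 0
for chunk in pd.read_csv(path, chunksize=4):
    total += chunk["y"].sum()
```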

Accepted answer by Christian Alis

It's actually a long-standing limitation of dask. Load the files with dask.delayed instead:

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]

df = dd.from_delayed(dfs) # df is a dask dataframe

Answer by de1

The current pandas documentation says:

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'

Since 'infer' is the default, that would explain why it is working with pandas.

Dask's documentation on the compression argument:

String like 'gzip' or 'xz'. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically.

That suggests dask should also infer the compression, at least for gz. That it doesn't (and it still doesn't in 0.15.3) may be a bug. However, it does work with compression='gzip'.

i.e.:

import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')

Answer by Dervin Thunk

Without the file it's difficult to say. What if you set the source encoding, e.g. # -*- coding: latin-1 -*-? Or, since read_csv is based on pandas, you could even try dd.read_csv('Data.gz', encoding='utf-8'). Here's the list of Python encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
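
One way to check whether the text encoding is really the issue is to decompress the file first and try candidate encodings on the decompressed text. A minimal standard-library sketch, assuming a made-up latin-1 file; first_working_encoding is a hypothetical helper, not part of pandas or dask.

```python
import gzip
import os
import tempfile

# Write a small made-up gzip file whose text is latin-1 encoded.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "Data.gz")
with gzip.open(path, "wb") as f:
    f.write("name\ncaf\u00e9\n".encode("latin-1"))


def first_working_encoding(path, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper for illustration. Note that latin-1 accepts any
    byte sequence, so it should come last in the candidate list.
    """
    for enc in candidates:
        try:
            with gzip.open(path, "rt", encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    return None


encoding = first_working_encoding(path)
```

If the decompressed text decodes cleanly as UTF-8, the error more likely comes from the compression handling than from the encoding.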
