pandas: How to read a compressed (gz) CSV file into a dask dataframe?

Question

Asked by Magellan88

Is there a way to read a .csv file that is compressed via gz into a dask dataframe?

I've tried it directly with

import dask.dataframe as dd
df = dd.read_csv("Data.gz")

but I get a Unicode error (probably because it is interpreting the compressed bytes as text). There is a "compression" parameter, but compression = "gz" does not work, and I can't find any documentation so far.
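
The guess about compressed bytes being interpreted as text can be checked with the standard library alone. The following is a minimal sketch, not taken from the original post: the sample CSV text is made up, and it only shows that raw gzip output is not valid UTF-8, which is consistent with the error described.

```python
import gzip

# Hypothetical sample data, for illustration only.
csv_text = "a,b\n1,2\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# Every gzip stream begins with the magic bytes 0x1f 0x8b.
starts_with_magic = compressed[:2] == b"\x1f\x8b"

# Decoding the raw compressed bytes as UTF-8 fails, because 0x8b
# is not a valid start byte in UTF-8.
try:
    compressed.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decompressing first and decoding afterwards recovers the text.
roundtrip = gzip.decompress(compressed).decode("utf-8")
```

A reader that skips the decompression step will hit the same kind of failure on the first bytes of the file.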

With pandas I can read the file directly without a problem other than the result blowing up my memory ;-) but if I restrict the number of lines it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)  # nrows limits the number of rows read

Answer 1

Accepted answer by Christian Alis

It's actually a long-standing limitation of dask. Load the files with dask.delayed instead:

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]

df = dd.from_delayed(dfs) # df is a dask dataframe

Answer 2

Answer by de1

The current pandas documentation says:

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'

Since 'infer' is the default, that would explain why it is working with pandas.

Dask's documentation on the compression argument:

String like 'gzip' or 'xz'. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

That would suggest that it should also infer the compression for at least gz. That it doesn't (and it still does not in 0.15.3) may be a bug. However, it is working using compression='gzip'.

i.e.:

import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')

Answer 3

Answer by Dervin Thunk

Without the file it's difficult to say. What if you set the source encoding, for example # -*- coding: latin-1 -*-? Or, since read_csv is based on pandas, you could even try dd.read_csv('Data.gz', encoding='utf-8'). Here's the list of Python encodings: https://docs.python.org/3/library/codecs.html#standard-encodings
