pandas: How to read a compressed (gz) CSV file into a dask Dataframe?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/39924518/
How to read a compressed (gz) CSV file into a dask Dataframe?
Asked by Magellan88
Is there a way to read a .csv file that is compressed via gz into a dask dataframe?
I've tried it directly with
import dask.dataframe as dd
df = dd.read_csv("Data.gz")
but get a unicode error (probably because it is interpreting the compressed bytes). There is a "compression" parameter, but compression = "gz" won't work and I can't find any documentation on it so far.
With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) but if I restrict the number of lines it works fine.
import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)
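The unicode error can be reproduced with just the standard library: gzip bytes are not valid UTF-8 text, so reading the compressed stream as text fails, while decompressing first works. A minimal sketch (the sample data is invented for illustration, built in memory as a stand-in for Data.gz):

```python
import csv
import gzip
import io

# Build a tiny gzipped CSV in memory (stand-in for Data.gz)
raw = "a,b\n1,2\n3,4\n".encode("utf-8")
compressed = gzip.compress(raw)

# Decoding the compressed bytes as text is what triggers the unicode error:
# the gzip header (starting with 0x1f 0x8b) is not valid UTF-8
try:
    compressed.decode("utf-8")
except UnicodeDecodeError:
    print("compressed bytes are not valid UTF-8")

# Decompressing first gives back a normal CSV stream
rows = list(csv.reader(io.StringIO(gzip.decompress(compressed).decode("utf-8"))))
print(rows)  # [['a', 'b'], ['1', '2'], ['3', '4']]
```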
Accepted answer by Christian Alis
It's actually a long-standing limitation of dask. Load the files with dask.delayed instead:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]
df = dd.from_delayed(dfs) # df is a dask dataframe
Answered by de1
Pandas' current documentation says:
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
Since 'infer' is the default, that would explain why it is working with pandas.
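This can be checked with a small sketch: pandas uses compression='infer' by default and keys on the .gz suffix, so no explicit argument is needed (the sample file is created on the fly for illustration):

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small gzipped CSV; the .gz suffix is what 'infer' keys on
path = os.path.join(tempfile.mkdtemp(), "Data.gz")
with gzip.open(path, "wt") as f:
    f.write("a,b\n1,2\n3,4\n")

df = pd.read_csv(path)                        # compression='infer' is the default
same = pd.read_csv(path, compression="gzip")  # explicit form, same result

print(df.shape)         # (2, 2)
print(df.equals(same))  # True
```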
Dask's documentation on the compression argument:
String like 'gzip' or 'xz'. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
That would suggest that it should also infer the compression, at least for gz. That it doesn't (and still doesn't in 0.15.3) may be a bug. However, it works using compression='gzip'.
i.e.:
import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')
Answered by Dervin Thunk
Without the file it's difficult to say. What if you set the encoding, like # -*- coding: latin-1 -*-? Or, since read_csv is based off of pandas, you may even try dd.read_csv('Data.gz', encoding='utf-8'). Here's the list of Python encodings: https://docs.python.org/3/library/codecs.html#standard-encodings