dask dataframe apply meta
Original question: http://stackoverflow.com/questions/44432868/
This content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.
Asked by Matti Lyra
I want to do a frequency count on a single column of a dask dataframe. The code works, but I get a warning complaining that meta is not defined. If I try to define meta I get the error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define meta, but I'd like to know how to do that for future reference.
Dummy dataframe and the column frequencies
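import pandas as pd
from dask import dataframe as dd

# Build the frame row-wise, then transpose so each list becomes a column.
df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
# Older dask releases require npartitions (or chunksize) in from_pandas.
dask_df = dd.from_pandas(df, npartitions=2)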
In [39]: dask_df.head()
Out[39]:
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18
(dask_df.groupby('Column B')
.apply(lambda group: len(group))
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
warnings.warn(msg)
Out[60]:
Column B
Alice 2
David 2
Sam 3
dtype: int64
Trying to define meta produces an AttributeError:
(dask_df.groupby('Column B')
.apply(lambda d: len(d), meta={'Column B': 'int'})).compute()
The same happens for this:
(dask_df.groupby('Column B')
.apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()
The same happens if I use int as the dtype instead of "int", or for that matter 'f8' or np.float64, so it doesn't seem to be the dtype that is causing the problem.
The documentation on meta seems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).
What is meta, and how am I supposed to define it?
Using Python 3.6, dask 0.14.3 and pandas 0.20.2.
Answered by mdurant
meta is the prescription of the names/types of the output from the computation. This is required because apply() is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide a meta, then dask actually computes part of the data to see what the types should be. That is fine, but you should know it is happening.
You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.
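One way to see what such a zero-row version looks like is to run the same count on a small pandas sample and slice off every row. This is only a sketch in plain pandas: the names sample, result, template and by_hand are illustrative and not part of the original post, and groupby(...).size() is used here as a stand-in that yields the same per-group row counts as apply(len).

```python
import pandas as pd

# Small in-memory sample with the same grouping column as the question.
sample = pd.DataFrame({
    "Column B": ["Sam", "David", "David", "Alice", "Sam", "Alice", "Sam"],
})

# Row count per group in plain pandas (same values as apply(len) per group).
result = sample.groupby("Column B").size()

# Slicing off every row keeps only the structure: dtype, name and index type.
template = result.iloc[:0]

# A hand-written template that carries only the dtype.
by_hand = pd.Series(dtype="int64")

print(template.dtype)  # integer dtype of the per-group counts
print(len(template))   # no rows, only metadata
```

Either object describes an integer series with no data, which is the kind of description meta is meant to carry.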
The output of your computation is actually a series, so the following is the simplest that works:
(dask_df.groupby('Column B')
.apply(len, meta=('int'))).compute()
but more accurate would be:
(dask_df.groupby('Column B')
.apply(len, meta=pd.Series(dtype='int', name='Column B')))