pandas dask 数据框应用元

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/44432868/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 03:44:59  来源:igfitidea点击:

dask dataframe apply meta

pythonpandasdask

提问by Matti Lyra

I'm wanting to do a frequency count on a single column of a daskdataframe. The code works, but I get an warningcomplaining that metais not defined. If I try to define metaI get an error AttributeError: 'DataFrame' object has no attribute 'name'. For this particular use case it doesn't look like I need to define metabut I'd like to know how to do that for future reference.

我想对dask数据帧的单列进行频率计数。该代码有效,但我收到未定义的warning抱怨meta。如果我尝试定义meta我得到一个错误AttributeError: 'DataFrame' object has no attribute 'name'。对于这个特定的用例,我似乎不需要定义,meta但我想知道如何做以供将来参考。

Dummy dataframe and the column frequencies

虚拟数据框和列频率

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame([['Sam', 'Alex', 'David', 'Sarah', 'Alice', 'Sam', 'Anna'],
                   ['Sam', 'David', 'David', 'Alice', 'Sam', 'Alice', 'Sam'],
                   [12, 10, 15, 23, 18, 20, 26]],
                  index=['Column A', 'Column B', 'Column C']).T
dask_df = dd.from_pandas(df)


In [39]: dask_df.head()
Out[39]: 
  Column A Column B Column C
0      Sam      Sam       12
1     Alex    David       10
2    David    David       15
3    Sarah    Alice       23
4    Alice      Sam       18


(dask_df.groupby('Column B')
        .apply(lambda group: len(group))
       ).compute()

UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)
Out[60]: 
Column B
Alice    2
David    2
Sam      3
dtype: int64


Trying to define metaproduces AttributeError

试图定义meta产品AttributeError

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta={'Column B': 'int'})).compute()

same for this

同样的

 (dask_df.groupby('Column B')
         .apply(lambda d: len(d), meta=pd.DataFrame({'Column B': 'int'}))).compute()

same if I try having the dtypebe intinstead of "int"or for that matter 'f8'or np.float64so it doesn't seem like it's the dtypethat is causing the problem.

同样的,如果我尝试具有dtypeint代替"int"或与此有关'f8'np.float64因此它似乎并不像它的dtype所造成的问题。

The documentation on metaseems to imply that I should be doing exactly what I'm trying to do (http://dask.pydata.org/en/latest/dataframe-design.html#metadata).

上的文档meta似乎暗示我应该做我正在尝试做的事情(http://dask.pydata.org/en/latest/dataframe-design.html#metadata)。

What is meta? and how am I supposed to define it?

什么是meta?我应该如何定义它?

Using python 3.6dask 0.14.3and pandas 0.20.2

使用python 3.6dask 0.14.3pandas 0.20.2

回答by mdurant

metais the prescription of the names/types of the output from the computation. This is required because apply()is flexible enough that it can produce just about anything from a dataframe. As you can see, if you don't provide a meta, then dask actually computes part of the data, to see what the types should be - which is fine, but you should know it is happening. You can avoid this pre-computation (which can be expensive) and be more explicit when you know what the output should look like, by providing a zero-row version of the output (dataframe or series), or just the types.

meta是计算输出的名称/类型的规定。这是必需的,因为apply()它足够灵活,可以从数据帧生成几乎任何内容。如您所见,如果您不提供meta,则 dask 实际上计算部分数据,以查看类型应该是什么 - 这很好,但您应该知道它正在发生。您可以通过提供零行版本的输出(数据帧或系列)或仅提供类型来避免这种预先计算(这可能很昂贵)并在您知道输出应该是什么样子时更加明确。

The output of your computation is actually a series, so the following is the simplest that works

你的计算输出实际上是一个系列,所以下面是最简单的

(dask_df.groupby('Column B')
     .apply(len, meta=('int'))).compute()

but more accurate would be

但更准确的是

(dask_df.groupby('Column B')
     .apply(len, meta=pd.Series(dtype='int', name='Column B')))