pandas: how to map a column with dask
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40019905/
How to map a column with dask
Asked by wishi
I want to apply a mapping to a DataFrame column. With Pandas this is straightforward:
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
This writes the infos column, based on the custom_map function, and uses the rows in numbers for the lambda statement.
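For context, here is a runnable sketch of this pandas pattern; the bodies of custom_map and hashmap are hypothetical stand-ins, since the question doesn't show them:

import pandas as pd

hashmap = {1: "one", 2: "two"}  # hypothetical lookup table

def custom_map(nr, mapping):
    # Hypothetical body: look up nr, with a fallback for unknown keys.
    return mapping.get(nr, "unknown")

df2 = pd.DataFrame({"numbers": [1, 2, 3]})
df = df2.copy()
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
# df["infos"] is now ["one", "two", "unknown"]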
With dask this isn't that simple. ddf is a dask DataFrame. map_partitions is the equivalent, executing the mapping in parallel on each part of the DataFrame.
This does not work, because you don't define columns like that in dask:
ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))
Does anyone know how I can use columns here? I don't understand their API documentation at all.
Answered by MRocklin
You can use the .map method, exactly as in Pandas:
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'x': [1, 2, 3]})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: df.x.map(lambda x: x + 1)
Out[5]:
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]:
0    2
1    3
2    4
Name: x, dtype: int64
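The mapped series can also be assigned back as a new column, as the question wants; a sketch continuing the session above:

In [7]: ddf['y'] = ddf.x.map(lambda x: x + 1)  # dask infers meta from a small sample
In [8]: ddf.compute()
Out[8]:
   x  y
0  1  2
1  2  3
2  3  4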
Metadata
You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and
column names of the output. This metadata is necessary for many
algorithms in dask dataframe to work. For ease of use, some
alternative inputs are also available. Instead of a DataFrame,
a dict of {name: dtype} or iterable of (name, dtype) can be
provided. Instead of a series, a tuple of (name, dtype) can be
used. If not provided, dask will try to infer the metadata.
This may lead to unexpected results, so providing meta is
recommended.
For more information, see dask.dataframe.utils.make_meta.
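For instance, when the mapped function returns a DataFrame, meta can be given as a dict of {name: dtype}; a small sketch (not from the original answer):

>>> ddf.map_partitions(
...     lambda part: part.assign(double=part.x * 2),  # each part is a pandas DataFrame
...     meta={'x': 'int64', 'double': 'int64'},
... ).compute()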
So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:
>>> ddf.x.map(lambda x: x + 1, meta=('x', int))
or
>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))
This tells dask.dataframe what to expect from our function. If no meta is given then dask.dataframe will try running your function on a little piece of data. It will raise an error asking for help if this fails.
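Putting it together for the original question, the call would look roughly like this; custom_map and hashmap are the asker's names, and the object dtype in meta is an assumption about what custom_map returns:

>>> ddf["infos"] = ddf["numbers"].map(lambda nr: custom_map(nr, hashmap),
...                                   meta=("numbers", object))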