pandas: how to map a column with dask
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/40019905/
How to map a column with dask
Asked by wishi
I want to apply a mapping to a DataFrame column. With Pandas this is straightforward:
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
This writes the infos column, based on the custom_map function, and uses the rows in numbers for the lambda statement.
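For context, here is a runnable sketch of this pandas pattern; the bodies of custom_map and hashmap are hypothetical stand-ins, since the question doesn't show them:

import pandas as pd

hashmap = {1: "one", 2: "two"}  # hypothetical lookup table

def custom_map(nr, mapping):
    # Hypothetical body: look up nr, with a fallback for unknown keys.
    return mapping.get(nr, "unknown")

df2 = pd.DataFrame({"numbers": [1, 2, 3]})
df = df2.copy()
df["infos"] = df2["numbers"].map(lambda nr: custom_map(nr, hashmap))
# df["infos"] is now ["one", "two", "unknown"]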
With dask this isn't that simple. ddf is a dask DataFrame. map_partitions is the equivalent, executing the mapping in parallel on each part of the DataFrame.
This does not work, because you don't define columns like that in dask:
ddf["infos"] = ddf2["numbers"].map_partitions(lambda nr: custom_map(nr, hashmap))
Does anyone know how I can use columns here? I don't understand their API documentation at all.
Answered by MRocklin
You can use the .map method, exactly as in Pandas:
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: df = pd.DataFrame({'x': [1, 2, 3]})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: df.x.map(lambda x: x + 1)
Out[5]:
0    2
1    3
2    4
Name: x, dtype: int64

In [6]: ddf.x.map(lambda x: x + 1).compute()
Out[6]:
0    2
1    3
2    4
Name: x, dtype: int64
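The mapped series can also be assigned back as a new column, as the question wants; a sketch continuing the session above:

In [7]: ddf['y'] = ddf.x.map(lambda x: x + 1)  # dask infers meta from a small sample
In [8]: ddf.compute()
Out[8]:
   x  y
0  1  2
1  2  3
2  3  4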
Metadata
You may be asked to provide a meta= keyword. This lets dask.dataframe know the output name and type of your function. Copying the docstring from map_partitions here:
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and
column names of the output. This metadata is necessary for many
algorithms in dask dataframe to work. For ease of use, some
alternative inputs are also available. Instead of a DataFrame,
a dict of {name: dtype} or iterable of (name, dtype) can be
provided. Instead of a series, a tuple of (name, dtype) can be
used. If not provided, dask will try to infer the metadata.
This may lead to unexpected results, so providing meta is
recommended.
For more information, see dask.dataframe.utils.make_meta.
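For instance, when the mapped function returns a DataFrame, meta can be given as a dict of {name: dtype}; a small sketch (not from the original answer):

>>> ddf.map_partitions(
...     lambda part: part.assign(double=part.x * 2),  # each part is a pandas DataFrame
...     meta={'x': 'int64', 'double': 'int64'},
... ).compute()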
So in the example above, where my output will be a series with name 'x' and dtype int, I can do either of the following to be more explicit:
>>> ddf.x.map(lambda x: x + 1, meta=('x', int))
or
>>> ddf.x.map(lambda x: x + 1, meta=pd.Series([], dtype=int, name='x'))
This tells dask.dataframe what to expect from our function. If no meta is given then dask.dataframe will try running your function on a little piece of data. It will raise an error asking for help if this fails.
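Putting it together for the original question, the call would look roughly like this; custom_map and hashmap are the asker's names, and the object dtype in meta is an assumption about what custom_map returns:

>>> ddf["infos"] = ddf["numbers"].map(lambda nr: custom_map(nr, hashmap),
...                                   meta=("numbers", object))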