pandas: How to call unique() on a dask DataFrame

Question

Asked by femibyte

How do I call unique on a dask DataFrame?

I get the following error if I try to call it the same way as for a regular pandas dataframe:

In [27]: len(np.unique(ddf[['col1','col2']].values))

AttributeError                            Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))

/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
1924             return self._constructor_sliced(merge(self.dask, dsk), name,
1925                                             meta, self.divisions)
-> 1926         raise AttributeError("'DataFrame' object has no attribute %r" % key)
1927
1928     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute 'values'

Answer 1

Accepted answer by MRocklin

For both Pandas and Dask.dataframe, you should use the drop_duplicates method:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})

In [3]: df.drop_duplicates()
Out[3]: 
   x   y
0  1  10
2  2  20

In [4]: import dask.dataframe as dd

In [5]: ddf = dd.from_pandas(df, npartitions=2)

In [6]: ddf.drop_duplicates().compute()
Out[6]: 
   x   y
0  1  10
2  2  20

Answer 2

Answer by cggarvey

I'm not too familiar with Dask, but it appears to implement a subset of Pandas functionality, and that subset doesn't seem to include the DataFrame.values attribute.

http://dask.pydata.org/en/latest/dataframe-api.html

You could try this:

sum(ddf[['col1','col2']].apply(pd.Series.nunique, axis=0))

I don't know how it fares performance-wise, but it should provide you with the value (total number of distinct values in col1 and col2 from the ddf DataFrame).
