How to call unique() on a dask DataFrame
Source: http://stackoverflow.com/questions/40848443/
Content from StackOverflow, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep the same license.
Asked by femibyte
How do I call unique on a dask DataFrame?
I get the following error if I try to call it the same way as for a regular pandas dataframe:
In [27]: len(np.unique(ddf[['col1','col2']].values))
AttributeError                            Traceback (most recent call last)
<ipython-input-27-34c0d3097aab> in <module>()
----> 1 len(np.unique(ddf[['col1','col2']].values))

/dir/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in __getattr__(self, key)
   1924             return self._constructor_sliced(merge(self.dask, dsk), name,
   1925                                             meta, self.divisions)
-> 1926         raise AttributeError("'DataFrame' object has no attribute %r" % key)
   1927
   1928     def __dir__(self):

AttributeError: 'DataFrame' object has no attribute 'values'
Accepted answer by MRocklin
For both Pandas and Dask.dataframe you should use the drop_duplicates method:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'x': [1, 1, 2], 'y': [10, 10, 20]})
In [3]: df.drop_duplicates()
Out[3]:
   x   y
0  1  10
2  2  20

In [4]: import dask.dataframe as dd
In [5]: ddf = dd.from_pandas(df, npartitions=2)
In [6]: ddf.drop_duplicates().compute()
Out[6]:
   x   y
0  1  10
2  2  20
Answer by cggarvey
I'm not too familiar with Dask, but it appears to implement a subset of Pandas functionality, and that subset doesn't seem to include the DataFrame.values attribute.
http://dask.pydata.org/en/latest/dataframe-api.html
You could try this:
sum(ddf[col].nunique().compute() for col in ['col1', 'col2'])
I don't know how it fares performance-wise, but it should give you the number of distinct values in col1 plus the number of distinct values in col2 from the ddf DataFrame.
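The approaches in this thread do not all count the same thing, which matters when the two columns share values. The sketch below uses plain pandas and NumPy on a made-up frame (the data is hypothetical) to compare three quantities: distinct scalars over both columns flattened together (what the question's np.unique call would return on a pandas frame), the sum of per-column distinct counts, and distinct rows (what drop_duplicates gives).

```python
import numpy as np
import pandas as pd

# Hypothetical data in which the value 2 appears in both columns.
df = pd.DataFrame({"col1": [1, 1, 2, 2], "col2": [2, 2, 3, 3]})
cols = ["col1", "col2"]

# Distinct scalars across both columns flattened together: {1, 2, 3}.
n_flat = len(np.unique(df[cols].values))

# Sum of per-column distinct counts: 2 (col1) + 2 (col2).
n_per_column = int(df[cols].nunique().sum())

# Distinct (col1, col2) rows: (1, 2) and (2, 3).
n_rows = len(df[cols].drop_duplicates())

print(n_flat, n_per_column, n_rows)  # prints: 3 4 2
```

Which of the three is wanted depends on the use case; the per-column sum counts a value twice if it occurs in both columns.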