How to "select distinct" across multiple data frame columns in pandas?

Warning: this is a translation of a StackOverflow Q&A, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/30530663/


How to "select distinct" across multiple data frame columns in pandas?

python, pandas

Asked by Jody

I'm looking for a way to do the equivalent of the SQL


SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas SQL comparison doesn't have anything about distinct.


.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

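For illustration, the tuple workaround mentioned above might look like the following minimal sketch (the frame and column names here are assumptions, not from the original question):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Collapse each row into a tuple, then take the unique tuples
distinct_pairs = df[['a', 'b']].apply(tuple, axis=1).unique()
print(distinct_pairs)  # [(1, 3) (2, 4) (2, 5)]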

Am I missing something obvious, or is there no way to do this?


Accepted answer by joris

You can use the drop_duplicates method to get the unique rows in a DataFrame:


In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.

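For example, a minimal sketch continuing with the same df, keeping only the first row for each unique value of a (the session numbering below is illustrative):

In [33]: df.drop_duplicates(subset=['a'])
Out[33]:
   a  b
0  1  3
1  2  4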

Answered by EdChum

There is no unique method for a df. If the number of unique values for each column were the same, then the following would work: df.apply(pd.Series.unique), but if not you will get an error. Another approach would be to store the values in a dict keyed on the column name:


In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
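
For completeness, a minimal sketch of the apply variant mentioned above, assuming a toy frame where every column happens to have the same number of unique values, so apply can assemble the results into a DataFrame:

In [112]:
df2 = pd.DataFrame({'a': [1, 2, 1], 'b': [3, 4, 3]})
df2.apply(pd.Series.unique)

Out[112]:
   a  b
0  1  3
1  2  4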

Answered by dorante

You can take the sets of the columns and just subtract the smaller set from the larger set:


distinct_values = set(df['a'])-set(df['b'])
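
Note that this is a set difference rather than a row-wise distinct: it gives the values of one column that never appear in the other. A minimal sketch, reusing the df from the accepted answer (an assumption here):

set(df['a']) - set(df['b'])  # {1, 2}: values present in a but not in b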

Answered by Naiara Andrade

I think drop_duplicates will sometimes not be so useful, depending on the dataframe.


I found this:


[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And it worked for me!


https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe


Answered by Yury Wallet

I've tried different solutions. The first was:


a_df=np.unique(df[['col1','col2']], axis=0)
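
For instance, on a small numeric frame (the frame below is an assumption), this might produce the following; note that np.unique also sorts the rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 2], 'col2': [3, 4, 3, 5]})
a_df = np.unique(df[['col1', 'col2']], axis=0)
print(a_df)
# [[1 3]
#  [2 4]
#  [2 5]]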

This works well for non-object data. Another way to do this, and to avoid the error with object-dtype columns, is to apply drop_duplicates():


a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]

You can also use SQL to do this, but it ran very slowly in my case:


from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

Answered by ncoghlan

To solve a similar problem, I'm using groupby:


print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT, as shown).
