Python: How to "select distinct" across multiple DataFrame columns in pandas?

Question

Asked by Jody

I'm looking for a way to do the equivalent of the SQL query

SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas SQL comparison page doesn't have anything about DISTINCT.

.unique() only works for a single column, so I suppose I could concatenate the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?

Answer 1

Accepted answer by joris

You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
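To tie this back to the SQL in the question, here is a minimal sketch of both variants. The extra column c is a hypothetical addition to the example frame, included only to show what subset does with columns that are not part of the uniqueness check.

```python
import pandas as pd

# Example frame from above, plus a hypothetical extra column "c".
df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5], 'c': [10, 20, 30, 40]})

# Closest match to SELECT DISTINCT a, b: select the columns, then drop duplicates.
distinct_ab = df[['a', 'b']].drop_duplicates()

# subset= keeps every column but judges uniqueness on a and b only;
# by default the first row of each duplicate group is kept.
first_per_ab = df.drop_duplicates(subset=['a', 'b'])

print(distinct_ab)
print(first_per_ab)
```

Selecting the columns first is usually what a SELECT DISTINCT col1, col2 translation needs, since subset alone still returns the other columns.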

Answer 2

Answer by EdChum

There is no unique method for a DataFrame. If the number of unique values in each column were the same, then df.apply(pd.Series.unique) would work, but if not, you may get an error. Another approach is to store the values in a dict keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
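The loop above can also be written as a dict comprehension. This is a sketch using the same example frame. Note that it collects the unique values of each column separately, which is not the same as the distinct row combinations the question asks for.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 2, 4], 'b': [1, 1, 1, 2, 2]})

# One array of unique values per column, in order of first appearance.
unique_per_column = {col: df[col].unique() for col in df.columns}

# If only the number of distinct values per column is needed.
counts = df.nunique()

print(unique_per_column)
print(counts)
```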

Answer 3

Answer by dorante

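You can build a set from each column and subtract one from the other. Note that this gives the values that appear in column a but not in column b, which is a set difference rather than the distinct (a, b) pairs that SELECT DISTINCT returns: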

distinct_values = set(df['a'])-set(df['b'])
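To make the difference concrete, here is a small sketch on a hypothetical frame. It contrasts the set difference with a set of row tuples, which is one plain-Python way to get distinct pairs.

```python
import pandas as pd

# Hypothetical data, chosen so that the two columns share the value 2.
df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 2]})

# Set difference: values found in column a that never occur in column b.
only_in_a = set(df['a']) - set(df['b'])

# Distinct (a, b) combinations, built by zipping the columns row by row.
distinct_pairs = set(zip(df['a'], df['b']))

print(only_in_a)
print(distinct_pairs)
```

A set is unordered, so this loses row order; drop_duplicates keeps it.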

Answer 4

Answer by Naiara Andrade

I think drop_duplicates is sometimes not that useful, depending on the DataFrame.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And it worked for me!

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe

Answer 5

Answer by Yury Wallet

I've tried different solutions. First was:

a_df=np.unique(df[['col1','col2']], axis=0)

and it works well for non-object data. Another way to do this, which avoids the error for object-dtype columns, is to apply drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
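The two approaches do not return the same kind of result, which the following sketch illustrates on a hypothetical numeric frame. The reset_index call is an optional addition, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one repeated (col1, col2) pair.
df = pd.DataFrame({'col1': [2, 1, 2, 1], 'col2': [5, 3, 5, 4]})

# np.unique with axis=0 returns a plain ndarray of unique rows, sorted;
# the column labels and the original index are lost.
arr = np.unique(df[['col1', 'col2']].to_numpy(), axis=0)

# drop_duplicates returns a DataFrame, in order of first appearance.
a_df = df[['col1', 'col2']].drop_duplicates().reset_index(drop=True)

print(arr)
print(a_df)
```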

You can also use SQL to do this, but it was very slow in my case:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)

Answer 6

Answer by ncoghlan

To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT, as shown).
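As a rough sketch of the counting case, on hypothetical data: the group count can be read from the ngroups attribute, and it should match the number of rows left after drop_duplicates when the key columns have no missing values. By default, groupby excludes rows whose keys are missing.

```python
import pandas as pd

# Hypothetical data with three distinct (col1, col2) pairs.
df = pd.DataFrame({'col1': ['x', 'x', 'y', 'y'], 'col2': [1, 1, 1, 2]})

# COUNT DISTINCT via groupby: the number of groups.
n_groups = df.groupby(['col1', 'col2']).ngroups

# The same count via drop_duplicates.
n_rows = len(df[['col1', 'col2']].drop_duplicates())

# The distinct pairs together with how often each one occurs.
pairs = df.groupby(['col1', 'col2']).size().reset_index(name='n')

print(f"Distinct entries: {n_groups}")
print(pairs)
```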
