How to "select distinct" across multiple data frame columns in pandas?
Source: http://stackoverflow.com/questions/30530663/
Provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Jody
I'm looking for a way to do the equivalent of the SQL query
SELECT DISTINCT col1, col2 FROM dataframe_table
The pandas SQL comparison doesn't have anything about distinct.
.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.
Am I missing something obvious, or is there no way to do this?
Accepted answer by joris
You can use the drop_duplicates method to get the unique rows in a DataFrame:
In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5
You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
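As a minimal sketch of how this maps onto the SQL in the question: selecting the columns first and then calling drop_duplicates gives the equivalent of SELECT DISTINCT col1, col2, while the subset argument keeps every column but uses only the listed ones to decide which rows are duplicates. The column names and sample data here (including col3) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; col3 exists only to show what `subset` keeps.
df = pd.DataFrame({
    "col1": [1, 2, 1, 2],
    "col2": [3, 4, 3, 5],
    "col3": ["w", "x", "y", "z"],
})

# Equivalent of SELECT DISTINCT col1, col2: project the columns, then deduplicate.
distinct_pairs = df[["col1", "col2"]].drop_duplicates()

# `subset` keeps all columns but compares only col1 and col2.
# By default the first occurrence of each duplicate group is kept.
first_rows = df.drop_duplicates(subset=["col1", "col2"])

print(distinct_pairs)
print(first_rows)
```

Both results keep the original index labels of the surviving rows; chain .reset_index(drop=True) if a fresh 0..n-1 index is wanted.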
Answer by EdChum
There is no unique method for a df. If the number of unique values for each column were the same, then the following would work: df.apply(pd.Series.unique), but if not then you will get an error. Another approach would be to store the values in a dict which is keyed on the column name:
In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d = {}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
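The loop above can also be written as a dict comprehension. The sketch below reuses the same sample data and is only an illustration. Note that this collects the distinct values of each column separately, which is a different result from the distinct (a, b) row combinations that SELECT DISTINCT returns.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# One array of distinct values per column, keyed on the column name.
# Iterating over a DataFrame yields its column labels.
per_column = {col: df[col].unique() for col in df}

# For contrast: distinct combinations of a and b across rows.
distinct_rows = df.drop_duplicates()

print(per_column)
print(distinct_rows)
```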
Answer by dorante
You can take the sets of the columns and subtract one set from the other. This gives the values that appear in column a but not in column b:
distinct_values = set(df['a'])-set(df['b'])
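A small sketch of what this set difference produces, using made-up sample data. It returns values present in one column and absent from the other, so it answers a different question than SELECT DISTINCT over both columns.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"a": [0, 1, 2, 2, 4], "b": [1, 1, 1, 2, 2]})

# Convert to plain Python ints before building the sets.
only_in_a = set(df["a"].tolist()) - set(df["b"].tolist())

print(only_in_a)  # values of column a that never occur in column b
```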
Answer by Naiara Andrade
I think drop_duplicates is sometimes not that useful, depending on the dataframe.
I found this:
[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)
And it works for me!
https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe
Answer by Yury Wallet
I've tried different solutions. The first was:
a_df=np.unique(df[['col1','col2']], axis=0)
and it works well for non-object data. Another way to do this, and to avoid the error (for object-dtype columns), is to apply drop_duplicates():
a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
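The two approaches differ in what they return, which the following sketch illustrates with made-up numeric data. np.unique with axis=0 returns a NumPy array whose rows are sorted, so it has to be wrapped back into a DataFrame, whereas drop_duplicates returns a DataFrame with rows in order of first appearance.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric sample data; col1/col2 follow the names used above.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [5, 3, 4, 3]})

# np.unique(axis=0) deduplicates whole rows and returns them sorted.
unique_rows = np.unique(df[["col1", "col2"]].to_numpy(), axis=0)
a_df = pd.DataFrame(unique_rows, columns=["col1", "col2"])

# drop_duplicates keeps the first occurrence of each row, in original order.
b_df = df[["col1", "col2"]].drop_duplicates().reset_index(drop=True)

print(a_df)
print(b_df)
```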
You can also use SQL to do this, but it was very slow in my case:
from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)
Answer by ncoghlan
To solve a similar problem, I'm using groupby:
print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")
Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT, as shown).
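A short sketch of two ways to get that count, using made-up sample data. The ngroups attribute of a groupby object gives the number of groups directly, and counting the deduplicated rows is an alternative. One caveat: by default groupby skips rows whose key contains NaN, while drop_duplicates treats NaN as a value, so the two counts can differ when the key columns have missing data.

```python
import pandas as pd

# Hypothetical sample data with no missing values.
df = pd.DataFrame({"col1": [1, 2, 1, 2], "col2": [3, 4, 3, 5]})

# Number of distinct (col1, col2) combinations via groupby.
n_groupby = df.groupby(["col1", "col2"]).ngroups

# Same count via deduplicated rows.
n_dedup = len(df[["col1", "col2"]].drop_duplicates())

print(f"Distinct entries: {n_groupby} (groupby), {n_dedup} (drop_duplicates)")
```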