Original source: http://stackoverflow.com/questions/48257889/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
How to get unique values in all columns of a pandas DataFrame
Asked by Kumar AK
I want to list all the unique values in every column of a pandas DataFrame and store them in another DataFrame. I have tried the code below, but it appends the values row-wise and I want them column-wise. How do I do that?
import pandas as pd

raw_data = {'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
            'test_score': [76, 75, 74, 76]}
df2 = pd.DataFrame(raw_data, columns=['student_name', 'test_score'])
newDF = pd.DataFrame()
for column in df2.columns:
    dat = df2[column].drop_duplicates()
    df3 = pd.DataFrame(dat)
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks rows the same way
    newDF = pd.concat([newDF, df3])
print(newDF)
Expected Output:
student_name  test_score
Ali           74
Miller        75
              76
Answered by jezrael
I think you can use drop_duplicates.
If you want to check only some column(s) and keep the first row of each group of duplicates:
newDF = df2.drop_duplicates('student_name')
print(newDF)
    student_name  test_score
0         Miller        76.0
1       Jacobson        88.0
2            Ali        84.0
3         Milner        67.0
4          Cooze        53.0
5          Jacon        96.0
6         Ryaner        64.0
7           Sone        91.0
8          Sloan        77.0
9          Piger        73.0
10         Riani        52.0
And thank you, @cᴏʟᴅsᴘᴇᴇᴅ, for another solution:
df2[~df2.student_name.duplicated()]
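To make the equivalence concrete, here is a small sketch run against the four-row sample from the question. The names by_method and by_mask are mine, not from the original answer. Both expressions should select the same rows, because duplicated() flags every repeat after the first occurrence:

```python
import pandas as pd

# Sample data taken from the question.
df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

# Keep the first row for each distinct student_name.
by_method = df2.drop_duplicates('student_name')

# duplicated() is True for every repeat after the first occurrence,
# so negating the mask keeps the same rows.
by_mask = df2[~df2['student_name'].duplicated()]

print(by_method)
print(by_method.equals(by_mask))  # True
```

The boolean-mask form is handy when the condition needs to be combined with other filters; otherwise drop_duplicates reads more directly.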
But if you want to check all columns together for duplicates and keep the first row of each:
newDF = df2.drop_duplicates()
print(newDF)
    student_name  test_score
0         Miller        76.0
1       Jacobson        88.0
2            Ali        84.0
3         Milner        67.0
4          Cooze        53.0
5          Jacon        96.0
6         Ryaner        64.0
7           Sone        91.0
8          Sloan        77.0
9          Piger        73.0
10         Riani        52.0
11           Ali         NaN
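As a rough illustration on the question's current four-row sample (not the longer sample whose output is printed above), the keep parameter of drop_duplicates controls which copy of a fully duplicated row survives. The variable names keep_first, keep_last and keep_none are only for illustration:

```python
import pandas as pd

df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

# Rows 0 and 3 are identical across both columns.
keep_first = df2.drop_duplicates()              # default keep='first': drops row 3
keep_last = df2.drop_duplicates(keep='last')    # drops row 0, keeps row 3
keep_none = df2.drop_duplicates(keep=False)     # drops every row that has a duplicate

print(keep_first.index.tolist())  # [0, 1, 2]
print(keep_last.index.tolist())   # [1, 2, 3]
print(keep_none.index.tolist())   # [1, 2]
```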
EDIT for the new sample: remove duplicates and sort by both columns:
newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])
print(newDF)
   student_name  test_score
2           Ali          74
1        Miller          75
0        Miller          76
EDIT1: If you want to replace duplicates in the first column with NaNs:
newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])
newDF['student_name'] = newDF['student_name'].mask(newDF['student_name'].duplicated())
print(newDF)
   student_name  test_score
2           Ali          74
1        Miller          75
0           NaN          76
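A short sketch of what the mask step does, assuming the four-row sample from the question. Series.mask replaces values with NaN wherever the condition is True, and here the condition is the duplicated() flag. The is_repeat and display_df names and the final fillna line are my own additions, in case a blank cell is preferred over NaN as in the expected output:

```python
import pandas as pd

df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])

# True for every student_name that already appeared in an earlier row.
is_repeat = newDF['student_name'].duplicated()
print(is_repeat.tolist())  # [False, False, True]

# Replace the repeated names with NaN, leaving first occurrences untouched.
newDF['student_name'] = newDF['student_name'].mask(is_repeat)

# Optional (not in the original answer): show repeats as empty strings.
display_df = newDF.fillna({'student_name': ''})
print(display_df)
```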
EDIT2: More general solution:
newDF = (df2.sort_values(df2.columns.tolist())
            .reset_index(drop=True)
            .apply(lambda x: x.drop_duplicates()))
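To see what the general solution returns, here is a sketch on the question's sample. Each column is de-duplicated independently after sorting, so a column with fewer unique values should end up padded with NaN when apply reassembles the results. This is my reading of the behaviour, so treat the comments as a guide rather than a guarantee across pandas versions:

```python
import pandas as pd

df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

newDF = (df2.sort_values(df2.columns.tolist())   # sort by every column, left to right
            .reset_index(drop=True)              # renumber rows 0..n-1 after sorting
            .apply(lambda x: x.drop_duplicates()))  # de-duplicate each column on its own

# student_name has two unique values and test_score has three,
# so the shorter column is expected to hold NaN in its last row.
print(newDF)
```

Note that the rows of the result no longer pair a student with their own score; each column is just its own sorted list of unique values.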