pandas: How to get the unique values in all columns of a pandas DataFrame

Question

Asked by Kumar AK

I want to list all the unique values in every column of a pandas DataFrame and store them in another DataFrame. I have tried the code below, but it appends the values row-wise and I want them column-wise. How do I do that?

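import pandas as pd

raw_data = {'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
            'test_score': [76, 75, 74, 76]}
df2 = pd.DataFrame(raw_data, columns=['student_name', 'test_score'])

newDF = pd.DataFrame()

for column in df2.columns:
    dat = df2[column].drop_duplicates()
    df3 = pd.DataFrame(dat)
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the
    # pieces row-wise in the same way, which is the behaviour being asked about.
    newDF = pd.concat([newDF, df3])

print(newDF)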


Expected Output:
student_name  test_score
Ali          74
Miller       75
             76

Answer 1

Answered by jezrael

I think you can use drop_duplicates.

To check only certain column(s) and keep the first row of each duplicate group:

newDF = df2.drop_duplicates('student_name')
print(newDF)
   student_name  test_score
0        Miller        76.0
1      Jacobson        88.0
2           Ali        84.0
3        Milner        67.0
4         Cooze        53.0
5         Jacon        96.0
6        Ryaner        64.0
7          Sone        91.0
8         Sloan        77.0
9         Piger        73.0
10        Riani        52.0

And thank you, @c???s???? for another solution:

df2[~df2.student_name.duplicated()]
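
As a quick check of the two forms above, here is a minimal sketch that runs both on the sample data from the question and compares the results. The names `by_subset` and `by_mask` are only illustrative and do not appear in the answer.

```python
import pandas as pd

df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

# Keep the first row for each distinct student_name.
by_subset = df2.drop_duplicates('student_name')

# Select the same rows with a boolean mask: duplicated() flags every
# repeat of a value after its first occurrence, and ~ inverts the flags.
by_mask = df2[~df2.student_name.duplicated()]

# Compare the two results (same rows, same index, same dtypes).
print(by_subset.equals(by_mask))
```

Both forms keep the original index labels of the surviving rows, so neither resets the index.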

To check all columns together for duplicates and keep the first row of each:

newDF = df2.drop_duplicates()
print(newDF)
   student_name  test_score
0        Miller        76.0
1      Jacobson        88.0
2           Ali        84.0
3        Milner        67.0
4         Cooze        53.0
5         Jacon        96.0
6        Ryaner        64.0
7          Sone        91.0
8         Sloan        77.0
9         Piger        73.0
10        Riani        52.0
11          Ali         NaN

EDIT for the new sample: remove duplicates and sort by both columns:

newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])
print(newDF)
  student_name  test_score
2          Ali          74
1       Miller          75
0       Miller          76

EDIT1: To replace duplicates in the first column with NaN:

newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])
newDF['student_name'] = newDF['student_name'].mask(newDF['student_name'].duplicated())
print(newDF)
  student_name  test_score
2          Ali          74
1       Miller          75
0          NaN          76
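
The expected output in the question shows a blank rather than NaN for the repeated name. One possible variation, offered as a sketch and not part of the original answer, is to pass an empty string as the replacement value to `mask`. The helper name `repeated` is hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

newDF = df2.drop_duplicates().sort_values(['student_name', 'test_score'])

# Flag names that repeat an earlier row, then blank them out
# with '' instead of the default NaN.
repeated = newDF['student_name'].duplicated()
newDF['student_name'] = newDF['student_name'].mask(repeated, '')

print(newDF.to_string(index=False))
```

Note that the column then holds empty strings, so checks such as `isna()` will no longer detect the blanked cells.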

EDIT2: More general solution:

newDF = (df2.sort_values(df2.columns.tolist())
            .reset_index(drop=True)
            .apply(lambda x: x.drop_duplicates()))
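
If the goal is simply one column of unique values per original column, a different approach from the one in the answer is to build each column separately and join them side by side. The sketch below assumes that pairing between columns does not need to be preserved, since each column is sorted independently. The names `unique_cols` and `uniqueDF` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({'student_name': ['Miller', 'Miller', 'Ali', 'Miller'],
                    'test_score': [76, 75, 74, 76]})

# One Series of sorted unique values per column, each indexed 0..n-1.
unique_cols = {col: pd.Series(sorted(df2[col].dropna().unique()))
               for col in df2.columns}

# axis=1 places the Series next to each other; the dict keys become the
# column names, and shorter columns are padded with NaN during alignment.
uniqueDF = pd.concat(unique_cols, axis=1)

print(uniqueDF)
```

This is the column-wise counterpart of the loop in the question: the loop stacked each column's values underneath the previous ones, whereas `axis=1` lines them up as columns.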
