Python: comparing rows of a pandas dataframe (rows have some overlapping values)

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/16533421/

Time: 2020-08-18 22:56:58  Source: igfitidea

Comparing rows of pandas dataframe (rows have some overlapping values)

python, pandas, dataframe

Asked by mlimb

I have a pandas dataframe with 21 columns. I am focusing on a subset of rows that have exactly the same values in every column except for 6 columns whose values are unique to each row. I don't know a priori which column headings these 6 values correspond to.

I tried converting each row to an Index object and performing a set operation on two rows, e.g.:

row1 = pd.Index(sample_data[0])
row2 = pd.Index(sample_data[1])
row1.difference(row2)  # set difference; older pandas versions spelled this `row1 - row2`

which returns an Index object containing values unique to row1. Then I can manually deduce which columns have unique values.

How can I programmatically grab the column headings that these values correspond to in the initial dataframe? Or, is there a way to compare two or multiple dataframe rows and extract the 6 different column values of each row, as well as the corresponding headings? Ideally, it would be nice to generate a new dataframe with the unique columns.

In particular, is there a way to do this using set operations?

Thank you.

Answer by Garrett

Here's a quick solution to return only the columns in which the first two rows differ.

In [13]: df = pd.DataFrame(list(zip(*[range(5), list('abcde'), list('aaaaa'),
...                                   list('bbbbb')])), columns=list('ABCD'))

In [14]: df
Out[14]: 
   A  B  C  D
0  0  a  a  b
1  1  b  a  b
2  2  c  a  b
3  3  d  a  b
4  4  e  a  b

In [15]: df[df.columns[df.iloc[0] != df.iloc[1]]]
Out[15]: 
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
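
The question also asks for the differing values of each compared row together with their headings. A minimal sketch of that, assuming the example frame from the session above; `mask`, `diff_cols`, and `diff` are illustrative names, not part of the original answer.

```python
import pandas as pd

# Rebuild the example frame from the session above.
df = pd.DataFrame({
    "A": range(5),
    "B": list("abcde"),
    "C": list("aaaaa"),
    "D": list("bbbbb"),
})

# Boolean Series indexed by column name: True where rows 0 and 1 differ.
mask = df.iloc[0] != df.iloc[1]

# Headings of the differing columns.
diff_cols = list(df.columns[mask.to_numpy()])

# Only the two compared rows, restricted to the differing columns.
diff = df.loc[[0, 1], mask]

print(diff_cols)
print(diff)
```

Note that `NaN != NaN` evaluates to True, so a column holding NaN in both rows would be reported as differing by this comparison.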

And a solution to find all columns with more than one unique value throughout the entire frame.

In [33]: df[df.columns[df.apply(lambda s: len(s.unique()) > 1)]]
Out[33]: 
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
4  4  e
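
Since the question concerns a subset of rows rather than the whole frame, the same idea can be restricted to that subset. This is only a sketch: the row positions in `rows` are a hypothetical choice, the frame is a slightly modified version of the one above, and `DataFrame.nunique` is swapped in for the `apply`/`unique` lambda.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": list("abcde"),
    "C": list("aaaaa"),
    "D": list("bbbbc"),  # the last row differs here, unlike the frame above
})

rows = [0, 1, 2]  # hypothetical subset of row positions
subset = df.iloc[rows]

# nunique() counts distinct values per column (NaN is excluded by default).
varying = subset.nunique() > 1

# New frame holding only the columns that vary within the subset.
result = subset.loc[:, varying]
print(result)
```

Within rows 0 to 2 column `D` is constant, so it is dropped even though it varies across the full frame.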

Answer by Jeff Tratner

You don't really need the index; you can just compare two rows and use the result to filter the columns with a list comprehension.

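import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.ones(10), "col2": np.ones(10), "col3": range(2, 12)})
row1 = df.iloc[0]  # irow() was removed from pandas; iloc is the replacement
row2 = df.iloc[1]
unique_columns = row1 != row2  # boolean Series, True where the two rows differ
cols = [colname for colname, unique_column in zip(df.columns, unique_columns) if unique_column]
print(cols)  # ['col3']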
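
If set operations are preferred, as the question asks, one possible approach is to take the difference of (column, value) pairs instead of bare values, so that the headings travel with the values. This is a sketch rather than part of the original answer; it assumes every cell value is hashable, and the small frame and the names `pairs1`, `pairs2`, `only_in_row1`, and `only_in_row2` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1.0, 1.0],
    "col2": ["x", "x"],
    "col3": [2, 3],
})

row1 = df.iloc[0]
row2 = df.iloc[1]

# Series.items() yields (column name, value) pairs.
pairs1 = set(row1.items())
pairs2 = set(row2.items())

# Pairs present in one row but not the other.
only_in_row1 = pairs1 - pairs2
only_in_row2 = pairs2 - pairs1

# Headings of the columns in which the rows differ.
diff_cols = sorted(col for col, _ in only_in_row1)
print(diff_cols)
print(dict(only_in_row1))
print(dict(only_in_row2))
```

Keeping the column name in each pair avoids the ambiguity of the plain value difference, where the same value appearing in two different columns cannot be told apart.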

If you know the standard value for each column, you can convert all the rows to a list of booleans, i.e.:

standard_row = np.ones(3)
columns = df.columns
unique_columns = df.apply(lambda x: x != standard_row, axis=1)
unique_columns.apply(lambda x: [col for col, unique_column in zip(columns, x) if unique_column], axis=1)
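
The row-wise `apply` calls above can also be expressed with a single element-wise comparison. A minimal sketch, assuming a small made-up frame in which one row deviates in `col2`; `differs` and `cols_per_row` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": np.ones(4),
    "col2": [1.0, 5.0, 1.0, 1.0],
    "col3": range(2, 6),
})
standard_row = np.ones(3)  # one standard value per column

# Compare every row against standard_row, column by column.
differs = df.ne(standard_row, axis=1)

# Map each row label to the list of columns that deviate from the standard.
cols_per_row = {
    idx: list(differs.columns[flags.to_numpy()])
    for idx, flags in differs.iterrows()
}
print(cols_per_row)
```

The boolean frame `differs` can also be used directly, for example `df.loc[:, differs.any()]` keeps every column that deviates in at least one row.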