pandas: compare values in two columns of a data frame
Original question: http://stackoverflow.com/questions/28124187/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
compare values in two columns of data frame
Asked by user3282777
I have the following two columns in a pandas data frame:
   256  Z
0    2  2
1    2  3
2    4  4
3    4  9
There are around 1594 rows. '256' and 'Z' are the column headers, whereas 0, 1, 2, 3, and so on are the row numbers (the first column above). I want to print the row numbers where the value in column '256' is not equal to the value in column 'Z'. The output in the above case would therefore be 1, 3. How can this comparison be made in pandas? I would be very grateful for any help. Thanks.
Answered by cel
Create the data frame:
import pandas as pd
df = pd.DataFrame({"256":[2,2,4,4], "Z": [2,3,4,9]})
Output:
   256  Z
0    2  2
1    2  3
2    4  4
3    4  9
After subsetting your data frame, use the index to get the ids of the rows in the subset:
row_ids = df[df["256"] != df.Z].index
which gives:
Int64Index([1, 3], dtype='int64')
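To make the one-liner above easier to follow, here is a minimal sketch that splits it into its intermediate steps. The names mask and subset are only illustrative and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]})

# Step 1: the element-wise comparison yields a boolean Series,
# True wherever the two columns differ.
mask = df["256"] != df["Z"]
print(mask.tolist())     # [False, True, False, True]

# Step 2: indexing the frame with that mask keeps only the True rows.
subset = df[mask]

# Step 3: the index of the subset holds the labels of the rows that differ.
row_ids = subset.index
print(row_ids.tolist())  # [1, 3]
```

The printed representation of the index object itself may differ between pandas versions, which is why the sketch prints plain lists.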
Answered by aus_lacy
Another way is to use the .loc method of pandas.DataFrame, which returns the rows that satisfy the boolean condition; the index of the result gives their row labels:
df.loc[(df['256'] != df['Z'])].index
with an output of:
Int64Index([1, 3], dtype='int64')
This happens to be marginally the quickest of the listed implementations, as can be seen in an IPython notebook:
import pandas as pd
import numpy as np
df = pd.DataFrame({"256":np.random.randint(0,10,1594), "Z": np.random.randint(0,10,1594)})
%timeit df.loc[(df['256'] != df['Z'])].index
%timeit row_ids = df[df["256"] != df.Z].index
%timeit rows = list(df[df['256'] != df.Z].index)
%timeit df[df['256'] != df['Z']].index
with an output of:
1000 loops, best of 3: 352 μs per loop
1000 loops, best of 3: 358 μs per loop
1000 loops, best of 3: 611 μs per loop
1000 loops, best of 3: 355 μs per loop
However, a difference of 5-10 microseconds is not significant. If you work with a very large data set in the future, timing and efficiency may become a much more important issue. For your relatively small data set of 1594 rows, I would go with the solution that looks the most elegant and is the most readable.
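If you want to repeat this kind of comparison outside IPython, where %timeit is not available, a rough sketch with the standard timeit module could look like the following. The seed, the candidates dictionary, and the loop counts are assumptions chosen for illustration, and the numbers you get will depend on your machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Seeded random data with the same shape as in the question (1594 rows).
rng = np.random.default_rng(0)
df = pd.DataFrame({"256": rng.integers(0, 10, 1594),
                   "Z": rng.integers(0, 10, 1594)})

# The variants under comparison, wrapped as zero-argument callables.
candidates = {
    "loc": lambda: df.loc[df["256"] != df["Z"]].index,
    "getitem": lambda: df[df["256"] != df["Z"]].index,
    "list": lambda: list(df[df["256"] != df["Z"]].index),
}

loops = 200
for name, fn in candidates.items():
    # Take the best of three runs, as %timeit reported in the answer above.
    best = min(timeit.repeat(fn, number=loops, repeat=3)) / loops
    print(f"{name}: {best * 1e6:.0f} microseconds per call")
```

All three variants select the same rows, so the choice between them comes down to readability unless the data set is large.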
Answered by rchang
You can try this:
# Assuming your DataFrame is named "frame"
rows = list(frame[frame['256'] != frame.Z].index)
rows will now be a list containing the row numbers for which those two column values are not equal. So with your data:
>>> frame
   256  Z
0    2  2
1    2  3
2    4  4
3    4  9
[4 rows x 2 columns]
>>> rows = list(frame[frame['256'] != frame.Z].index)
>>> print(rows)
[1, 3]
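One thing worth keeping in mind: the list holds index labels, which only coincide with row numbers when the frame has the default 0, 1, 2, ... index. As a hedged sketch, assuming a frame with string labels (the labels "a" to "d" are made up for illustration), the positional row numbers can be recovered with numpy.flatnonzero:

```python
import numpy as np
import pandas as pd

# Same data as above, but with a hypothetical non-default index.
frame = pd.DataFrame({"256": [2, 2, 4, 4], "Z": [2, 3, 4, 9]},
                     index=["a", "b", "c", "d"])

mask = frame["256"] != frame["Z"]

# Index labels of the rows that differ.
labels = frame.index[mask].tolist()
print(labels)     # ['b', 'd']

# Positional row numbers of the rows that differ.
positions = np.flatnonzero(mask.to_numpy()).tolist()
print(positions)  # [1, 3]
```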
Answered by Primer
Assuming df is your dataframe, this should do it:
df[df['256'] != df['Z']].index
yielding:
Int64Index([1, 3], dtype='int64')
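A caveat that may matter if the columns can contain missing values: NaN compares as not equal to everything, including another NaN, so a row that is missing in both columns is reported as a mismatch. The sketch below assumes you would rather treat such rows as equal; the sample values and the names naive, both_missing, and strict are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical data where row 1 is missing in both columns.
df = pd.DataFrame({"256": [2.0, np.nan, 4.0, 4.0],
                   "Z": [2.0, np.nan, 4.0, 9.0]})

# Plain comparison: row 1 is flagged because NaN != NaN is True.
naive = df.index[df["256"] != df["Z"]].tolist()
print(naive)   # [1, 3]

# Exclude rows where both values are missing.
both_missing = df["256"].isna() & df["Z"].isna()
strict = df.index[(df["256"] != df["Z"]) & ~both_missing].tolist()
print(strict)  # [3]
```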

