pandas: how to select all elements in a DataFrame greater than a given value

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/50865987/

Date: 2020-09-14 05:42:10  Source: igfitidea

How to select all elements greater than a given value in a DataFrame

python, pandas

Asked by Adi

I have a CSV file that my Python code reads into a pandas DataFrame.

The CSV file is in the following format:

1     1.0
2     99.0
3     20.0
7     63

My code calculates a percentile, and I want to find all rows whose value in the 2nd column is greater than 60.

df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')

percentile = df.iloc[:, 1:2].quantile(0.99)  # Selecting 2nd column and calculating percentile

criteria = df[df.iloc[:, 1:2] >= 60.0]

While my percentile code works fine, the criteria for finding all rows where column 2's value is greater than 60 returns:

NaN     NaN
NaN     NaN
NaN     NaN
NaN     NaN

Can you please help me find the error?

Answered by GianAnge

Just correct the condition inside criteria. Since the second column is at position 1, you should write df.iloc[:, 1].
Example:

import pandas as pd
import numpy as np
b =np.array([[1,2,3,7], [1,99,20,63] ])

df = pd.DataFrame(b.T) #just creating the dataframe


criteria = df[ df.iloc[:,1]>= 60 ]     
print(criteria)

Why? The cause lies in the type of the condition. Let's inspect both cases.

Case 1:

type( df.iloc[:,1]>= 60 )

Returns pandas.core.series.Series, so it gives

 df[ df.iloc[:,1]>= 60 ]

 #out:
   0   1
1  2  99
3  7  63

Case 2:

type( df.iloc[:,1:2]>= 60 )

Returns a pandas.core.frame.DataFrame, and gives

df[ df.iloc[:,1:2]>= 60 ]

#out:
    0     1
0 NaN   NaN
1 NaN  99.0
2 NaN   NaN
3 NaN  63.0

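So the type of the mask changes how the DataFrame is indexed: a boolean Series selects whole rows, while a boolean DataFrame is applied element by element, leaving NaN wherever the mask is False or has no matching column.
Always keep in mind that 3 is a scalar, whereas 3:4 is a slice.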
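
The two cases can be checked side by side. The following is a minimal sketch, assuming the same four-row frame as in the example above; the variable names row_mask, cell_mask, filtered and masked are only illustrative. It suggests that indexing with a boolean DataFrame behaves like DataFrame.where, which keeps the original shape, whereas a boolean Series filters rows.

```python
import numpy as np
import pandas as pd

# Same data as the example above (assumed for illustration)
df = pd.DataFrame(np.array([[1, 2, 3, 7], [1, 99, 20, 63]]).T)

row_mask = df.iloc[:, 1] >= 60      # boolean Series: one flag per row
cell_mask = df.iloc[:, 1:2] >= 60   # boolean DataFrame: one flag per cell of column 1

filtered = df[row_mask]             # keeps only the rows whose flag is True
masked = df[cell_mask]              # keeps the shape, NaN where the mask is False or absent

print(filtered.shape)
print(masked.shape)
print(masked.equals(df.where(cell_mask)))
```

If the goal is to drop rows, the Series form is the one to use; the DataFrame form is more suited to blanking out individual cells.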

For more information, it is always worth taking a look at the official pandas documentation on indexing.

Answered by An economist

Your indexing is a bit off: you only have two columns, [0, 1], and you are interested in selecting just the one at index 1. As @applesoup mentioned, the following is enough:

criteria = df[df.iloc[:, 1] >= 60.0]

However, I would consider naming the columns and referencing them by name. This helps you avoid mistakes if the structure of your df changes, e.g.:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})

criteria = df[df['b'] >= 60.0]
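
For a header-less CSV like the one in the question, the names can also be assigned while reading. This is only a sketch: the byte string below is a hypothetical stand-in for the question's body, assumed to be comma-separated, and the column names "id" and "value" are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV body from the question
body = b"1,1.0\n2,99.0\n3,20.0\n7,63\n"

# names= assigns column labels while reading a file that has no header row
df = pd.read_csv(io.BytesIO(body), header=None, names=["id", "value"])

# Filter by column name instead of by position
criteria = df[df["value"] >= 60.0]
print(criteria)
```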

Answered by Neroksi

People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!

The problem with your code is that you are indexing your DataFrame df with another DataFrame. Why? Because you use a slice instead of integer indexing.

df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column

So correct your code by using:

criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!
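
The same slice-versus-integer distinction applies to the percentile from the question. As a rough sketch, assuming the four rows shown in the question and using the illustrative names per_column, threshold and above, the quantile of a sliced selection comes back as a Series, while the quantile of a single column is a scalar that can be used directly as a threshold:

```python
import pandas as pd

# Data assumed from the question's sample rows
df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1.0, 99.0, 20.0, 63.0]})

per_column = df.iloc[:, 1:2].quantile(0.99)  # Series: one quantile per selected column
threshold = df.iloc[:, 1].quantile(0.99)     # scalar: quantile of a single column

# Rows whose second column is at or above the 99th percentile
above = df[df.iloc[:, 1] >= threshold]
print(type(per_column).__name__, threshold)
print(above)
```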