pandas: how to select all elements in a DataFrame greater than a given value
原文地址: http://stackoverflow.com/questions/50865987/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to select all elements greater than a given value in a DataFrame
Asked by Adi
I have a CSV file that is read by my Python code, and a DataFrame is created from it using pandas.
The CSV file is in the following format:
1 1.0
2 99.0
3 20.0
7 63
My code calculates the percentile, and I want to find all rows whose value in the 2nd column is greater than 60.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')
percentile = df.iloc[:, 1:2].quantile(0.99) # Selecting 2nd column and calculating percentile
criteria = df[df.iloc[:, 1:2] >= 60.0]
While my percentile code works fine, the criteria meant to find all rows whose 2nd-column value is greater than 60 returns
NaN NaN
NaN NaN
NaN NaN
NaN NaN
Can you please help me find the error?
Answered by GianAnge
Just correct the condition inside criteria. Since the second column is at position 1, you should write df.iloc[:,1].
Example:
import pandas as pd
import numpy as np
b =np.array([[1,2,3,7], [1,99,20,63] ])
df = pd.DataFrame(b.T) #just creating the dataframe
criteria = df[ df.iloc[:,1]>= 60 ]
print(criteria)
Why?
The cause seems to lie in the type of the condition. Let's inspect.
Case 1:
type( df.iloc[:,1]>= 60 )
Returns pandas.core.series.Series, so it gives
df[ df.iloc[:,1]>= 60 ]
#out:
0 1
1 2 99
3 7 63
Case 2:
type( df.iloc[:,1:2]>= 60 )
Returns a pandas.core.frame.DataFrame, and gives
df[ df.iloc[:,1:2]>= 60 ]
#out:
0 1
0 NaN NaN
1 NaN 99.0
2 NaN NaN
3 NaN 63.0
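Therefore the type of the condition changes how df[...] is processed: a boolean Series selects whole rows, whereas a boolean DataFrame masks element by element, so non-matching cells become NaN and no rows are dropped.
Always keep in mind that 3 is a scalar and 3:4 is a slice, so in iloc the former selects a Series and the latter a one-column DataFrame.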
For more information, it is always good to take a look at the official pandas indexing documentation.
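The difference between the two cases can be checked directly. The following is a minimal sketch using the same toy data as the example above; the variable names (series_mask, frame_mask, rows, masked) are made up for illustration. It also compares the DataFrame-mask result with df.where, which performs the same kind of element-wise masking.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above.
df = pd.DataFrame(np.array([[1, 2, 3, 7], [1, 99, 20, 63]]).T)

series_mask = df.iloc[:, 1] >= 60    # boolean Series, one value per row
frame_mask = df.iloc[:, 1:2] >= 60   # boolean DataFrame with a single column

# A boolean Series selects whole rows.
rows = df[series_mask]

# A boolean DataFrame masks element by element and keeps the original shape;
# cells that are not covered by a True value become NaN.
masked = df[frame_mask]

print(rows.shape)
print(masked.shape)
print(masked.equals(df.where(frame_mask)))
```

The row-filtered result has fewer rows than df, while the masked result keeps every row and only blanks out cells.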
Answered by An economist
Your indexing is a bit off: you only have two columns, [0, 1], and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:
criteria = df[df.iloc[:, 1] >= 60.0]
However, I would consider naming the columns and referencing them by name. This helps you avoid mistakes if the structure of your df changes, e.g.:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})
criteria = df[df['b'] >= 60.0]
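Column names can also be assigned while reading the file. This is a sketch only: body is a hypothetical stand-in for the CSV bytes in the question (assumed comma-separated, with no header row), and the names "id" and "value" are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV body from the question.
body = b"1,1.0\n2,99.0\n3,20.0\n7,63\n"

# "id" and "value" are made-up column names; header=None says the file has no header row.
df = pd.read_csv(io.BytesIO(body), header=None, names=["id", "value"], encoding="latin1")

# Filter rows by column name instead of by position.
criteria = df[df["value"] >= 60.0]
print(criteria)
```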
Answered by Neroksi
People here seem to be more interested in coming up with alternative solutions than in digging into his code to find out what's really wrong. I will adopt a diametrically opposed strategy!
The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.
df.iloc[:, 1:2] >= 60.0 # Return a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0 # Return a Series
df.iloc[:, [1]] >= 60.0 # Return a DataFrame with one boolean column
So correct your code by using:
criteria = df[df.iloc[:, 1] >= 60.0] # Don't slice!
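The three return types can be verified with a small sketch. The DataFrame below is an assumed reconstruction of the question's data, and the names frame_mask, threshold and above are illustrative only. It also shows one possible way to turn a one-column boolean DataFrame back into a Series, and that integer indexing makes quantile return a single number that can serve as the threshold.

```python
import pandas as pd

# Assumed reconstruction of the question's data, with default integer column labels.
df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1.0, 99.0, 20.0, 63.0]})

print(type(df.iloc[:, 1:2] >= 60.0).__name__)  # slice
print(type(df.iloc[:, 1] >= 60.0).__name__)    # integer
print(type(df.iloc[:, [1]] >= 60.0).__name__)  # list

# If a one-column boolean DataFrame is all you have, taking its only column
# yields a boolean Series that works as a row filter.
frame_mask = df.iloc[:, 1:2] >= 60.0
criteria = df[frame_mask.iloc[:, 0]]

# With integer indexing, quantile returns a scalar rather than a Series,
# so it can be compared against the column directly.
threshold = df.iloc[:, 1].quantile(0.99)
above = df[df.iloc[:, 1] >= threshold]
print(criteria)
print(above)
```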