pandas: how to select all elements in a DataFrame that are greater than a given value

Question

Asked by Adi

I have a csv that is read by my python code and a dataframe is created using pandas.

CSV file is in following format

My code calculates the percentile and wants to find all rows that have the value in 2nd column greater than 60.

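# Note: error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (or 'warn') is the replacement there.
df = pd.read_csv(io.BytesIO(body), error_bad_lines=False, header=None, encoding='latin1', sep=',')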

percentile = df.iloc[:, 1:2].quantile(0.99)  # Selecting 2nd column and calculating percentile

criteria = df[df.iloc[:, 1:2] >= 60.0]

While my percentile code works fine, criteria to find all rows that have column 2's value greater than 60 returns

NaN     NaN
NaN     NaN
NaN     NaN
NaN     NaN

Can you please help me find the error?

Answer 1

Answered by GianAnge

Just correct the condition inside criteria. Since the second column sits at position 1, you should write df.iloc[:, 1].
Example:

import pandas as pd
import numpy as np

b = np.array([[1, 2, 3, 7], [1, 99, 20, 63]])

df = pd.DataFrame(b.T)  # just creating the dataframe

criteria = df[df.iloc[:, 1] >= 60]
print(criteria)

Why? The cause seems to lie in the type of the object used as the condition. Let's inspect:

Case 1:

type( df.iloc[:,1]>= 60 )

Returns pandas.core.series.Series, so it gives

 df[ df.iloc[:,1]>= 60 ]

 #out:
   0   1
1  2  99
3  7  63

Case 2:

type( df.iloc[:,1:2]>= 60 )

Returns a pandas.core.frame.DataFrame, and gives

df[ df.iloc[:,1:2]>= 60 ]

#out:
    0     1
0 NaN   NaN
1 NaN  99.0
2 NaN   NaN
3 NaN  63.0

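Therefore the type of the condition changes how df[...] is processed: a boolean Series selects whole rows, while a boolean DataFrame masks individual cells and leaves NaN wherever the condition is not True (including columns the mask does not cover).
Always keep in mind that in iloc, 3 is a scalar position and returns a Series, while 3:4 is a slice and returns a DataFrame.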
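
To make the two cases easier to compare side by side, here is a minimal sketch that reuses the toy data from the example above. The variable names (row_mask, cell_mask, rows, cells) are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same toy data as in the example above.
df = pd.DataFrame(np.array([[1, 2, 3, 7], [1, 99, 20, 63]]).T)

row_mask = df.iloc[:, 1] >= 60      # boolean Series: one flag per row
cell_mask = df.iloc[:, 1:2] >= 60   # boolean DataFrame: one flag per cell of column 1

rows = df[row_mask]    # whole rows where the flag is True
cells = df[cell_mask]  # same shape as df; cells without a True flag become NaN

print(type(row_mask).__name__, type(cell_mask).__name__)
print(rows)
print(cells)
```

The second result keeps all four rows because nothing is filtered out; values are only blanked, which is why the question's output was full of NaN.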

For more information, it is always worth taking a look at the official pandas indexing documentation.

Answer 2

Answered by An economist

Your indexing is a bit off, since you only have two columns [0, 1] and you are interested in selecting just the one with index 1. As @applesoup mentioned, the following is enough:

criteria = df[df.iloc[:, 1] >= 60.0]

However, I would consider naming the columns and referencing them by name. This helps you avoid mistakes if the structure of your df changes, e.g.:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 7], 'b': [1.0, 99.0, 20.0, 63.]})

criteria = df[df['b'] >= 60.0]
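
Since the question reads a headerless CSV, the column names can probably be assigned at read time. The sketch below assumes a small made-up CSV string (raw) and hypothetical column names ("label", "value"); the real file layout is not shown in the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV in the question, which has no header row.
raw = "a,1.0\nb,99.0\nc,20.0\nd,63.0\n"

# names= assigns column labels at read time; "label" and "value" are made-up names.
df = pd.read_csv(io.StringIO(raw), header=None, names=["label", "value"])

criteria = df[df["value"] >= 60.0]   # boolean Series selects whole rows
same = df.loc[df["value"] >= 60.0]   # equivalent, slightly more explicit form

print(criteria)
```

With names in place, the filter no longer depends on the column's position.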

Answer 3

Answered by Neroksi

People here seem to be more interested in coming up with alternative solutions instead of digging into his code in order to find out what's really wrong. I will adopt a diametrically opposed strategy!

The problem with your code is that you are indexing your DataFrame df by another DataFrame. Why? Because you use slices instead of integer indexing.

df.iloc[:, 1:2] >= 60.0  # Returns a DataFrame with one boolean column
df.iloc[:, 1] >= 60.0    # Returns a Series
df.iloc[:, [1]] >= 60.0  # Returns a DataFrame with one boolean column

So correct your code by using:

criteria = df[df.iloc[:, 1] >= 60.0]  # Don't slice!
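
If the slice has to stay for some reason, one possible workaround is to turn the one-column boolean DataFrame back into a Series before indexing. This is only a sketch with made-up data and variable names (mask_df, mask), not something the original code did.

```python
import pandas as pd

# Made-up data with the same two-column layout as in the question.
df = pd.DataFrame({0: [1, 2, 3, 7], 1: [1.0, 99.0, 20.0, 63.0]})

mask_df = df.iloc[:, 1:2] >= 60.0   # one-column boolean DataFrame
mask = mask_df.iloc[:, 0]           # take its only column as a boolean Series
# mask_df.squeeze(axis=1) or mask_df.any(axis=1) give the same Series here.

criteria = df[mask]                 # now whole rows are selected
print(criteria)
```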
