How to use pandas filter with IQR?
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/34782063/
Asked by Qijun Liu
Is there a built-in way to filter a column by IQR (i.e. keep values between Q1 - 1.5*IQR and Q3 + 1.5*IQR)? Any other suggestions for generalized filtering in pandas would also be appreciated.
Answered by Romain
As far as I know, the most compact notation is provided by the query method.
import numpy as np
import pandas as pd

# Some test data
np.random.seed(33454)
df = pd.concat(
    [
        # A uniform sample of integers in [0, 100)
        pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
        # Adding some outliers
        pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
    ],
    # Resetting the index (pd.concat replaces DataFrame.append, removed in pandas 2.0)
    ignore_index=True,
)
# Computing IQR
Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1
# Filtering values between Q1 - 1.5*IQR and Q3 + 1.5*IQR
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')
Then we can plot the result to check the difference. The outlier in the left boxplot (the cross at 183) no longer appears in the filtered series.
# Plotting the result to check the difference
df.join(filtered, rsuffix='_filtered').boxplot()
Since writing this answer, I've written a post on this topic where you may find more information.
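For a quick check without plotting, the same query can be run on a small hand-made column. This is only a sketch: the values below are hypothetical and chosen so that one point (183) lies far outside the fences.

```python
import pandas as pd

# Hypothetical toy data: nine ordinary values and one obvious outlier
df = pd.DataFrame({'nb': [10, 12, 13, 15, 16, 18, 20, 21, 23, 183]})

Q1 = df['nb'].quantile(0.25)   # 13.5 with the default linear interpolation
Q3 = df['nb'].quantile(0.75)   # 20.75
IQR = Q3 - Q1                  # 7.25

# Keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

# Rows that the filter dropped
removed = df.loc[~df.index.isin(filtered.index)]
print(removed)
```

Listing the dropped rows this way makes it easy to see what the filter treats as an outlier before committing to it.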
Answered by kd88
This will give you the subset of df whose values in the column named column lie within the IQR-based bounds:
def subset_by_iqr(df, column, whisker_width=1.5):
    """Remove outliers from a dataframe by column, including optional
    whiskers, removing rows for which the column value is less than
    Q1 - whisker_width*IQR or greater than Q3 + whisker_width*IQR.
    Args:
        df (`:obj:pd.DataFrame`): A pandas dataframe to subset
        column (str): Name of the column to calculate the subset from.
        whisker_width (float): Optional, loosen the IQR filter by a
            factor of `whisker_width` * IQR.
    Returns:
        (`:obj:pd.DataFrame`): Filtered dataframe
    """
    # Calculate Q1, Q3 and IQR
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # Apply filter with respect to IQR, including optional whiskers
    # (named `mask` to avoid shadowing the built-in `filter`)
    mask = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr)
    return df.loc[mask]

# Example for whiskers = 1.5, as requested by the OP
df_filtered = subset_by_iqr(df, 'column_name', whisker_width=1.5)
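The question also asks about more general filtering. One possible extension, sketched here under the assumption that a row should be kept only if it is inside the bounds for every listed column, computes the bounds for several columns at once. The function name subset_by_iqr_all and the toy data are hypothetical and not part of the original answer.

```python
import pandas as pd

def subset_by_iqr_all(df, columns, whisker_width=1.5):
    """Keep rows whose values are within the IQR bounds in every listed column."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker_width * iqr
    upper = q3 + whisker_width * iqr
    # Compare each column against its own bounds (bounds are aligned on column names)
    within = df[columns].ge(lower, axis=1) & df[columns].le(upper, axis=1)
    return df.loc[within.all(axis=1)]

# Hypothetical toy data: row 4 is an outlier in 'a', row 3 is an outlier in 'b'
toy = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [10, 11, 12, -50, 13]})
kept = subset_by_iqr_all(toy, ['a', 'b'])
print(kept)
```

Requiring all columns to pass is a design choice; swapping within.all(axis=1) for within.any(axis=1) would keep rows that pass in at least one column.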
Answered by MLenthousiast
Another approach using Series.between():
iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]
Spelled out:
q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
# inclusive='both' replaces inclusive=True, which newer pandas versions no longer accept
mask = df['col'].between(q1, q3, inclusive='both')
iqr = df.loc[mask, 'col']
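Note that the snippet above keeps only the values between Q1 and Q3. To get the range from the question (Q1 - 1.5*IQR to Q3 + 1.5*IQR), the same between() call can be given the widened bounds. A minimal sketch, assuming a hypothetical column 'col' and pandas 1.3 or later for the string form of inclusive:

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier (50)
df = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]})

q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
iqr = q3 - q1

# Widen the interval by 1.5 * IQR on each side before calling between()
mask = df['col'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr, inclusive='both')
inside_fences = df.loc[mask, 'col']
print(inside_fences.tolist())
```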
Answered by Ayyasamy
You can also try the code below, which is based on the IQR as well. Using the IQR and the resulting lower and upper bounds, it replaces the outliers in each column with the nearest bound. The code goes through the numeric columns of the data frame one by one and modifies only the outliers, rather than scanning every value row by row to find them.
Function:
def mod_outlier(df):
    # Work on a copy so the caller's dataframe is left untouched
    df1 = df.copy()
    # Quantiles are only defined for numeric columns
    numeric = df1.select_dtypes(include='number')
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Cap each numeric column at its own lower and upper bound
    for col in numeric.columns:
        df1[col] = numeric[col].clip(lower=lower_bound[col], upper=upper_bound[col])
    return df1
Function call:
df = mod_outlier(df)
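Unlike the filtering answers, capping keeps every row and only changes the outlying values. The sketch below illustrates this with DataFrame.clip on a hypothetical frame that mixes a numeric and a text column; it is a compact variant for illustration, not the answer's own function.

```python
import pandas as pd

# Hypothetical toy data: 100.0 is an outlier, 'label' is non-numeric
df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 100.0], 'label': list('abcde')})

numeric = df.select_dtypes(include='number')
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1

capped = df.copy()
# clip() with axis=1 aligns the per-column bounds with the column names
capped[numeric.columns] = numeric.clip(lower=q1 - 1.5 * iqr,
                                       upper=q3 + 1.5 * iqr,
                                       axis=1)
print(capped)
```

Here the bounds for x are -1 and 7, so the value 100.0 becomes 7.0 while the row count and the label column stay unchanged.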
Answered by danek
Another approach uses Series.clip:
q = s.quantile([.25, .75])
s = s[~s.clip(*q).isin(q)]
Here are the details:
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100))
q = s.quantile([.25, .75])  # calculate lower and upper bounds
s = s.clip(*q)  # assigns values outside boundary to boundary values
s = s[~s.isin(q)]  # take only observations within bounds
Using it to filter a whole dataframe df is straightforward:
def iqr(df, colname, bounds=[.25, .75]):
    s = df[colname]
    q = s.quantile(bounds)
    return df[~s.clip(*q).isin(q)]
Note: this method keeps only the values strictly between the two quantiles, so the boundaries themselves are excluded.
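The boundary behaviour can be seen on a tiny series whose quartiles coincide with actual data points. This is a sketch with hypothetical values; the comparison with Series.between assumes pandas 1.3 or later for the string form of inclusive.

```python
import pandas as pd

# Hypothetical toy data: Q1 is exactly 2 and Q3 is exactly 4
s = pd.Series([1, 2, 3, 4, 5])
q = s.quantile([.25, .75])

# clip-based filter: values equal to a quantile are dropped as well
strict = s[~s.clip(*q).isin(q)]

# For comparison, between() lets you choose whether the bounds are kept
also_strict = s[s.between(q[.25], q[.75], inclusive='neither')]
with_bounds = s[s.between(q[.25], q[.75], inclusive='both')]

print(strict.tolist(), also_strict.tolist(), with_bounds.tolist())
```

If values equal to the quantiles should be kept, the between() form with inclusive='both' is the simpler choice.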