Python: How to use pandas filter with IQR?

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/34782063/

Time: 2020-08-19 15:30:48  Source: igfitidea

how to use pandas filter with IQR?

Tags: python, pandas, data-processing, iqr

Asked by Qijun Liu

Is there a built-in way to filter a column by IQR (i.e. keep the values between Q1 - 1.5*IQR and Q3 + 1.5*IQR)? Any other suggestions for generalized filtering in pandas would also be appreciated.

Answered by Romain

As far as I know, the most compact notation is provided by the query method.

import numpy as np
import pandas as pd

# Some test data
np.random.seed(33454)
df = pd.concat(
    [
        # Integers drawn uniformly from [0, 100)
        pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
        # Adding some outliers
        pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
    ],
    # Resetting the index (DataFrame.append was removed in pandas 2.0)
    ignore_index=True,
)

# Computing IQR
Q1 = df['nb'].quantile(0.25)
Q3 = df['nb'].quantile(0.75)
IQR = Q3 - Q1

# Filtering Values between Q1-1.5IQR and Q3+1.5IQR
filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

Then we can plot the result to check the difference. We observe that the outlier in the left boxplot (the cross at 183) does not appear anymore in the filtered series.

# Plotting the result to check the difference
df.join(filtered, rsuffix='_filtered').boxplot()

Figure: comparison before and after filtering

Since writing this answer, I've written a post on this topic where you may find more information.
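
As a quick sanity check of the query approach, here is a minimal sketch on a small hand-made column where the bounds are easy to verify by hand. The data and the names lower, upper and outliers are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical data: eight ordinary values plus one obvious outlier
df = pd.DataFrame({'nb': [10, 12, 13, 15, 16, 18, 20, 21, 183]})

# Quartiles and IQR (pandas interpolates linearly by default)
q1 = df['nb'].quantile(0.25)   # 13.0
q3 = df['nb'].quantile(0.75)   # 20.0
iqr = q3 - q1                  # 7.0

# Precompute the bounds so the query string stays short
lower = q1 - 1.5 * iqr         # 2.5
upper = q3 + 1.5 * iqr         # 30.5

# Rows inside the bounds, and the complementary set of outliers
filtered = df.query('@lower <= nb <= @upper')
outliers = df.query('nb < @lower or nb > @upper')

print(filtered['nb'].tolist())
print(outliers['nb'].tolist())
```

Precomputing the bounds is only a readability choice; the expression inside query can also do the arithmetic, as in the answer above.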

Answered by kd88

This will give you the subset of df whose values in the column named column lie within the IQR-based bounds:

def subset_by_iqr(df, column, whisker_width=1.5):
    """Remove outliers from a dataframe by column, including optional
       whiskers, removing rows for which the column value is less than
       Q1 - whisker_width*IQR or greater than Q3 + whisker_width*IQR.
    Args:
        df (pd.DataFrame): A pandas dataframe to subset.
        column (str): Name of the column to calculate the subset from.
        whisker_width (float): Optional, loosen the IQR filter by a
                               factor of `whisker_width` * IQR.
    Returns:
        pd.DataFrame: Filtered dataframe.
    """
    # Calculate Q1, Q3 and IQR
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # Apply filter with respect to IQR, including optional whiskers
    # (named mask to avoid shadowing the built-in filter)
    mask = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr)
    return df.loc[mask]

# Example for whiskers = 1.5, as requested by the OP
df_filtered = subset_by_iqr(df, 'column_name', whisker_width=1.5)
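
The question also asks about more general filtering. One possible extension of the same idea, sketched here as an assumption rather than as part of the original answer, is to build a single boolean mask over several columns and keep only the rows where every column is within its own bounds. The helper name iqr_mask and the sample data are hypothetical.

```python
import pandas as pd

def iqr_mask(df, columns, whisker_width=1.5):
    """Return a boolean Series that is True where every listed column
    lies within [Q1 - w*IQR, Q3 + w*IQR] for that column."""
    sub = df[columns]
    q1 = sub.quantile(0.25)   # one value per column
    q3 = sub.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker_width * iqr
    upper = q3 + whisker_width * iqr
    # axis=1 aligns the per-column bounds with the column labels
    within = sub.ge(lower, axis=1) & sub.le(upper, axis=1)
    return within.all(axis=1)

# Hypothetical data: row 4 is an outlier in 'a', row 5 is an outlier in 'b'
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 100, 3],
    'b': [10, 11, 12, 13, 12, -50],
})
mask = iqr_mask(df, ['a', 'b'])
df_filtered = df.loc[mask]
print(df_filtered)
```

Returning the mask instead of the filtered frame makes it easy to inspect the rejected rows with df.loc[~mask].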

Answered by MLenthousiast

Another approach using Series.between():

iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]

Spelled out:

q1 = df['col'].quantile(.25)
q3 = df['col'].quantile(.75)
mask = df['col'].between(q1, q3, inclusive='both')
iqr = df.loc[mask, 'col']
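
Note that the snippet above keeps only the values between Q1 and Q3, i.e. the middle half of the data, whereas the question asks for the wider range from Q1 - 1.5*IQR to Q3 + 1.5*IQR. A minimal sketch of both variants with Series.between, on made-up data, could look like this:

```python
import pandas as pd

# Hypothetical data with one obvious outlier
s = pd.Series([10, 12, 13, 15, 16, 18, 20, 21, 183])

q1 = s.quantile(0.25)   # 13.0
q3 = s.quantile(0.75)   # 20.0
iqr = q3 - q1           # 7.0

# between() includes both end points by default
middle_half = s[s.between(q1, q3)]
within_fences = s[s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(middle_half.tolist())
print(within_fences.tolist())
```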

Answered by Ayyasamy

You can also try the code below, which computes the IQR and, from it, a lower and an upper bound for each column, then replaces the outliers in that column with the nearest bound instead of dropping them. The code works through the columns of the data frame one by one and changes only the outliers, rather than scanning every row for outliers.

Function:

    def mod_outlier(df):
        # Work on a copy so the caller's data frame is left untouched
        df1 = df.copy()
        # Quantiles are only defined for the numeric columns
        num = df1.select_dtypes(include='number')

        q1 = num.quantile(0.25)
        q3 = num.quantile(0.75)

        iqr = q3 - q1

        lower_bound = q1 - (1.5 * iqr)
        upper_bound = q3 + (1.5 * iqr)

        # Column by column, replace values outside the bounds with the
        # nearest bound (the original element-wise loop relied on an
        # undefined col_vals and on chained assignment)
        for col in num.columns:
            df1[col] = num[col].clip(lower=lower_bound[col], upper=upper_bound[col])

        return df1

Function call:

df = mod_outlier(df)
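
For comparison, the same capping can probably be done without an explicit loop by passing per-column bounds to DataFrame.clip with axis=1. This is a sketch under the assumption that only the numeric columns should be capped; the data and the name capped are made up.

```python
import pandas as pd

# Hypothetical data: one high outlier in 'a', one low outlier in 'b'
df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 100.0],
    'b': [-50.0, 10.0, 11.0, 12.0, 13.0],
    'label': list('vwxyz'),
})

num = df.select_dtypes(include='number')
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

capped = df.copy()
# axis=1 aligns the bounds (indexed by column name) with the columns
capped[num.columns] = num.clip(lower=q1 - 1.5 * iqr,
                               upper=q3 + 1.5 * iqr,
                               axis=1)
print(capped)
```

Non-numeric columns such as label are carried over unchanged, and the original df is not modified.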

Answered by danek

Another approach uses Series.clip:

q = s.quantile([.25, .75])
s = s[~s.clip(*q).isin(q)]

Here are the details:

s = pd.Series(np.random.randn(100))
q = s.quantile([.25, .75])  # calculate lower and upper bounds
s = s.clip(*q)  # assigns values outside boundary to boundary values
s = s[~s.isin(q)]  # take only observations within bounds

Using it to filter a whole dataframe df is straightforward:

def iqr(df, colname, bounds=(.25, .75)):
    s = df[colname]
    # bounds is a tuple to avoid a mutable default argument
    q = s.quantile(list(bounds))
    return df[~s.clip(*q).isin(q)]

Note: the method excludes the boundaries themselves.
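
To make the note concrete, here is a small sketch on made-up data. With the default bounds the clip-and-isin trick keeps only values strictly between Q1 and Q3, so observations that equal a quartile are dropped as well. The strict comparison on the fences at the end is an assumed adaptation to the range from the question, not part of the original answer.

```python
import pandas as pd

# Hypothetical data whose quartiles coincide with actual observations
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
q = s.quantile([.25, .75])          # 2.0 and 4.0

# Values outside the bounds are clipped onto them, then every value
# equal to a bound is discarded, including the genuine 2.0 and 4.0
kept = s[~s.clip(*q).isin(q)]

# The same selection written as a strict comparison
strict = s[(s > q.iloc[0]) & (s < q.iloc[1])]

# Assumed adaptation: strict comparison against the 1.5*IQR fences
iqr = q.iloc[1] - q.iloc[0]
fenced = s[(s > q.iloc[0] - 1.5 * iqr) & (s < q.iloc[1] + 1.5 * iqr)]

print(kept.tolist(), strict.tolist(), fenced.tolist())
```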
