Remove low-frequency values from a pandas.DataFrame

Question

Asked by Gilaztdinov Rustam

How can I remove values that occur rarely (i.e. with low frequency) from a column in a pandas.DataFrame? Example:

In [4]: df[col_1].value_counts()

Out[4]: 0       189096
        1       110500
        2        77218
        3        61372
              ...
        2065         1
        2067         1
        1569         1
        dtype: int64

So, my question is: how do I remove values like 2065, 2067, 1569 and others? And how can I do this for ALL columns whose .value_counts() looks like this?

UPDATE: By 'low' I mean values like 2065. This value occurs in col_1 1 (one) time, and I want to remove values like this.

Answer 1

Answered by thecircus

I see two ways you might want to do this.

For the entire DataFrame

This method removes values that occur infrequently across the entire DataFrame, counting occurrences in all columns together. It needs no explicit loop; the built-in functions do the work, which keeps it fast.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
         columns = ['A', 'B'])

threshold = 10  # Values that occur this many times or fewer are replaced with NaN.
value_counts = df.stack().value_counts()  # Counts across the entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
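With the random data above, every value tends to occur more than 10 times, so the effect can be hard to see. The following is a minimal sketch on a small hand-made frame (the data, the threshold of 2, and the names `counts`, `rare` and `masked` are assumptions for illustration). It counts values the same way, but masks them with `DataFrame.where` and `DataFrame.isin` instead of `replace`, and returns a new frame rather than modifying `df` in place.

```python
import pandas as pd

# Hypothetical toy data: across the whole frame, 7 occurs once and 9 occurs twice.
df = pd.DataFrame({"A": [1, 1, 2, 2, 7], "B": [1, 2, 2, 9, 9]})

threshold = 2  # values occurring this many times or fewer are masked

# Count every value over all columns together.
counts = df.stack().value_counts()
rare = counts[counts <= threshold].index.tolist()

# where() keeps cells for which the condition is True and puts NaN elsewhere.
masked = df.where(~df.isin(rare))
print(masked)
```

Here 1 occurs three times and 2 occurs four times, so only the cells holding 7 and 9 become NaN.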

Column-by-column

This method removes entries that occur infrequently within each column, counting each column separately.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
         columns = ['A', 'B'])

threshold = 10  # Values that occur this many times or fewer are replaced with NaN.
for col in df.columns:
    value_counts = df[col].value_counts()  # Counts within this column only
    to_remove = value_counts[value_counts <= threshold].index
    # Assign the result back; an in-place replace on df[col] may act on a copy.
    df[col] = df[col].replace(to_remove, np.nan)
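The per-column version can also be written without an explicit `for` loop. The sketch below is one possible variant, not part of the original answer: it maps every cell to the number of times its value occurs in its own column and then masks the cells whose count is at or below the threshold. The toy data and the names `cell_counts` and `masked` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: in column A, the values 1 and 5 each occur once.
df = pd.DataFrame({"A": [0, 0, 0, 1, 5], "B": [3, 3, 4, 4, 4]})

threshold = 1  # values occurring this many times or fewer are masked

# For each column, look up every cell's value in that column's value_counts().
cell_counts = df.apply(lambda s: s.map(s.value_counts()))

# Keep cells whose value is frequent enough; the rest become NaN.
masked = df.where(cell_counts > threshold)
print(masked)
```

Column B is left untouched because both of its values occur more than once.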

Answer 2

Answered by Alexander

You probably don't want to remove an entire row from your DataFrame when only one of its columns holds a value at or below your threshold, so I've simply removed those data points and replaced them with None.

I loop through each column and perform a value_counts on it. I then get the index values of the items that occur at or below the target threshold. Finally, I use .loc to locate those values in the column and replace them with None.

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'], 
                   'B': ['a', 'a', 'b', 'c', 'c'], 
                   'C': ['a', 'a', 'b', 'b', 'c']})

>>> df
   A  B  C
0  a  a  a
1  b  a  a
2  b  b  b
3  c  c  b
4  c  c  c

threshold = 1  # Remove items that occur this many times or fewer
for col in df:
    vc = df[col].value_counts()
    vals_to_remove = vc[vc <= threshold].index.values
    # Index rows and column in a single .loc call to avoid chained assignment.
    df.loc[df[col].isin(vals_to_remove), col] = None

>>> df
      A     B     C
0  None     a     a
1     b     a     a
2     b  None     b
3     c     c     b
4     c     c  None
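If dropping whole rows is what you actually want, the same per-column counts can be turned into a row filter. This is a hedged sketch rather than part of the answer above: it reuses the example frame, and the names `cell_counts`, `keep` and `rows_kept` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "b", "c", "c"],
                   "B": ["a", "a", "b", "c", "c"],
                   "C": ["a", "a", "b", "b", "c"]})

threshold = 1  # values occurring this many times or fewer count as rare

# Map every cell to how often its value occurs within its own column.
cell_counts = df.apply(lambda s: s.map(s.value_counts()))

# Keep a row only if none of its cells holds a rare value.
keep = (cell_counts > threshold).all(axis=1)
rows_kept = df[keep]
print(rows_kept)
```

Rows 0, 2 and 4 each contain one value that occurs only once in its column, so only rows 1 and 3 remain.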
