Remove low frequency values from pandas.dataframe
Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/32511061/
Asked by Gilaztdinov Rustam
How can I remove values that occur rarely, i.e. with a low frequency, from a column in a pandas.DataFrame? Example:
In [4]: df[col_1].value_counts()
Out[4]: 0 189096
1 110500
2 77218
3 61372
...
2065 1
2067 1
1569 1
dtype: int64
So, my question is: how can I remove values like 2065, 2067, 1569 and others? And how can I do this for ALL columns whose .value_counts() looks like this?
UPDATE: By 'low' I mean values like 2065. This value occurs in col_1 only 1 (one) time, and I want to remove values like this.
Answered by thecircus
I see there are two ways you might want to do this.
For the entire DataFrame
This method removes the values that occur infrequently in the entire DataFrame. We can do it without loops, using built-in functions to speed things up.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10  # Anything that occurs this many times or fewer will be removed.
value_counts = df.stack().value_counts()  # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
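To make the effect visible on deterministic data, here is a minimal sketch of the same whole-DataFrame idea. It is not the answer's exact code: it counts values by flattening the frame with `to_numpy().ravel()` instead of `stack()`, and masks with `where`/`isin` instead of `replace`. The small frame and the name `masked` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: 1 and 2 each occur three times overall, 3 and 4 once.
df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [1, 2, 2, 4]})
threshold = 1  # values occurring this many times or fewer are masked

# Count every value across the whole frame by flattening it first.
counts = pd.Series(df.to_numpy().ravel()).value_counts()
rare = counts[counts <= threshold].index

# Keep cells whose value is not rare; the others become NaN.
masked = df.where(~df.isin(list(rare)))
print(masked)
```

Because this returns a new frame rather than modifying `df` in place, the original data stays available for comparison.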
Column-by-column
This method removes the entries that occur infrequently in each column.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10  # Anything that occurs this many times or fewer will be removed.
for col in df.columns:
    value_counts = df[col].value_counts()  # Specific column
    to_remove = value_counts[value_counts <= threshold].index
    # Assign the result back; an in-place replace on df[col] may act on a copy.
    df[col] = df[col].replace(to_remove, np.nan)
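The column-by-column idea can also be written without an explicit loop. The sketch below is one possible variant, not the answer's code: the helper `mask_rare` and the toy frame are hypothetical names introduced here. It maps each cell to the count of its value within its own column and keeps only cells whose count exceeds the threshold.

```python
import pandas as pd

def mask_rare(col: pd.Series, threshold: int) -> pd.Series:
    """Return col with values occurring `threshold` times or fewer set to NaN."""
    # Map every cell to the number of times its value occurs in this column.
    counts = col.map(col.value_counts())
    return col.where(counts > threshold)

# Hypothetical toy data: in A, 1 and 2 occur once; in B, 7 occurs once.
df = pd.DataFrame({"A": [0, 0, 0, 1, 2], "B": [5, 5, 6, 6, 7]})
result = df.apply(mask_rare, threshold=1)
print(result)
```

`DataFrame.apply` passes the extra `threshold` keyword through to the helper, so the same function works for any number of columns.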
Answered by Alexander
You probably don't want to remove the entire row in your DataFrame if only one column has values below your threshold, so I've simply removed these data points and replaced them with None.
I loop through each column and perform a value_counts on each. I then get the index values for each item that occurs at or below the target threshold. Finally, I use .loc to locate these elements in the column and then replace them with None.
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c'],
                   'C': ['a', 'a', 'b', 'b', 'c']})
>>> df
A B C
0 a a a
1 b a a
2 b b b
3 c c b
4 c c c
threshold = 1  # Remove items that occur this many times or fewer
for col in df:
    vc = df[col].value_counts()
    vals_to_remove = vc[vc <= threshold].index.values
    # Use a single .loc call so the assignment reaches df itself, not a copy.
    df.loc[df[col].isin(vals_to_remove), col] = None
>>> df
A B C
0 None a a
1 b a a
2 b None b
3 c c b
4 c c None
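For contrast, if dropping whole rows is acceptable after all, a row filter can be built from the same per-column counts. This is only a sketch of that alternative, not part of the answer; the names `keep` and `filtered` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'b', 'c', 'c'],
                   'B': ['a', 'a', 'b', 'c', 'c'],
                   'C': ['a', 'a', 'b', 'b', 'c']})
threshold = 1  # a value occurring this many times or fewer counts as rare

# Start by keeping every row, then rule out rows holding a rare value.
keep = pd.Series(True, index=df.index)
for col in df:
    vc = df[col].value_counts()
    keep &= df[col].map(vc) > threshold

filtered = df[keep]  # only rows 1 and 3 contain no rare value
print(filtered)
```

On this example three of the five rows are lost, which illustrates why replacing individual cells is often the gentler choice.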

