Pandas: balancing data
Source: http://stackoverflow.com/questions/45839316/
Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow.
Asked by dokondr
Note: This question is not the same as the one answered here: "Pandas: sample each group after groupby"
I am trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:
df['class'].value_counts()
c1 9170
c2 5266
c3 4523
c4 2193
c5 1956
c6 1896
c7 1580
c8 1407
c9 1324
I need a random sample of each class (c1, c2, ..., c9) where the sample size equals the size of the class with the fewest instances. In this example, the sample size should be the size of class c9, which is 1324.
Is there a simple way to do this with Pandas?
Update
To clarify my question, in the table above:
c1 9170
c2 5266
c3 4523
...
The numbers are counts of instances of the classes c1, c2, c3, ..., so the actual data looks like this:
c1 'foo'
c2 'bar'
c1 'foo-2'
c1 'foo-145'
c1 'xxx-07'
c2 'zzz'
...
etc.
Update 2
To clarify further:
import pandas as pd

d = {'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
     'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3]
     }
df = pd.DataFrame(d)
class val
0 c1 1
1 c2 2
2 c1 1
3 c1 1
4 c2 2
5 c1 1
6 c1 1
7 c2 2
8 c3 3
9 c3 3
df['class'].value_counts()
c1 5
c2 3
c3 2
Name: class, dtype: int64
g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()))
class val
class
c1 6 c1 1
5 c1 1
c2 4 c2 2
1 c2 2
c3 9 c3 3
8 c3 3
Looks like this works. Main questions:
How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:
- What is passed to the lambda as x in this case?
- What is g in g.size()?
- Why does the output contain the numbers 6, 5, 4, 1, 8, 9? What do they mean?
Answered by piRSquared
g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)
class val
0 c1 1
1 c1 1
2 c2 2
3 c2 2
4 c3 3
5 c3 3
Answers to your follow-up questions
- The x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.
- g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and use size to get the value counts, which saves time.
- Those numbers are the index values from df that go with the sampling. I added reset_index(drop=True) to get rid of them.
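The three points above can be checked by hand. The following is only a sketch on the small example frame from the question: it iterates over the groupby object instead of calling apply, and the names sizes, n, pieces and original_labels, as well as the fixed random_state, are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby('class')

# g.size() is a Series of row counts per class; its minimum is the sample size.
sizes = g.size()
n = int(sizes.min())

# Iterating over g yields (group name, sub-DataFrame) pairs. Each sub-DataFrame
# is the same kind of object the lambda receives as x, and it keeps the row
# labels it had in df.
pieces = []
for name, x in g:
    # random_state is an added assumption so the sketch is repeatable
    pieces.append(x.sample(n, random_state=0))

# The combined result still carries the original row labels of df ...
balanced = pd.concat(pieces)
original_labels = list(balanced.index)

# ... and reset_index(drop=True) replaces them with 0..len-1.
balanced = balanced.reset_index(drop=True)

print(sizes)
print(original_labels)
print(balanced)
```

Printing original_labels shows which rows of df were drawn, which is what the numbers in the question's output were.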
Answered by Samuel Nde
The above answer is correct, but I would like to point out that the g above is not a Pandas DataFrame object, which is what the user most likely wants. It is a pandas.core.groupby.groupby.DataFrameGroupBy object. To see this, try calling head on g; the result will be as shown below.
import pandas as pd
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
'val': [1,2,1,1,2,1,1,2,3,3]
}
d = pd.DataFrame(d)
g = d.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g.head()
>>> class val
0 c1 1
1 c2 2
2 c1 1
3 c1 1
4 c2 2
5 c1 1
6 c1 1
7 c2 2
8 c3 3
9 c3 3
To fix this, we need to convert g into a Pandas DataFrame after grouping the data, as follows:
g = d.groupby('class')
g = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True)))
Calling head now yields:
g.head()
>>>class val
0 c1 1
1 c2 2
2 c1 1
3 c1 1
4 c2 2
Which is most likely what the user wants.
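The distinction between the groupby object and the sampled result can be made explicit with a type check. This is a minimal sketch on the same example data; the variable name balanced, the fixed random_state, and the use of pd.concat over the groups in place of apply are assumptions, not part of the answer above.

```python
import pandas as pd

d = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = d.groupby('class')
n = int(g.size().min())

# g is a groupby object, not a DataFrame; sampling never changes it in place.
g_type_name = type(g).__name__
g_is_frame = isinstance(g, pd.DataFrame)

# The sampled rows only exist in the object returned by the sampling step,
# so that result has to be bound to a name of its own.
balanced = pd.concat(
    [x.sample(n, random_state=0) for _, x in g]  # random_state: assumption
).reset_index(drop=True)

print(g_type_name, g_is_frame)
print(type(balanced).__name__)
print(balanced.head())
```

Keeping the result in a separate name such as balanced, rather than rebinding g, leaves the groupby object available for further use.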
Answered by Jhon Intriago Thoth
This method randomly gets k elements from each class.
def sampling_k_elements(group, k=3):
    # Groups smaller than k are returned whole; otherwise draw k rows at random.
    if len(group) < k:
        return group
    return group.sample(k)

balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)
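As a possible alternative, pandas 1.1 and later provide a sample method directly on groupby objects, which can replace the helper function when every class has at least k rows. The sketch below assumes such a pandas version and picks k as the smallest class size; the names k and counts and the random_state value are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Use the smallest class size as k so no group is asked for more rows than it has.
k = int(df['class'].value_counts().min())

# DataFrameGroupBy.sample draws n rows from each group (pandas >= 1.1).
balanced = (
    df.groupby('class')
      .sample(n=k, random_state=42)  # random_state: assumption for repeatability
      .reset_index(drop=True)
)

counts = balanced['class'].value_counts()
print(counts)
```

Unlike sampling_k_elements, this call raises an error when a group has fewer than n rows and sampling is done without replacement, so it does not silently keep small groups whole.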
Answered by Black Panter
"The following code works for undersampling unbalanced classes, but it is rather long; sorry for that. Try it! It also works the same way for upsampling problems. Good luck!"
Import required sampling libraries
from sklearn.utils import resample
Define the majority and minority class
df_minority9 = df[df['class']=='c9']
df_majority1 = df[df['class']=='c1']
df_majority2 = df[df['class']=='c2']
df_majority3 = df[df['class']=='c3']
df_majority4 = df[df['class']=='c4']
df_majority5 = df[df['class']=='c5']
df_majority6 = df[df['class']=='c6']
df_majority7 = df[df['class']=='c7']
df_majority8 = df[df['class']=='c8']
Undersample the majority classes
# replace=False samples without replacement, so no row is drawn twice
# when undersampling (use replace=True only when upsampling).
maj_class1 = resample(df_majority1,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class2 = resample(df_majority2,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class3 = resample(df_majority3,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class4 = resample(df_majority4,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class5 = resample(df_majority5,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class6 = resample(df_majority6,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class7 = resample(df_majority7,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class8 = resample(df_majority8,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
Combine minority class with undersampled majority class
df = pd.concat([df_minority9, maj_class1, maj_class2, maj_class3, maj_class4, maj_class5, maj_class6, maj_class7, maj_class8])
Display new balanced class counts
df['class'].value_counts()
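The per-class steps above repeat the same call, so they could be written as a loop over the groups. This is a sketch under assumptions rather than the answer's own code: it uses the small example frame from the question, computes the target size instead of hard-coding 1324, and the names n_min, parts and balanced are illustrative.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Target size: the number of rows in the smallest class.
n_min = int(df['class'].value_counts().min())

parts = []
for label, group in df.groupby('class'):
    # replace=False draws without replacement, which is what undersampling needs.
    parts.append(
        resample(group, replace=False, n_samples=n_min, random_state=123)
    )

balanced = pd.concat(parts).reset_index(drop=True)
print(balanced['class'].value_counts())
```

For upsampling, the same loop would use replace=True and the size of the largest class as n_samples, since the smaller classes then have to repeat rows.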