Pandas: balancing data

Notice: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/45839316/

Date: 2020-09-14 04:18:04  Source: igfitidea

Pandas: balancing data

python, pandas

Asked by dokondr

Note: This question is not the same as an answer here: "Pandas: sample each group after groupby"

I am trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

df['class'].value_counts()

c1    9170
c2    5266
c3    4523
c4    2193
c5    1956
c6    1896
c7    1580
c8    1407
c9    1324

I need to get a random sample of each class (c1, c2, ..., c9), where the sample size equals the size of the class with the fewest instances. In this example the sample size should be the size of class c9, which is 1324.

Is there a simple way to do this with Pandas?

Update

To clarify my question, in the table above:

c1    9170
c2    5266
c3    4523
...

The numbers are counts of instances of classes c1, c2, c3, ..., so the actual data looks like this:

c1 'foo'
c2 'bar'
c1 'foo-2'
c1 'foo-145'
c1 'xxx-07'
c2 'zzz'
...

etc.

Update 2

To clarify more:

import pandas as pd

d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

df = pd.DataFrame(d)

  class  val
0    c1    1
1    c2    2
2    c1    1
3    c1    1
4    c2    2
5    c1    1
6    c1    1
7    c2    2
8    c3    3
9    c3    3

df['class'].value_counts()

c1    5
c2    3
c3    2
Name: class, dtype: int64

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()))

        class  val
class
c1    6    c1    1
      5    c1    1
c2    4    c2    2
      1    c2    2
c3    9    c3    3
      8    c3    3

Looks like this works. Main questions:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what 'lambda' is, but:

  • What is passed to lambda as x in this case?
  • What is g in g.size()?
  • Why does the output contain the numbers 6, 5, 4, 1, 9, 8? What do they mean?

Answered by piRSquared

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3
5    c3    3


Answers to your follow-up questions

  1. The x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.
  2. g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and use size to get the value counts, which saves time.
  3. Those numbers are the index values from df that go with the sampled rows. I added reset_index(drop=True) to get rid of them.
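
The three points above can be made visible with a small sketch that replaces apply with an explicit loop over the groupby object. This is only an illustration on the toy data from the question; the names sizes, n, pieces and original_labels, and the use of random_state, are assumptions added here for repeatability and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby("class")

# g.size() is a Series of row counts per class; its minimum is the sample size.
sizes = g.size()
n = int(sizes.min())

# Iterating over the groupby object yields (key, sub-DataFrame) pairs.
# Each sub-DataFrame is what the lambda receives as x.
pieces = {}
for key, x in g:
    pieces[key] = x.sample(n, random_state=0)  # random_state: assumption, for repeatability

# The sampled rows still carry their original index labels from df.
balanced = pd.concat(list(pieces.values()))
original_labels = sorted(balanced.index)

# reset_index(drop=True) replaces those labels with 0..len-1.
balanced = balanced.reset_index(drop=True)
```

Printing original_labels before the reset shows which rows of df were drawn, which is what the numbers in the question's output were.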

Answered by Samuel Nde

The above answer is correct, but I would like to point out that the g above is not a Pandas DataFrame object, which is what the user most likely wants. It is a pandas.core.groupby.groupby.DataFrameGroupBy object. To see this, try calling head on g and the result will be as shown below.

import pandas as pd
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

d = pd.DataFrame(d)
g = d.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g.head()
>>> class val
0    c1    1
1    c2    2
2    c1    1
3    c1    1
4    c2    2
5    c1    1
6    c1    1
7    c2    2
8    c3    3
9    c3    3

To fix this, we need to convert g into a Pandas DataFrame after grouping the data, as follows:

g = d.groupby('class')
g = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True)))

Calling head now yields:

g.head()

>>>class val
0   c1   1
1   c2   2
2   c1   1
3   c1   1
4   c2   2

Which is most likely what the user wants.
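
A short check of the types involved may help here. The sketch below assumes the toy data from the question and a reasonably recent pandas; the variable names are illustrative only. In the versions I am aware of, apply already returns a DataFrame when the function returns one per group, so the important step is assigning the result to a name rather than expecting g to change.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby("class")
n = int(g.size().min())

# apply returns a new object; g itself is left untouched.
balanced = g.apply(lambda x: x.sample(n))

g_is_frame = isinstance(g, pd.DataFrame)                # False: g is still a groupby object
balanced_is_frame = isinstance(balanced, pd.DataFrame)  # True: the sampled rows as a DataFrame
```

Depending on the pandas version, apply may emit a deprecation warning about grouping columns, or leave the class column out of the result and keep it only in the index.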

Answered by Jhon Intriago Thoth

This method randomly samples k elements from each class; a class with fewer than k rows is returned unchanged.

def sampling_k_elements(group, k=3):
    if len(group) < k:
        return group
    return group.sample(k)

balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)
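
To balance to the smallest class, as the question asks, k can be taken from the value counts instead of being fixed. The sketch below is an alternative and not this answer's method: it assumes pandas 1.1 or later, where the groupby object has its own sample method, and random_state is added only for repeatability.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Size of the smallest class; every class is cut down to this many rows.
k = int(df["class"].value_counts().min())

balanced = (
    df.groupby("class")
      .sample(n=k, random_state=42)  # draws k rows per class, without replacement
      .reset_index(drop=True)
)

counts = balanced["class"].value_counts().to_dict()
```

Because k is the minimum class size, no class can have fewer than k rows, so the length check in sampling_k_elements is not needed in this variant.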

Answered by Black Panter

The following code works for undersampling unbalanced classes, but it is rather long, sorry about that. Try it! It works the same way for upsampling problems. Good luck!

Import required sampling libraries

from sklearn.utils import resample

Define the majority and minority class

df_minority9 = df[df['class']=='c9']
df_majority1 = df[df['class']=='c1']
df_majority2 = df[df['class']=='c2']
df_majority3 = df[df['class']=='c3']
df_majority4 = df[df['class']=='c4']
df_majority5 = df[df['class']=='c5']
df_majority6 = df[df['class']=='c6']
df_majority7 = df[df['class']=='c7']
df_majority8 = df[df['class']=='c8']

Undersample majority class

# replace=False samples without replacement, so undersampling never duplicates a row.
# Use replace=True only when upsampling a class to more rows than it has.
maj_class1 = resample(df_majority1,
                      replace=False,
                      n_samples=1324,     # size of the minority class c9
                      random_state=123)   # fixed seed for reproducibility
maj_class2 = resample(df_majority2,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class3 = resample(df_majority3,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class4 = resample(df_majority4,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class5 = resample(df_majority5,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class6 = resample(df_majority6,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class7 = resample(df_majority7,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class8 = resample(df_majority8,
                      replace=False,
                      n_samples=1324,
                      random_state=123)

Combine minority class with undersampled majority class

df = pd.concat([df_minority9, maj_class1, maj_class2, maj_class3, maj_class4, maj_class5, maj_class6, maj_class7, maj_class8])

Display new balanced class counts

df['class'].value_counts()
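
The nine near-identical steps above can be collapsed into a loop. The sketch below swaps sklearn's resample for pandas' own DataFrame.sample, so it is a different technique from the one in this answer, shown only as a shorter equivalent for the undersampling case. The DataFrame is a synthetic stand-in built from the class counts in the question; the val column and all variable names are assumptions.

```python
import pandas as pd

# Class counts taken from the question; the rows themselves are synthetic.
counts = {"c1": 9170, "c2": 5266, "c3": 4523, "c4": 2193, "c5": 1956,
          "c6": 1896, "c7": 1580, "c8": 1407, "c9": 1324}
df = pd.DataFrame({"class": [c for c, n in counts.items() for _ in range(n)]})
df["val"] = range(len(df))  # placeholder payload column (assumption)

# Target size: the smallest class.
n_min = int(df["class"].value_counts().min())

# Draw n_min rows from every class without replacement, then stack the pieces.
parts = [
    group.sample(n=n_min, replace=False, random_state=123)
    for _, group in df.groupby("class")
]
balanced = pd.concat(parts, ignore_index=True)

balanced_counts = balanced["class"].value_counts().to_dict()
```

For upsampling, the same loop would need replace=True and a larger n, since sampling without replacement cannot return more rows than a class contains.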