Pandas: balancing data

Question

Asked by dokondr

Note: This question is not the same as the one answered here: "Pandas: sample each group after groupby"

I am trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

df['class'].value_counts()

c1    9170
c2    5266
c3    4523
c4    2193
c5    1956
c6    1896
c7    1580
c8    1407
c9    1324

I need a random sample of each class (c1, c2, ..., c9) where the sample size equals the size of the class with the fewest instances. In this example the sample size should be the size of class c9, i.e. 1324.

Is there a simple way to do this with Pandas?

Update


To clarify my question: the numbers in the table above are counts of instances of the classes c1, c2, c3, ..., so the actual data looks like this:

c1 'foo'
c2 'bar'
c1 'foo-2'
c1 'foo-145'
c1 'xxx-07'
c2 'zzz'
...

etc.


Update 2


To clarify further:

import pandas as pd

d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

df = pd.DataFrame(d)

    class   val
0   c1  1
1   c2  2
2   c1  1
3   c1  1
4   c2  2
5   c1  1
6   c1  1
7   c2  2
8   c3  3
9   c3  3

df['class'].value_counts()

c1    5
c2    3
c3    2
Name: class, dtype: int64

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()))

        class   val
class           
c1  6   c1  1
    5   c1  1
c2  4   c2  2  
    1   c2  2
c3  9   c3  3
    8   c3  3

This seems to work. My main questions:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:

What is passed to the lambda as x in this case?
What is g in g.size()?
Why does the output contain the numbers 6, 5, 4, 1, 8, 9? What do they mean?

Answer 1

Answered by piRSquared

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min())).reset_index(drop=True)

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3
5    c3    3

Answers to your follow-up questions


The x in the lambda ends up being a DataFrame that is the subset of df represented by the group. Each of these DataFrames, one per group, gets passed through the lambda.
g is the groupby object. I placed it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to groupby anyway, I might as well reuse the same groupby and use size to get the value counts, which saves time.
Those numbers are the index values from df that go with the sampled rows. I added reset_index(drop=True) to get rid of them.
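The three points above can be checked by hand. The following is only a sketch on the toy data from the question: it replaces g.apply with an explicit loop over the groups (my substitution, not the answer's code) so that each x is visible, and random_state=0 is an arbitrary choice made for reproducibility.

```python
import pandas as pd

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})
g = df.groupby('class')

# g.size() counts the rows in each group, like df['class'].value_counts().
sizes = g.size()
n = int(sizes.min())  # size of the smallest class

# Iterating over g yields (class label, sub-DataFrame) pairs. Each
# sub-DataFrame is what the lambda receives as x.
parts = []
for label, x in g:
    print(label, list(x.index))  # row labels of df that belong to this class
    parts.append(x.sample(n, random_state=0))

# The sampled rows keep their original row labels from df, which is where
# the "extra" numbers in the output come from.
balanced = pd.concat(parts)
print(balanced)
```

Calling balanced.reset_index(drop=True) afterwards replaces those row labels with a fresh 0..N-1 index.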

Answer 2

Answered by Samuel Nde

The above answer is correct, but I would like to point out that the g above is not a Pandas DataFrame object, which is most likely what the user wants. It is a pandas.core.groupby.groupby.DataFrameGroupBy object. To see this, try calling head on g; the result is shown below.

import pandas as pd
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

d = pd.DataFrame(d)
g = d.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g.head()
  class  val
0    c1    1
1    c2    2
2    c1    1
3    c1    1
4    c2    2
5    c1    1
6    c1    1
7    c2    2
8    c3    3
9    c3    3

To fix this, we need to turn g into a Pandas DataFrame after grouping the data, by assigning the result of apply back to it, as follows:

g = d.groupby('class')
g = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True)))

Calling head now yields:

g.head()

  class  val
0   c1   1
1   c2   2
2   c1   1
3   c1   1
4   c2   2

This is most likely what the user wants.
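The difference between the groupby object and the sampled result can be seen by printing their types. This is a sketch only: it uses DataFrameGroupBy.sample (available in pandas 1.1 and later, as far as I know) instead of the apply call above, and random_state=0 is an arbitrary choice.

```python
import pandas as pd

d = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = d.groupby('class')
print(type(g).__name__)  # DataFrameGroupBy: a grouping, not the balanced data

# Sampling returns a new DataFrame; g itself is never modified, so the
# result has to be stored under a name to be used later.
n = int(g.size().min())
balanced = g.sample(n=n, random_state=0).reset_index(drop=True)
print(type(balanced).__name__)  # DataFrame
print(balanced['class'].value_counts())
```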

Answer 3

Answered by Jhon Intriago Thoth

This method randomly draws k elements from each class. A class with fewer than k elements is returned unchanged.

def sampling_k_elements(group, k=3):
    if len(group) < k:
        return group
    return group.sample(k)

balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)
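A quick check of how this behaves on the toy data from the question, as a sketch. The random_state parameter is an addition of mine for reproducibility, and the groups are combined with an explicit loop and pd.concat rather than groupby.apply, because recent pandas versions have been changing how apply treats the grouping column.

```python
import pandas as pd

def sampling_k_elements(group, k=3, random_state=None):
    # Groups smaller than k are returned whole instead of raising an error.
    if len(group) < k:
        return group
    return group.sample(k, random_state=random_state)

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

# Apply the helper to every class and stack the pieces back together.
pieces = [sampling_k_elements(x, k=3, random_state=0)
          for _, x in df.groupby('class')]
balanced = pd.concat(pieces).reset_index(drop=True)

counts = balanced['class'].value_counts().sort_index()
print(counts)  # c1: 3, c2: 3, c3: 2
```

Note that c3 has only two rows, so with k=3 the result is not fully balanced. Setting k to the size of the smallest class gives equal counts.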

Answer 4

Answered by Black Panter

"The following code works for undersampling unbalanced classes, but it is rather long; sorry for that. Try it! It also works the same way for upsampling problems. Good luck!"

Import required sampling libraries


from sklearn.utils import resample

Define the majority and minority classes

df_minority9 = df[df['class']=='c9']
df_majority1 = df[df['class']=='c1']
df_majority2 = df[df['class']=='c2']
df_majority3 = df[df['class']=='c3']
df_majority4 = df[df['class']=='c4']
df_majority5 = df[df['class']=='c5']
df_majority6 = df[df['class']=='c6']
df_majority7 = df[df['class']=='c7']
df_majority8 = df[df['class']=='c8']

Undersample the majority classes

maj_class1 = resample(df_majority1,
                      replace=False,     # sample without replacement when undersampling
                      n_samples=1324,    # match the size of the minority class c9
                      random_state=123)  # reproducible results
maj_class2 = resample(df_majority2,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class3 = resample(df_majority3,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class4 = resample(df_majority4,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class5 = resample(df_majority5,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class6 = resample(df_majority6,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class7 = resample(df_majority7,
                      replace=False,
                      n_samples=1324,
                      random_state=123)
maj_class8 = resample(df_majority8,
                      replace=False,
                      n_samples=1324,
                      random_state=123)

Combine the minority class with the undersampled majority classes

df = pd.concat([df_minority9, maj_class1, maj_class2, maj_class3, maj_class4, maj_class5, maj_class6, maj_class7, maj_class8])

Display new balanced class counts


df['class'].value_counts()
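The per-class steps above repeat the same call for every class, so they can be folded into one helper. The sketch below is an assumption-laden rewrite, not the answer's code: resample_classes is a hypothetical name, it uses pandas' own DataFrame.sample in place of sklearn.utils.resample, and it runs on the small toy frame from the question rather than the real data.

```python
import pandas as pd

def resample_classes(df, column, n, replace=False, random_state=123):
    """Draw n rows from every class in `column` (hypothetical helper)."""
    parts = [
        subset.sample(n=n, replace=replace, random_state=random_state)
        for _, subset in df.groupby(column)
    ]
    return pd.concat(parts).reset_index(drop=True)

df = pd.DataFrame({
    'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
    'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})
counts = df['class'].value_counts()

# Undersampling: shrink every class to the size of the smallest one.
under = resample_classes(df, 'class', n=int(counts.min()))

# Upsampling: grow every class to the size of the largest one. Rows must be
# drawn with replacement, because small classes do not have enough rows.
over = resample_classes(df, 'class', n=int(counts.max()), replace=True)

print(under['class'].value_counts())
print(over['class'].value_counts())
```

Undersampling discards rows from the larger classes, while upsampling duplicates rows from the smaller ones; which trade-off is acceptable depends on the task.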
