How to split a DataFrame in pandas in predefined percentages?

Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/43777243/

Date: 2020-09-14 03:31:59  Source: igfitidea

python-3.x, pandas

Asked by Dimitris P.

I have a pandas DataFrame sorted by a number of columns. Now I'd like to split it into predefined percentages, so as to extract and name a few segments.

For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.

How would I achieve that?

Answered by jezrael

Use numpy.split:

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
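The two integers passed to np.split are cumulative cut positions, not segment sizes: the first 20% of rows end at int(.2*len(df)), and the next 30% end at int(.5*len(df)). A minimal sketch of deriving those positions from the percentages themselves follows. The helper name `cut_points` is hypothetical (it is not part of NumPy or pandas), and it rounds rather than truncates, which is a choice of this sketch and not of the answer above.

```python
import numpy as np

def cut_points(n_rows, fracs):
    """Return the interior cut positions for the given fractions.

    Hypothetical helper for illustration only.
    """
    # Cumulative shares, e.g. [0.2, 0.3, 0.5] -> [0.2, 0.5, 1.0]
    cumulative = np.cumsum(fracs)
    # Drop the final entry: np.split only needs the interior boundaries.
    # Rounding guards against a product like 9.999999 truncating to 9.
    return [int(round(c * n_rows)) for c in cumulative[:-1]]

points = cut_points(20, [0.2, 0.3, 0.5])
# Demonstrated on a plain NumPy array of row positions.
parts = np.split(np.arange(20), points)
print(points)                   # [4, 10]
print([len(p) for p in parts])  # [4, 6, 10]
```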

Sample:

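import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))
# print(df)

# Cut positions are cumulative: rows 0-3 (20%), rows 4-9 (next 30%), rows 10-19 (rest)
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)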
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print (b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print (c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
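Since the question also asks to name the segments, the same positional cut can be sketched with DataFrame.iloc, which keeps the sorted row order and avoids passing a DataFrame to np.split (newer pandas versions may emit a deprecation warning for that usage). The function name `split_in_order` and the segment labels are assumptions made for this illustration, not something from the answer above.

```python
import numpy as np
import pandas as pd

def split_in_order(df, named_fracs):
    """Split df top to bottom into named segments.

    named_fracs maps a segment name to its share of rows, e.g.
    {"first": 0.2, "second": 0.3, "third": 0.5}. Hypothetical helper.
    """
    n = len(df)
    names = list(named_fracs)
    # Cumulative end position of each segment.
    ends = [int(round(c * n)) for c in np.cumsum(list(named_fracs.values()))]
    # Force the last boundary to n so no row is lost to rounding.
    ends[-1] = n
    starts = [0] + ends[:-1]
    return {name: df.iloc[s:e] for name, s, e in zip(names, starts, ends)}

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))
segments = split_in_order(df, {"first": 0.2, "second": 0.3, "third": 0.5})
for name, part in segments.items():
    print(name, part.shape)
```

With 20 rows this yields segments of 4, 6 and 10 rows, matching a, b and c in the sample above.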

Answered by Gal Fridman

I've written a simple function that does the job; it may help you.

P.S.:

  • The fractions must sum to 1.
  • It returns len(fracs) new DataFrames, so the list of fractions can be as long as you like (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).

    import math

    import numpy as np
    import pandas as pd

    np.random.seed(100)
    df = pd.DataFrame(np.random.random((99, 4)))

    def split_by_fractions(df: pd.DataFrame, fracs: list, random_state: int = 42):
        # Compare with a tolerance: float sums are not exact (e.g. 0.1 + 0.2 != 0.3).
        assert math.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
        remain = df.index.copy().to_frame()  # rows not yet assigned to a part
        res = []
        for i in range(len(fracs)):
            # Rescale the fraction relative to the rows that are still left.
            fractions_sum = sum(fracs[i:])
            frac = fracs[i] / fractions_sum
            idxs = remain.sample(frac=frac, random_state=random_state).index
            remain = remain.drop(idxs)
            res.append(idxs)
        return [df.loc[idxs] for idxs in res]

    train, test, val = split_by_fractions(df, [0.8, 0.1, 0.1])  # order: train, test, validation

    print(train.shape, test.shape, val.shape)

    Output:

    (79, 4) (10, 4) (10, 4)
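Note that this function draws rows at random, so the parts do not follow the DataFrame's sorted order. A comparable random split can be sketched by shuffling once with DataFrame.sample(frac=1) and then slicing by position. The helper name `shuffle_split` is an assumption for illustration; it picks different rows than the function above, and part sizes may differ by a row for other inputs because the rounding is done differently.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, fracs, random_state=42):
    """Shuffle df once, then cut it into len(fracs) consecutive parts.

    Hypothetical helper for illustration only.
    """
    # frac=1 returns every row exactly once, in random order.
    shuffled = df.sample(frac=1, random_state=random_state)
    n = len(shuffled)
    # Cumulative end position of each part; the last is forced to n.
    ends = [int(round(c * n)) for c in np.cumsum(fracs)]
    ends[-1] = n
    starts = [0] + ends[:-1]
    return [shuffled.iloc[s:e] for s, e in zip(starts, ends)]

np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))
train, test, val = shuffle_split(df, [0.8, 0.1, 0.1])
print(train.shape, test.shape, val.shape)  # (79, 4) (10, 4) (10, 4)
```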
    