How to split a DataFrame in pandas in predefined percentages?

Question

Asked by Dimitris P.

I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments.

For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.

How would I achieve that?

Answer 1

Answered by jezrael

Use numpy.split, passing the cumulative row positions of the cut points (20%, then 20% + 30% = 50%):

a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])

Sample:

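import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list('ABCDE'))
# print(df)

# Cut at row 4 (20%) and row 10 (20% + 30%); the remaining 50% goes to c
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print(a)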
          A         B         C         D         E
0  0.543405  0.278369  0.424518  0.844776  0.004719
1  0.121569  0.670749  0.825853  0.136707  0.575093
2  0.891322  0.209202  0.185328  0.108377  0.219697
3  0.978624  0.811683  0.171941  0.816225  0.274074

print(b)
          A         B         C         D         E
4  0.431704  0.940030  0.817649  0.336112  0.175410
5  0.372832  0.005689  0.252426  0.795663  0.015255
6  0.598843  0.603805  0.105148  0.381943  0.036476
7  0.890412  0.980921  0.059942  0.890546  0.576901
8  0.742480  0.630184  0.581842  0.020439  0.210027
9  0.544685  0.769115  0.250695  0.285896  0.852395

print(c)
           A         B         C         D         E
10  0.975006  0.884853  0.359508  0.598859  0.354796
11  0.340190  0.178081  0.237694  0.044862  0.505431
12  0.376252  0.592805  0.629942  0.142600  0.933841
13  0.946380  0.602297  0.387766  0.363188  0.204345
14  0.276765  0.246536  0.173608  0.966610  0.957013
15  0.597974  0.731301  0.340385  0.092056  0.463498
16  0.508699  0.088460  0.528035  0.992158  0.395036
17  0.335596  0.805451  0.754349  0.313066  0.634037
18  0.540405  0.296794  0.110788  0.312640  0.456979
19  0.658940  0.254258  0.641101  0.200124  0.657625
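
Depending on the pandas version, calling np.split on a DataFrame may emit a deprecation warning. As an alternative, here is a minimal sketch of the same positional split using DataFrame.iloc. The helper name split_by_position is hypothetical (it is not part of pandas or NumPy), and it rounds the cut points instead of truncating them, which is an assumption on my side rather than what the one-liner above does.

```python
import numpy as np
import pandas as pd


def split_by_position(df, fracs):
    """Split df into consecutive row blocks (hypothetical helper).

    fracs holds the shares of the leading blocks; the last block
    takes whatever rows remain.
    """
    # Cumulative shares -> integer row positions where each cut falls.
    # round() guards against values such as 28.999999999999996.
    cuts = [int(round(f * len(df))) for f in np.cumsum(fracs)]
    bounds = [0] + cuts + [len(df)]
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))

# 20% and 30% for the first two segments, the remaining 50% for the third
a, b, c = split_by_position(df, [0.2, 0.3])
print(len(a), len(b), len(c))  # 4 6 10
```

Because the blocks are taken by position, the existing sort order of the DataFrame is preserved inside and across the segments.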

Answer 2

Answered by Gal Fridman

I've written a simple function that does the job; it may help you.

Note that it assigns rows to the segments by random sampling, so the segments are not consecutive blocks of a sorted DataFrame.

P.S.:

The fractions must sum to 1.

It returns len(fracs) new DataFrames, so the fractions list can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).

import math

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))

def split_by_fractions(df: pd.DataFrame, fracs: list, random_state: int = 42):
    # Compare with a tolerance: a float sum may differ from 1.0 by rounding error
    assert math.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()  # rows not yet assigned to a segment
    res = []
    for i in range(len(fracs)):
        # Rescale the fraction relative to the rows that are still unassigned
        fractions_sum = sum(fracs[i:])
        frac = fracs[i] / fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]

train, test, val = split_by_fractions(df, [0.8, 0.1, 0.1])  # e.g. [train, test, validation]

print(train.shape, test.shape, val.shape)

Output:

(79, 4) (10, 4) (10, 4)
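
The function above picks rows with sample, so each segment is a random subset. If the sorted order matters and the segments also need names, as in the question, one option is to label each row by its position and group on that label. The following is only a sketch under assumptions: the segment names, the segment column, and the 20/30/50 shares are illustrative choices, and the shares are assumed to sum to 1.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 4)))

# Hypothetical segment names and their shares (assumed to sum to 1)
names = ["first", "second", "third"]
fracs = [0.2, 0.3, 0.5]

# Exclusive upper row position of each segment, from the cumulative shares
upper = np.round(np.cumsum(fracs) * len(df)).astype(int)
upper[-1] = len(df)  # the last segment absorbs any rounding leftovers

# For every row position, find the first segment whose upper bound exceeds it
positions = np.arange(len(df))
codes = np.searchsorted(upper, positions, side="right")

# Attach the label as a new column, then collect one DataFrame per segment
labelled = df.assign(segment=np.array(names)[codes])
segments = {name: part for name, part in labelled.groupby("segment", sort=False)}
print({name: len(part) for name, part in segments.items()})
# {'first': 4, 'second': 6, 'third': 10}
```

Keeping the label as a column means the segments can be recovered later with a plain filter or groupby, without holding separate variables for each piece.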
