How to split a DataFrame in pandas in predefined percentages?
Source: http://stackoverflow.com/questions/43777243/
License: this content comes from StackOverFlow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by Dimitris P.
I have a pandas dataframe sorted by a number of columns. Now I'd like to split the dataframe in predefined percentages, so as to extract and name a few segments.
For example, I want to take the first 20% of rows to create the first segment, then the next 30% for the second segment and leave the remaining 50% to the third segment.
How would I achieve that?
Answered by jezrael
Use numpy.split, passing the cumulative cut points (20% of the rows, then 20% + 30% = 50% of the rows):
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
Sample:
import numpy as np
import pandas as pd
np.random.seed(100)
df = pd.DataFrame(np.random.random((20,5)), columns=list('ABCDE'))
#print (df)
# cut after the first 20% and after the first 50% of the rows
a, b, c = np.split(df, [int(.2*len(df)), int(.5*len(df))])
print (a)
A B C D E
0 0.543405 0.278369 0.424518 0.844776 0.004719
1 0.121569 0.670749 0.825853 0.136707 0.575093
2 0.891322 0.209202 0.185328 0.108377 0.219697
3 0.978624 0.811683 0.171941 0.816225 0.274074
print (b)
A B C D E
4 0.431704 0.940030 0.817649 0.336112 0.175410
5 0.372832 0.005689 0.252426 0.795663 0.015255
6 0.598843 0.603805 0.105148 0.381943 0.036476
7 0.890412 0.980921 0.059942 0.890546 0.576901
8 0.742480 0.630184 0.581842 0.020439 0.210027
9 0.544685 0.769115 0.250695 0.285896 0.852395
print (c)
A B C D E
10 0.975006 0.884853 0.359508 0.598859 0.354796
11 0.340190 0.178081 0.237694 0.044862 0.505431
12 0.376252 0.592805 0.629942 0.142600 0.933841
13 0.946380 0.602297 0.387766 0.363188 0.204345
14 0.276765 0.246536 0.173608 0.966610 0.957013
15 0.597974 0.731301 0.340385 0.092056 0.463498
16 0.508699 0.088460 0.528035 0.992158 0.395036
17 0.335596 0.805451 0.754349 0.313066 0.634037
18 0.540405 0.296794 0.110788 0.312640 0.456979
19 0.658940 0.254258 0.641101 0.200124 0.657625
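Depending on your pandas version, calling np.split directly on a DataFrame may emit a deprecation warning or fail, because (as far as I know) it goes through DataFrame.swapaxes internally. Here is a minimal sketch of the same positional split using only iloc. The helper name split_by_position and the segment names are made up for illustration and are not part of the answer above. It rounds the cut points rather than truncating them with int(), which is a choice of this sketch.

```python
import numpy as np
import pandas as pd


def split_by_position(df, fractions):
    """Split df into consecutive row blocks sized by `fractions` (hypothetical helper)."""
    # Cumulative fractions give the row position where each block ends.
    # round() is used here instead of int() to avoid float drift (sketch's choice).
    ends = [int(round(float(c) * len(df))) for c in np.cumsum(fractions)]
    ends[-1] = len(df)  # the last block always takes whatever rows remain
    starts = [0] + ends[:-1]
    return [df.iloc[s:e] for s, e in zip(starts, ends)]


np.random.seed(100)
df = pd.DataFrame(np.random.random((20, 5)), columns=list("ABCDE"))

# Illustrative segment names; row order of the sorted frame is preserved
names = ["first", "second", "third"]
segments = dict(zip(names, split_by_position(df, [0.2, 0.3, 0.5])))
for name, part in segments.items():
    print(name, part.shape)
```

With the 20-row sample frame this gives blocks of 4, 6 and 10 rows, matching a, b and c above, and the dict gives each segment a name.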
Answered by Gal Fridman
I've written a simple function that does the job. It might help you.
P.S.:
- The fractions must sum to 1.
- It returns len(fracs) new DataFrames, so the fractions list can be as long as you want (e.g. fracs=[0.1, 0.1, 0.3, 0.2, 0.3]).
import math
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((99,4)))

def split_by_fractions(df:pd.DataFrame, fracs:list, random_state:int=42):
    # compare with a tolerance: an exact == 1.0 check can fail on float rounding
    assert math.isclose(sum(fracs), 1.0), 'fractions sum is not 1.0 (fractions_sum={})'.format(sum(fracs))
    remain = df.index.copy().to_frame()
    res = []
    for i in range(len(fracs)):
        # share of the still-unassigned rows that goes to part i
        fractions_sum = sum(fracs[i:])
        frac = fracs[i]/fractions_sum
        idxs = remain.sample(frac=frac, random_state=random_state).index
        remain = remain.drop(idxs)
        res.append(idxs)
    return [df.loc[idxs] for idxs in res]

train,test,val = split_by_fractions(df, [0.8,0.1,0.1]) # e.g: [train, test, validation]
print(train.shape, test.shape, val.shape)
outputs:
(79, 4) (10, 4) (10, 4)
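For comparison, a random split by fractions can also be sketched by shuffling the rows once and then slicing by position. This is not the function above, just an alternative under the same assumptions (fractions sum to 1, one part per fraction). The name shuffle_and_split is hypothetical, and part sizes come from rounded cumulative cut points, so they may differ by a row from the function above for some inputs.

```python
import numpy as np
import pandas as pd


def shuffle_and_split(df, fractions, random_state=42):
    """Randomly partition df into len(fractions) parts (hypothetical helper)."""
    if not np.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")
    # Shuffle all rows once; frac=1 returns every row in random order
    shuffled = df.sample(frac=1, random_state=random_state)
    # Cumulative fractions give the position where each part ends
    ends = [int(round(float(c) * len(df))) for c in np.cumsum(fractions)]
    ends[-1] = len(df)  # the last part takes whatever rows remain
    starts = [0] + ends[:-1]
    return [shuffled.iloc[s:e] for s, e in zip(starts, ends)]


np.random.seed(100)
df = pd.DataFrame(np.random.random((99, 4)))
train, test, val = shuffle_and_split(df, [0.8, 0.1, 0.1])
print(train.shape, test.shape, val.shape)
```

Note that both this sketch and the function above assign rows at random, so they do not keep the original sort order. If the segments must follow the existing row order, a positional split is the better fit.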