Pandas Split DataFrame using row index

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/53391378/

Time: 2020-09-14 06:09:55  Source: igfitidea

Tags: python, pandas, dataframe, pandas-groupby

Asked by Pradeep Tummala

I want to split a DataFrame into chunks with uneven numbers of rows, using row index positions.

The code below:

groups = df.groupby((np.arange(len(df.index))/l[1]).astype(int))

works only when every chunk has the same number of rows.

df

a b c
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8

l = [2, 5, 7]

df1
1 1 1
2 2 2

df2
3 3 3
4 4 4
5 5 5

df3
6 6 6
7 7 7

df4
8 8 8

Answered by Scott Boston

You could use a list comprehension, after first making a small modification to your list, l.

print(df)

   a  b  c
0  1  1  1
1  2  2  2
2  3  3  3
3  4  4  4
4  5  5  5
5  6  6  6
6  7  7  7
7  8  8  8


l = [2,5,7]
l_mod = [0] + l + [max(l)+1]

list_of_dfs = [df.iloc[l_mod[n]:l_mod[n+1]] for n in range(len(l_mod)-1)]

Output:

list_of_dfs[0]

   a  b  c
0  1  1  1
1  2  2  2

list_of_dfs[1]

   a  b  c
2  3  3  3
3  4  4  4
4  5  5  5

list_of_dfs[2]

   a  b  c
5  6  6  6
6  7  7  7

list_of_dfs[3]

   a  b  c
7  8  8  8
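
The end point `max(l)+1` above happens to equal the length of this 8-row frame. A minimal sketch of a more general variant, assuming the split points are sorted positional offsets inside the frame, uses `len(df)` as the final boundary instead. The helper name `split_at` is hypothetical and not part of pandas:

```python
import numpy as np
import pandas as pd


def split_at(df, points):
    """Split df into consecutive chunks at the given row positions.

    Hypothetical helper; assumes `points` is sorted and within range.
    """
    bounds = [0] + list(points) + [len(df)]
    # Pair each boundary with the next one and slice by position.
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


df = pd.DataFrame({c: np.arange(1, 9) for c in "abc"})
chunks = split_at(df, [2, 5, 7])
print([len(chunk) for chunk in chunks])  # [2, 3, 2, 1]
```

If the last split point equals `len(df)`, this sketch returns an empty final chunk, which a caller may want to filter out.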

Answered by Mohit Motwani

I think this is what you need:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1, 8),
                   'b': np.arange(1, 8),
                   'c': np.arange(1, 8)})
df
    a   b   c
0   1   1   1
1   2   2   2
2   3   3   3
3   4   4   4
4   5   5   5
5   6   6   6
6   7   7   7

last_check = 0
dfs = []
for ind in [2, 5, 7]:
    # .loc slices by label and includes the end label, hence ind-1
    dfs.append(df.loc[last_check:ind-1])
    last_check = ind

Although list comprehensions are generally more efficient than a for loop, the last_check variable is necessary if your list of indices follows no regular pattern.

dfs[0]

    a   b   c
0   1   1   1
1   2   2   2

dfs[2]

    a   b   c
5   6   6   6
6   7   7   7
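
The loop above slices with `.loc`, which works by label and so relies on the default integer index. As a sketch under the assumption that the split points are meant as row positions, the same loop can use `.iloc`, which also works when the index labels are not integers. The string index below is made up for illustration:

```python
import numpy as np
import pandas as pd

# A frame whose index labels are strings, so slicing .loc with
# integer bounds would not apply here.
df = pd.DataFrame({"a": np.arange(1, 8)}, index=list("tuvwxyz"))

last_check = 0
dfs = []
for ind in [2, 5, 7]:
    # .iloc slices by position and excludes the end point, so no "- 1" is needed.
    dfs.append(df.iloc[last_check:ind])
    last_check = ind

print([chunk.index.tolist() for chunk in dfs])
# [['t', 'u'], ['v', 'w', 'x'], ['y', 'z']]
```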

Answered by Mohamed Thasin ah

I think this is what you are looking for:

l = [2, 5, 7]
dfs = []
for i, val in enumerate(l):
    # The first chunk starts at row 0; later chunks start at the previous split point.
    start = 0 if i == 0 else l[i-1]
    dfs.append(df.iloc[start:val])

# Rows at position l[-1] and beyond are not collected.
for chunk in dfs:
    print(chunk)

Output:

   a  b  c
0  1  1  1
1  2  2  2
   a  b  c
2  3  3  3
3  4  4  4
4  5  5  5
   a  b  c
5  6  6  6
6  7  7  7

Another solution:

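l = [2, 5, 7]
# Build one group label per row: each row gets the split point that ends its chunk.
t = np.arange(l[-1])
# Assign from the largest split point down so smaller ones overwrite the front.
for val in reversed(l):
    t[:val] = val
temp = pd.DataFrame(t)
# Attach the labels as an extra column named 0, then group on it.
temp = pd.concat([df, temp], axis=1)
for u, v in temp.groupby(0):
    print(v)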

Output:

   a  b  c  0
0  1  1  1  2
1  2  2  2  2
   a  b  c  0
2  3  3  3  5
3  4  4  4  5
4  5  5  5  5
   a  b  c  0
5  6  6  6  7
6  7  7  7  7
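
The per-row labels in this second solution could also be built without a helper column. The following is only a sketch of that idea, assuming `l` is sorted and holds row positions; it uses `np.searchsorted` to count, for each row, how many split points lie at or before it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({c: np.arange(1, 8) for c in "abc"})
l = [2, 5, 7]

# For each row position, count how many split points are <= that position.
# Rows 0-1 get label 0, rows 2-4 get label 1, rows 5-6 get label 2.
labels = np.searchsorted(l, np.arange(len(df)), side="right")

for _, chunk in df.groupby(labels):
    print(chunk)
```

Because the labels are passed straight to `groupby`, the printed chunks keep only the original columns a, b and c.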

Answered by jpp

You can create an array to use for indexing via NumPy:

import pandas as pd, numpy as np

df = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list('abc'))

L = [2, 5, 7]
# np.isin is the current NumPy name for the membership test done by np.in1d
idx = np.cumsum(np.isin(np.arange(len(df.index)), L))

for _, chunk in df.groupby(idx):
    print(chunk, '\n')

   a  b  c
0  0  1  2
1  3  4  5 

    a   b   c
2   6   7   8
3   9  10  11
4  12  13  14 

    a   b   c
5  15  16  17
6  18  19  20 

    a   b   c
7  21  22  23 
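
To see why the grouping array works, it may help to look at its intermediate values. The sketch below assumes an 8-row frame and uses `np.isin` for the membership test; the names `positions` and `starts` are introduced here for illustration only:

```python
import numpy as np

L = [2, 5, 7]
positions = np.arange(8)

# True exactly at the positions where a new chunk should start.
starts = np.isin(positions, L)
# The running count of True values gives every row a chunk number.
idx = np.cumsum(starts)

print(starts.astype(int))  # [0 0 1 0 0 1 0 1]
print(idx)                 # [0 0 1 1 1 2 2 3]
```

Rows that share a value in `idx` end up in the same group, which is what yields chunks of 2, 3, 2 and 1 rows.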

Instead of defining a new variable for each dataframe, you can use a dictionary:

d = dict(tuple(df.groupby(idx)))

print(d[1])  # print second groupby value

    a   b   c
2   6   7   8
3   9  10  11
4  12  13  14