Python: Splitting a dataframe into multiple dataframes

Disclaimer: the content below is from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/19790790/


Splitting dataframe into multiple dataframes

Tags: python, split, pandas, dataframe

Asked by Martin Petri Bagger

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents). I would like to split the dataframe into 60 dataframes (a dataframe for each participant).


In the dataframe (called = data) there is a variable called 'name' which is the unique code for each participant.


I have tried the following, but nothing happens (or the execution does not stop within an hour). What I intend to do is to split the dataframe (data) into smaller dataframes and append these to a list (datalist):


import pandas as pd

def splitframe(data, name='name'):

    n = data[name][0]

    df = pd.DataFrame(columns=data.columns)

    datalist = []

    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])

    return datalist

I do not get an error message; the script just seems to run forever!


Is there a smart way to do it?


Accepted answer by EdChum

Firstly, your approach is inefficient because appending to the list row by row will be slow: the list has to be grown periodically when there is insufficient space for a new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.


However, I think fundamentally your approach is a little wasteful: you already have a dataframe, so why create a new one for each of these users?


I would sort the dataframe by the 'name' column, set this as the index and, if required, not drop the column.


Then generate a list of all the unique entries, and you can perform a lookup using these entries. Crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.


So:


# sort the dataframe
df.sort(columns=['name'], inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'

EDIT

编辑

sort is now deprecated, you need to use sort_values now:


# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)
# get a list of names
names=df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']
# now you can query all 'joes'
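
With the index in place, you can materialize one dataframe per participant in a single pass. A minimal sketch (the dict name frames is chosen here purely for illustration):

# build a dict of per-participant dataframes, keyed by name
frames = {name: df.loc[df.name == name] for name in names}
# frames['joe'] is then the dataframe for participant 'joe'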

Answered by Jeff

In [28]: df = DataFrame(np.random.randn(1000000,10))

In [29]: df
Out[29]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]

In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop

In [32]: len(frames)
Out[32]: 16667

Here's a groupby way (and you could do an arbitrary apply rather than sum)


In [9]: g = df.groupby(lambda x: x/60)

In [8]: g.sum()    

Out[8]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0    16667  non-null values
1    16667  non-null values
2    16667  non-null values
3    16667  non-null values
4    16667  non-null values
5    16667  non-null values
6    16667  non-null values
7    16667  non-null values
8    16667  non-null values
9    16667  non-null values
dtypes: float64(10)

Sum is cythonized, which is why this is so fast:


In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop

In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
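
The arbitrary apply mentioned above would go through the same grouped object, e.g. taking per-chunk means instead of sums (a sketch):

chunk_means = g.apply(lambda chunk: chunk.mean())  # one result row per 60-row chunk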

Answered by Woody Pride

Can I ask why not just do it by slicing the data frame? Something like:


import pandas as pd
import numpy as np

# create some data with a Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16),
                     'Ob2': np.random.rand(16)})

#create unique list of names
UniqueNames = data.Names.unique()

#create a data frame dictionary to store your data frames
DataFrameDict = {elem : pd.DataFrame for elem in UniqueNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]

Hey presto, you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter:


DataFrameDict['Joe']

Hope that helps


Answered by Gusev Slava

Groupby can help you:


grouped = data.groupby(['name'])

Then you can work with each group as with a dataframe for each participant. DataFrameGroupBy methods such as apply, transform, aggregate, head, first and last return a DataFrame object.
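
For example, a single participant's dataframe can be pulled straight out of the grouped object (a sketch, assuming a participant code 'joe' exists in the 'name' column):

joe = grouped.get_group('joe')
# on newer pandas, grouping by a list may require get_group(('joe',))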


Or you can make a list from grouped and get all the DataFrames by index:


l_grouped = list(grouped)

l_grouped[0][1] - the DataFrame for the first group (keyed by the first name).


Answered by Quickbeam2k1

In addition to Gusev Slava's answer, you might want to use groupby's groups:


{key: df.loc[value] for key, value in df.groupby("name").groups.items()}

This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.

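For example (assuming a value 'joe' exists in the 'name' column):

partitions = {key: df.loc[value] for key, value in df.groupby("name").groups.items()}
partitions['joe']  # the rows where name == 'joe', still reachable by key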

Answered by jezrael

You can convert the groupby object to tuples and then to a dict:


df = pd.DataFrame({'Name':list('aabbef'),
                   'A':[4,5,4,5,5,4],
                   'B':[7,8,9,4,2,3],
                   'C':[1,3,5,7,1,0]}, columns = ['Name','A','B','C'])

print (df)
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3
2    b  4  9  5
3    b  5  4  7
4    e  5  2  1
5    f  4  3  0

d = dict(tuple(df.groupby('Name')))
print (d)
{'b':   Name  A  B  C
2    b  4  9  5
3    b  5  4  7, 'e':   Name  A  B  C
4    e  5  2  1, 'a':   Name  A  B  C
0    a  4  7  1
1    a  5  8  3, 'f':   Name  A  B  C
5    f  4  3  0}

print (d['a'])
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3

It is not recommended, but it is possible to create DataFrames by group:


for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] =  g

print (df_a)
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3

Answered by Daniel Braun

Easy:


[v for k, v in df.groupby('name')]

Answered by Luigi Bungaro

I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply Machine Learning models to each of them, and I couldn't do it manually.


This is the head of the dataframe:

[image: head of the dataframe df]

I have created two lists; one for the names of the dataframes and one for the [item_number, store_number] pairs.


df_names = []  # names for the dataframes to create: 'df1', 'df2', ...
for i in range(1, len(items) * len(stores) + 1):
    df_names.append('df' + str(i))

list_couple_s_i = []  # all [item, store] pairs
for item in items:
    for store in stores:
        list_couple_s_i.append([item, store])

And once the two lists are ready, you can loop over them to create the dataframes you want:


for name, it_st in zip(df_names, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)

In this way I have created 500 dataframes.
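
As an aside, a plain groupby into a dict would give the same split without touching globals(); a minimal sketch assuming the same 'item' and 'store' columns:

# one dataframe per (item, store) pair, keyed by that pair
frames = {pair: g for pair, g in df.groupby(['item', 'store'])}
# no dropna needed: groupby selects rows instead of masking them with NaN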


Hope this will be helpful!


Answered by Ram Prajapati

A method based on list comprehension and groupby, which stores all the split dataframes in a list variable that can be accessed using the index.


Example


ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]

ans[0]
ans[0].column_name

Answered by idnavid

You can use the groupby command, if you already have some labels for your data.


 out_list = [group[1] for group in in_series.groupby(label_series.values)]

Here's a detailed example:


Let's say we want to partition a pd series using some labels into a list of chunks. For example, in_series is:


2019-07-01 08:00:00   -0.10
2019-07-01 08:02:00    1.16
2019-07-01 08:04:00    0.69
2019-07-01 08:06:00   -0.81
2019-07-01 08:08:00   -0.64
Length: 5, dtype: float64

And its corresponding label_series is:


2019-07-01 08:00:00   1
2019-07-01 08:02:00   1
2019-07-01 08:04:00   2
2019-07-01 08:06:00   2
2019-07-01 08:08:00   2
Length: 5, dtype: float64

Run

out_list = [group[1] for group in in_series.groupby(label_series.values)]

which returns out_list, a list of two pd.Series:


[2019-07-01 08:00:00   -0.10
2019-07-01 08:02:00   1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00    0.69
2019-07-01 08:06:00   -0.81
2019-07-01 08:08:00   -0.64
Length: 3, dtype: float64]

Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day

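For instance, splitting by day would look like this (a sketch assuming in_series has a DatetimeIndex):

daily_chunks = [group for _, group in in_series.groupby(in_series.index.day)]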