Splitting dataframe into multiple dataframes
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/19790790/
Asked by Martin Petri Bagger
I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents). I would like to split the dataframe into 60 dataframes (a dataframe for each participant).
In the dataframe (called = data) there is a variable called 'name' which is the unique code for each participant.
I have tried the following, but nothing happens (or it does not stop within an hour). What I intend to do is to split the dataframe (data) into smaller dataframes and append these to a list (datalist):
import pandas as pd

def splitframe(data, name='name'):
    n = data[name][0]
    df = pd.DataFrame(columns=data.columns)
    datalist = []
    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])
    return datalist
I do not get an error message, the script just seems to run forever!
Is there a smart way to do it?
Accepted answer by EdChum
Firstly, your approach is inefficient because appending to the list row by row will be slow: the list has to be grown periodically when there is insufficient space for a new entry. List comprehensions are better in this respect because the size is determined up front and allocated once.
However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?
I would sort the dataframe by column 'name', set the index to be this, and if required not drop the column.
Then generate a list of all the unique entries; you can then perform a lookup using these entries, and crucially, if you are only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.
So:
# sort the dataframe
df.sort(columns=['name'], inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names = df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name == 'joe']
# now you can query all 'joes'
EDIT

sort is now deprecated; you need to use sort_values now:
# sort the dataframe
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of names
names = df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name == 'joe']
# now you can query all 'joes'
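Putting the above together, a minimal runnable sketch of building one sub-frame per participant (the small sample frame and the dict comprehension are assumptions for illustration, not part of the original answer):

import pandas as pd

# a small stand-in for the question's large frame (assumption)
data = pd.DataFrame({'name': ['anna', 'anna', 'bob', 'bob', 'carl'],
                     'value': [1, 2, 3, 4, 5]})

data = data.sort_values(by='name')
data = data.set_index(keys=['name'], drop=False)

# one sub-frame per unique participant code
participants = {n: data.loc[data.name == n] for n in data['name'].unique()}
print(participants['bob'])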
Answer by Jeff
In [28]: df = DataFrame(np.random.randn(1000000,10))
In [29]: df
Out[29]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0 1000000 non-null values
1 1000000 non-null values
2 1000000 non-null values
3 1000000 non-null values
4 1000000 non-null values
5 1000000 non-null values
6 1000000 non-null values
7 1000000 non-null values
8 1000000 non-null values
9 1000000 non-null values
dtypes: float64(10)
In [30]: frames = [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
In [31]: %timeit [ df.iloc[i*60:min((i+1)*60,len(df))] for i in xrange(int(len(df)/60.) + 1) ]
1 loops, best of 3: 849 ms per loop
In [32]: len(frames)
Out[32]: 16667
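For reference, a sketch of the same chunking in Python 3, where xrange no longer exists (this rewrite is an assumption, not part of the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10))

chunk = 60
# step through the frame in blocks of 60 rows; iloc clips the last slice
frames = [df.iloc[i:i + chunk] for i in range(0, len(df), chunk)]
len(frames)  # 16667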
Here's a groupby way (and you could do an arbitrary apply rather than sum)
In [9]: g = df.groupby(lambda x: x/60)
In [8]: g.sum()
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16667 entries, 0 to 16666
Data columns (total 10 columns):
0 16667 non-null values
1 16667 non-null values
2 16667 non-null values
3 16667 non-null values
4 16667 non-null values
5 16667 non-null values
6 16667 non-null values
7 16667 non-null values
8 16667 non-null values
9 16667 non-null values
dtypes: float64(10)
Sum is cythonized; that's why this is so fast:
In [10]: %timeit g.sum()
10 loops, best of 3: 27.5 ms per loop
In [11]: %timeit df.groupby(lambda x: x/60)
1 loops, best of 3: 231 ms per loop
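(A Python 3 note: lambda x: x/60 relies on Python 2 integer division; in Python 3 the equivalent grouping key would be lambda x: x // 60.)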
Answer by Woody Pride
Can I ask why not just do it by slicing the data frame? Something like:
import numpy as np
import pandas as pd

# create some data with Names column
data = pd.DataFrame({'Names': ['Joe', 'John', 'Jasper', 'Jez'] * 4,
                     'Ob1': np.random.rand(16),
                     'Ob2': np.random.rand(16)})
# create unique list of names
UniqueNames = data.Names.unique()
# create a data frame dictionary to store your data frames
DataFrameDict = {elem: pd.DataFrame() for elem in UniqueNames}

for key in DataFrameDict.keys():
    DataFrameDict[key] = data[:][data.Names == key]
Hey presto you have a dictionary of data frames just as (I think) you want them. Need to access one? Just enter
DataFrameDict['Joe']
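One small note on the selection above: data[:][data.Names == key] first takes a slice of the whole frame before filtering, so data[data.Names == key] is equivalent and avoids the intermediate step.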
Hope that helps
Answer by Gusev Slava
Groupby can help you:
grouped = data.groupby(['name'])
Then you can work with each group as with a dataframe for each participant. DataFrameGroupBy object methods such as apply, transform, aggregate, head, first, and last return a DataFrame object.
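For example, a minimal sketch of iterating over the groups (the frame and column names are taken from the question):

# each iteration yields the group key and the sub-DataFrame for that participant
for name, participant_df in data.groupby('name'):
    print(name, participant_df.shape)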
Or you can make a list from grouped and get all the DataFrames by index:
l_grouped = list(grouped)
l_grouped[0][1]
- DataFrame for the first group, i.e. the first name.
Answer by Quickbeam2k1
In addition to Gusev Slava's answer, you might want to use groupby's groups:
{key: df.loc[value] for key, value in df.groupby("name").groups.items()}
This will yield a dictionary with the keys you have grouped by, pointing to the corresponding partitions. The advantage is that the keys are maintained and don't vanish in the list index.
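For example, if the dictionary above is bound to a name such as partitions (a placeholder name, not from the original answer), partitions['joe'] returns the sub-frame for that key.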
Answer by jezrael
You can convert the groupby object to tuples and then to a dict:
df = pd.DataFrame({'Name': list('aabbef'),
                   'A': [4, 5, 4, 5, 5, 4],
                   'B': [7, 8, 9, 4, 2, 3],
                   'C': [1, 3, 5, 7, 1, 0]}, columns=['Name', 'A', 'B', 'C'])
print(df)
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3
2    b  4  9  5
3    b  5  4  7
4    e  5  2  1
5    f  4  3  0
d = dict(tuple(df.groupby('Name')))
print(d)
{'b': Name A B C
2 b 4 9 5
3 b 5 4 7, 'e': Name A B C
4 e 5 2 1, 'a': Name A B C
0 a 4 7 1
1 a 5 8 3, 'f': Name A B C
5 f 4 3 0}
print(d['a'])
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3
It is not recommended, but it is possible to create DataFrames by groups:
for i, g in df.groupby('Name'):
    globals()['df_' + str(i)] = g

print(df_a)
  Name  A  B  C
0    a  4  7  1
1    a  5  8  3
Answer by Daniel Braun
Easy:
[v for k, v in df.groupby('name')]
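Note that this keeps only the sub-frames and drops the group keys; if you need the keys as well, use one of the dict-based approaches above, e.g. dict(tuple(df.groupby('name'))).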
Answer by Luigi Bungaro
I had a similar problem. I had a time series of daily sales for 10 different stores and 50 different items. I needed to split the original dataframe into 500 dataframes (10 stores * 50 items) to apply Machine Learning models to each of them, and I couldn't do it manually.
This is the head of the dataframe (the original post shows it as an image).
I created two lists: one for the names of the dataframes, and one for the pairs [item_number, store_number].
# names for the dataframes to create
df_names = []
for i in range(1, len(items) * len(stores) + 1):
    df_names.append('df' + str(i))

# all [item, store] pairs
list_couple_s_i = []
for item in items:
    for store in stores:
        list_couple_s_i.append([item, store])
And once the two lists are ready, you can loop over them to create the dataframes you want:
for name, it_st in zip(df_names, list_couple_s_i):
    globals()[name] = df.where((df['item'] == it_st[0]) &
                               (df['store'] == it_st[1]))
    globals()[name].dropna(inplace=True)
In this way I have created 500 dataframes.
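As a side note (a suggestion, not part of the original answer), a groupby on both columns produces the same split without the where/dropna round trip, e.g. dfs = dict(tuple(df.groupby(['item', 'store']))).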
Hope this will be helpful!
Answer by Ram Prajapati
A method based on list comprehension and groupby, which stores all the split dataframes in a list variable that can be accessed using the index.
Example
ans = [pd.DataFrame(y) for x, y in DF.groupby('column_name', as_index=False)]
ans[0]
ans[0].column_name
Answer by idnavid
You can use the groupby command, if you already have some labels for your data.
out_list = [group[1] for group in in_series.groupby(label_series.values)]
Here's a detailed example:
Let's say we want to partition a pd series using some labels into a list of chunks. For example, in_series is:
2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 5, dtype: float64
And its corresponding label_series is:
2019-07-01 08:00:00 1
2019-07-01 08:02:00 1
2019-07-01 08:04:00 2
2019-07-01 08:06:00 2
2019-07-01 08:08:00 2
Length: 5, dtype: float64
Run
out_list = [group[1] for group in in_series.groupby(label_series.values)]
which returns out_list, a list of two pd.Series:
[2019-07-01 08:00:00 -0.10
2019-07-01 08:02:00 1.16
Length: 2, dtype: float64,
2019-07-01 08:04:00 0.69
2019-07-01 08:06:00 -0.81
2019-07-01 08:08:00 -0.64
Length: 3, dtype: float64]
Note that you can use some parameters from in_series itself to group the series, e.g., in_series.index.day.