Python: Splitting a large pandas DataFrame

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/17315737/

Date: 2020-08-19 07:51:51  Source: igfitidea

Split a large pandas dataframe

Tags: python, pandas

Asked by Nilani Algiriyage

I have a large dataframe with 423244 rows. I want to split it into 4. I tried the following code, which gave an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this dataframe into 4 groups?

Accepted answer by root

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.


In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
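
The difference between the two functions can be seen on a plain array of row positions. The following is a minimal sketch, not part of the original answer: the 10-row frame and the names `positions`, `parts` and `frames` are made up for illustration. It splits the positions rather than the DataFrame itself and then selects rows with `iloc`, which is one way to get the same uneven split.

```python
import numpy as np
import pandas as pd

positions = np.arange(10)  # row positions of a hypothetical 10-row frame

# np.split insists on equal parts, so 10 items into 4 sections raises ValueError.
try:
    np.split(positions, 4)
    split_failed = False
except ValueError:
    split_failed = True

# np.array_split allows uneven parts; the leading parts receive the extra items.
parts = np.array_split(positions, 4)
part_sizes = [len(p) for p in parts]

# Use each array of positions to select rows from a DataFrame (illustrative data).
df = pd.DataFrame({"value": range(10)})
frames = [df.iloc[p] for p in parts]
print(split_failed, part_sizes, [len(f) for f in frames])
```

Splitting positions instead of the frame keeps the example independent of how a particular NumPy or pandas version treats a DataFrame passed to `np.array_split`.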

Answer by yemu

Caution:

np.array_split doesn't work with numpy-1.9.0. I checked: it works with 1.8.1.

Error:

Dataframe has no 'size' attribute

Answer by elixir

I wanted to do the same. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this can help!

# input - df: a DataFrame, chunkSize: the chunk size
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller DataFrames of at most chunkSize rows (the last may be smaller)
def splitDataFrameIntoSmaller(df, chunkSize = 10000):
    listOfDf = list()
    # ceiling division, so no empty trailing chunk when len(df) is a multiple of chunkSize
    numberChunks = (len(df) + chunkSize - 1) // chunkSize
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf

Answer by Gilberto

Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while splitDataFrameIntoSmaller(df, chunkSize = 3) splits the dataframe every chunkSize rows.

Example:

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

You get 3 sub-dataframes:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With:

df_split2 = splitDataFrameIntoSmaller(df, chunkSize = 3)

You get 4 sub-dataframes:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

Hope I'm right, and hope this is useful.
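
The two behaviours (a fixed number of parts versus a fixed chunk size) can be put side by side. The sketch below is illustrative only: `split_into_n_parts` and `split_by_chunk_size` are hypothetical helper names, not functions from NumPy or pandas, and both rely solely on `iloc` slicing.

```python
import pandas as pd

def split_into_n_parts(df, n_parts):
    """Return n_parts frames whose lengths differ by at most one row."""
    base, extra = divmod(len(df), n_parts)
    frames, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts each take one additional row.
        stop = start + base + (1 if i < extra else 0)
        frames.append(df.iloc[start:stop])
        start = stop
    return frames

def split_by_chunk_size(df, chunk_size):
    """Return frames of chunk_size rows; only the last one may be shorter."""
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

df = pd.DataFrame({"TEST": range(1, 12)})  # 11 rows, as in the example above
by_parts = [len(f) for f in split_into_n_parts(df, 3)]
by_size = [len(f) for f in split_by_chunk_size(df, 3)]
print(by_parts, by_size)
```

With 11 rows, asking for 3 parts gives lengths 4, 4, 3, while asking for chunks of 3 rows gives four frames of lengths 3, 3, 3, 2.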

Answer by rumpel

You can use groupby, assuming you have an integer enumerated index:

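import math
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)

# each item yielded by groupby is a (key, dataframe) pair; keep only the dataframe
subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]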

Note: iterating over groupby yields tuples in which the 2nd element is the dataframe, thus the slightly complicated extraction.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])

Answer by pratpor

I guess now we can use plain iloc with range for this.

import math

chunk_size = math.ceil(df.shape[0] / 4)  # round up so there are at most 4 chunks
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
    ....
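
The same iloc-and-range idea can be wrapped in a generator so that chunks are produced one at a time. This is a sketch under assumptions: `iter_chunks` is a hypothetical name and the 10-row frame is made up. It rounds the chunk size up, because rounding down can produce one more chunk than requested when the row count is not a multiple of the chunk count.

```python
import math
import pandas as pd

def iter_chunks(df, n_chunks):
    """Yield at most n_chunks consecutive row slices of df."""
    # Round up (and never go below 1) so range() gets a valid, large enough step.
    chunk_size = max(1, math.ceil(len(df) / n_chunks))
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

df = pd.DataFrame({"value": range(10)})

# Truncating 10 / 4 to 2 rows per chunk would give five chunks, not four.
truncated_count = len(range(0, len(df), int(len(df) / 4)))

# Rounding up to 3 rows per chunk gives four chunks, the last one shorter.
ceil_sizes = [len(chunk) for chunk in iter_chunks(df, 4)]
print(truncated_count, ceil_sizes)
```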

Answer by Martin Alexandersson

I also found that np.array_split did not work with a pandas DataFrame. My solution was to split only the index of the DataFrame and then introduce a new column with the "group" label:

indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

This makes groupby operations very convenient, for instance calculating the mean value of each group:

df.groupby(by='group').mean()
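
A small self-contained variant of this labelling idea is sketched below. The data, the value of `N` and the column name `value` are assumptions chosen for illustration. It splits row positions (not index labels), so it should also behave when the index contains duplicates, and it stores the labels as integers.

```python
import numpy as np
import pandas as pd

N = 3  # number of groups, chosen for illustration
df = pd.DataFrame({"value": np.arange(10, dtype=float)})

# Split the row positions into N nearly equal runs and label each row with its run.
labels = np.empty(len(df), dtype=int)
for i, positions in enumerate(np.array_split(np.arange(len(df)), N)):
    labels[positions] = i
df["group"] = labels

# Per-group statistics then follow from an ordinary groupby.
group_means = df.groupby("group")["value"].mean()
print(group_means)
```

Here the 10 rows fall into groups of 4, 3 and 3 rows, and `group_means` holds one mean per group.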