Python: splitting a large pandas DataFrame

Question

Asked by Nilani Algiriyage

I have a large dataframe with 423244 rows. I want to split it into 4. I tried the following code, which raised an error: ValueError: array split does not result in an equal division

for item in np.split(df, 4):
    print(item)

How can I split this dataframe into 4 groups?

Answer 1

Accepted answer by root

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
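
A minimal sketch of the same idea that does not rely on np.array_split accepting a DataFrame directly: only the integer row positions are split, and each block is selected with iloc. The helper name split_into_parts and the sample frame are hypothetical, for illustration only.

```python
import numpy as np
import pandas as pd


def split_into_parts(df, n_parts):
    """Split df into n_parts consecutive pieces of nearly equal length.

    Hypothetical helper: np.array_split only sees the row positions,
    and the DataFrame itself is indexed with iloc.
    """
    position_blocks = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[block] for block in position_blocks]


df = pd.DataFrame({"TEST": range(10)})
parts = split_into_parts(df, 4)
sizes = [len(part) for part in parts]  # the leading parts receive the extra rows
```

When the row count is not a multiple of n_parts, the first few parts are one row longer than the rest, which matches how np.array_split distributes the remainder.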

Answer 2

Answer by yemu

Caution:

np.array_split doesn't work with numpy-1.9.0. I checked: it works with 1.8.1.

Error:

Dataframe has no 'size' attribute

Answer 3

Answer by elixir

I wanted to do the same. I first had problems with the split function, then problems installing pandas 0.15.2, so I went back to my old version and wrote a little function that works very well. I hope this helps!

# input - df: a DataFrame, chunkSize: the chunk size
# output - a list of DataFrames
# purpose - splits the DataFrame into smaller DataFrames of at most chunkSize rows (the last may be smaller)
def splitDataFrameIntoSmaller(df, chunkSize = 10000):
    listOfDf = list()
    # ceiling division, so no empty trailing chunk when len(df) is a multiple of chunkSize
    numberChunks = (len(df) + chunkSize - 1) // chunkSize
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf

Answer 4

Answer by Gilberto

Be aware that np.array_split(df, 3) splits the dataframe into 3 sub-dataframes, while splitDataFrameIntoSmaller(df, chunkSize = 3) splits the dataframe every chunkSize rows.

Example:

df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11], columns=['TEST'])
df_split = np.array_split(df, 3)

You get 3 sub-dataframes:

df_split[0] # 1, 2, 3, 4
df_split[1] # 5, 6, 7, 8
df_split[2] # 9, 10, 11

With:

df_split2 = splitDataFrameIntoSmaller(df, chunkSize = 3)

You get 4 sub-dataframes:

df_split2[0] # 1, 2, 3
df_split2[1] # 4, 5, 6
df_split2[2] # 7, 8, 9
df_split2[3] # 10, 11

I hope this is right and that it is useful.

Answer 5

Answer by rumpel

You can use groupby, assuming you have an integer enumerated index:

import math
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(sample=np.arange(99)))
rows_per_subframe = math.ceil(len(df) / 4.)

subframes = [i[1] for i in df.groupby(np.arange(len(df))//rows_per_subframe)]

Note: iterating over a groupby yields tuples in which the 2nd element is the dataframe, hence the slightly complicated extraction.

>>> len(subframes), [len(i) for i in subframes]
(4, [25, 25, 25, 24])
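
Because the grouping key is built from np.arange(len(df)) rather than from the index itself, the same approach can be sketched for a frame whose index is not an integer range. The frame and its labels below are made up for illustration.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical frame with a string index instead of 0..n-1
df = pd.DataFrame({"sample": np.arange(10)}, index=list("abcdefghij"))

rows_per_subframe = math.ceil(len(df) / 4)  # 3 rows per group here

# Group by row position, not by index label
position_key = np.arange(len(df)) // rows_per_subframe
subframes = [frame for _, frame in df.groupby(position_key)]
sizes = [len(frame) for frame in subframes]
```

With 10 rows and 4 requested groups this gives sizes 3, 3, 3, 1, so the last group can be noticeably smaller than the others; np.array_split would give 3, 3, 2, 2 instead.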

Answer 6

Answer by pratpor

I guess we can now use plain iloc with range for this.

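import math

# round up so that at most 4 chunks are produced; rounding down can leave a short 5th chunk
chunk_size = max(1, math.ceil(df.shape[0] / 4))
for start in range(0, df.shape[0], chunk_size):
    df_subset = df.iloc[start:start + chunk_size]
    process_data(df_subset)
    # ...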
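
If the chunks are only processed one at a time, the loop above can be wrapped in a generator so that no list of sub-frames is built up front. A minimal sketch; iter_chunks is a hypothetical name, not a pandas function.

```python
import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive slices of df with at most chunk_size rows each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]


df = pd.DataFrame({"TEST": range(1, 12)})  # 11 rows
chunk_lengths = [len(chunk) for chunk in iter_chunks(df, 3)]
```

Here the chunk size is fixed and the number of chunks follows from it, which is the opposite convention to np.array_split, where the number of parts is fixed.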

Answer 7

Answer by Martin Alexandersson

I also experienced np.array_split not working with a pandas DataFrame. My solution was to split only the index of the DataFrame and then introduce a new column with the "group" label:

# N is the number of groups
indexes = np.array_split(df.index, N, axis=0)
for i, index in enumerate(indexes):
    df.loc[index, 'group'] = i

This makes groupby operations very convenient, for instance calculating the mean value of each group:

df.groupby(by='group').mean()
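
A self-contained sketch of this labelling approach, assuming a small made-up frame and N = 3. Here the labels are built from row positions rather than from df.index, so it also works when index labels repeat; that is a variation on the answer, not its exact code.

```python
import numpy as np
import pandas as pd

N = 3  # number of groups (illustrative value)
df = pd.DataFrame({"value": np.arange(10.0)})

# Split the row positions into N nearly equal blocks, then write the
# block number of each row into a new 'group' column.
group = np.empty(len(df), dtype=int)
for i, positions in enumerate(np.array_split(np.arange(len(df)), N)):
    group[positions] = i
df["group"] = group

group_means = df.groupby("group")["value"].mean()
```

The frame stays in one piece, so any groupby aggregation can be applied per group without first materialising the sub-frames.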
