Python: Creating a DataFrame from Series with Pandas results in a memory error

Question

Asked by Mattijn

I'm using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV in chunks, but I have run into an issue. My code generates 6 NumPy arrays that I convert to Pandas Series. Each of these Series contains a lot of items:

>>> prcpSeries.shape
(12626172,)

I would like to add the Series into a Pandas DataFrame (df) so I can save them chunk by chunk to a csv file.

d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}

df = pd.DataFrame(d)
outFile = 'F:/data/output/run1/_' + str(i) + '.out'
df.to_csv(outFile, header=False, chunksize=1000)
d = None
df = None

But my code gets stuck at the following line, raising a MemoryError:

df = pd.DataFrame(d)

Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?

Answer 1

Accepted answer by Andy Hayden

If you know that all of these have the same length, you can create the DataFrame directly from the first array and then append the remaining columns one at a time:

df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
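
A minimal, self-contained sketch of this column-by-column approach. The small random arrays are stand-ins for the real data, and the float32 dtype is an assumption for illustration, not something stated in the question:

```python
import numpy as np
import pandas as pd

n = 1000  # stand-in size; the question's arrays hold 12,626,172 items each
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the arrays produced by the question's code.
names = ['prcp', 'tmax', 'tmin', 'ndvi', 'lstm', 'evtm']
arrays = {name: rng.random(n).astype(np.float32) for name in names}

# Start from a single column, then attach the remaining ones one at a time.
df = pd.DataFrame(arrays[names[0]], columns=[names[0]])
for name in names[1:]:
    df[name] = arrays[name]  # equal-length arrays are assigned positionally
```

Whether this avoids the MemoryError depends on how much memory is available; it only sidesteps the intermediate dict of Series.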

Note: you can also use the to_frame method, which optionally takes a name (useful if the Series doesn't have one):

df = prcpSeries.to_frame(name='prcp')
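
For instance, with a tiny unnamed Series (the values here are made up for illustration):

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3])   # unnamed Series, so s.name is None
df = s.to_frame(name='prcp')     # the name becomes the column label
df['tmax'] = [20.5, 21.0, 19.8]  # further columns can be attached as before
```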

However, if the Series have different lengths, this will lose some data: values in any Series longer than prcpSeries are dropped when it is aligned to the DataFrame's index. An alternative is to create each one as a DataFrame and then perform an outer join (using concat):

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...

df = pd.concat([df1, df2, ...], join='outer', axis=1)

For example:

In [21]: dfA = pd.DataFrame([1,2], columns=['A'])

In [22]: dfB = pd.DataFrame([1], columns=['B'])

In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A   B
0  1   1
1  2 NaN
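
On the question's underlying goal of writing the CSV chunk by chunk: one possible alternative, not part of the answer above, is to skip building the full DataFrame and write row slices of the arrays in append mode. A rough sketch, assuming equal-length arrays; the helper name write_csv_in_chunks, the chunk size, and the output path are all hypothetical:

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_csv_in_chunks(arrays, path, chunk_rows=100_000):
    """Write equal-length 1-D arrays to a headerless CSV, one row slice at a time."""
    n = len(next(iter(arrays.values())))
    if any(len(a) != n for a in arrays.values()):
        raise ValueError("all arrays must have the same length")
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Only this slice is materialised as a DataFrame.
        chunk = pd.DataFrame({k: a[start:stop] for k, a in arrays.items()},
                             index=np.arange(start, stop))
        # Truncate the file on the first slice, append afterwards.
        chunk.to_csv(path, header=False, mode='w' if start == 0 else 'a')


# Tiny stand-in data; the real arrays would come from the question's code.
data = {'prcp': np.arange(10.0), 'tmax': np.arange(10.0) + 20}
out_path = os.path.join(tempfile.mkdtemp(), 'run1.out')
write_csv_in_chunks(data, out_path, chunk_rows=4)
```

As in the question's to_csv call, the row index is written as the first field and no header row is emitted.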
