Python: Using Pandas to create a DataFrame with Series, resulting in a memory error

Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it under the same license, but you must attribute it to the original authors (not me) and cite the original StackOverflow post: http://stackoverflow.com/questions/17165340/

Time: 2020-08-19 00:39:08  Source: igfitidea


Tags: python, numpy, pandas

Asked by Mattijn

I'm using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV in chunks, but I've run into an issue. My code generates 6 NumPy arrays that I convert to Pandas Series. Each of these Series contains a lot of items:

>>> prcpSeries.shape
(12626172,)

I would like to add the Series to a Pandas DataFrame (df) so I can save them chunk by chunk to a CSV file:

d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}

df = pd.DataFrame(d)
outFile = 'F:/data/output/run1/_' + str(i) + '.out'
df.to_csv(outFile, header=False, chunksize=1000)
d = None
df = None

But my code gets stuck at the following line, raising a MemoryError:

df = pd.DataFrame(d)

Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?

Accepted answer by Andy Hayden

If you know that all of these have the same length, you can create the DataFrame directly from one array and then add each remaining column:

df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
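
As a minimal, self-contained sketch of this column-by-column approach (the small arrays and the names n, prcp, tmax and tmin are hypothetical stand-ins for the question's data, and whether this avoids the MemoryError depends on the memory available):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the question's arrays: same length, much smaller.
n = 5
prcp = np.arange(n, dtype="float32")
tmax = np.arange(n, dtype="float32") + 10
tmin = np.arange(n, dtype="float32") - 10

# Build the frame from one array, then attach the others one column at a time.
df = pd.DataFrame(prcp, columns=["prcp"])
df["tmax"] = tmax
df["tmin"] = tmin

print(df.shape)
print(df.dtypes)
```

Assigning an array of the same length as the frame simply adds it as a new column; the arrays are never wrapped in intermediate Series objects first.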


Note: you can also use the to_frame method, which optionally takes a name (useful if the Series doesn't have one):

df = prcpSeries.to_frame(name='prcp')
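
A small runnable sketch of the same idea, assuming the data is already a pandas Series (to_frame is a Series method, so a raw NumPy array would need wrapping in pd.Series first; prcp_series and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical small, unnamed Series; the question's data would be far larger.
prcp_series = pd.Series(np.array([0.0, 1.5, 3.0]))

# to_frame turns the Series into a one-column DataFrame named 'prcp'.
df = prcp_series.to_frame(name="prcp")

# Further same-length columns can then be attached as before.
df["tmax"] = np.array([20.0, 21.0, 19.5])

print(df)
```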


However, if the Series have different lengths, this will lose data: column assignment aligns on the DataFrame's index, so any Series longer than prcpSeries is truncated (and assigning a plain NumPy array of a different length raises a ValueError). An alternative here is to create each as a DataFrame and then perform an outer join (using concat):

df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...

df = pd.concat([df1, df2, ...], join='outer', axis=1)

For example:

In [21]: dfA = pd.DataFrame([1,2], columns=['A'])

In [22]: dfB = pd.DataFrame([1], columns=['B'])

In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A   B
0  1   1
1  2 NaN
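
To make the contrast with plain column assignment concrete, here is a short sketch under the assumption that the inputs are pandas Series of unequal length (the names short, long_, assigned and joined are illustrative only):

```python
import pandas as pd

short = pd.Series([1, 2], name="A")
long_ = pd.Series([10, 20, 30], name="B")

# Column assignment aligns on the existing index (0 and 1),
# so the third value of B is silently dropped.
assigned = short.to_frame()
assigned["B"] = long_

# An outer concat keeps the union of both indexes and fills the gap in A with NaN.
joined = pd.concat([short, long_], axis=1, join="outer")

print(assigned.shape)
print(joined.shape)
```

The assigned frame keeps only two rows, while the concatenated frame keeps all three, with a missing value in column A for the last row.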