Python: Using Pandas to create a DataFrame from Series results in a memory error
Notice: this page reproduces a popular Stack Overflow question and its accepted answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/17165340/
Asked by Mattijn
I'm using the Pandas library for remote sensing time series analysis. Eventually I would like to save my DataFrame to CSV in chunks, but I have run into an issue. My code generates 6 NumPy arrays that I convert to Pandas Series. Each of these Series contains a large number of items:
>>> prcpSeries.shape
(12626172,)
I would like to combine the Series into a Pandas DataFrame (df) so that I can save them chunk by chunk to a CSV file.
d = {'prcp': pd.Series(prcpSeries),
     'tmax': pd.Series(tmaxSeries),
     'tmin': pd.Series(tminSeries),
     'ndvi': pd.Series(ndviSeries),
     'lstm': pd.Series(lstmSeries),
     'evtm': pd.Series(evtmSeries)}
df = pd.DataFrame(d)
outFile = 'F:/data/output/run1/_' + str(i) + '.out'
df.to_csv(outFile, header=False, chunksize=1000)
d = None
df = None
But my code gets stuck at the following line, which raises a MemoryError:
df = pd.DataFrame(d)
Any suggestions? Is it possible to fill the Pandas DataFrame chunk by chunk?
Accepted answer by Andy Hayden
If you know that all of these arrays are the same length, you can create the DataFrame directly from one array and then add each remaining column:
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries
...
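As a minimal sketch of the column-by-column approach above, the following uses small synthetic arrays in place of the 12-million-element ones from the question. The values and dtype are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two of the arrays in the question
prcpSeries = np.arange(5, dtype=np.float32)
tmaxSeries = np.arange(5, dtype=np.float32) + 10

# Start from one array, then attach the others one column at a time
df = pd.DataFrame(prcpSeries, columns=['prcp'])
df['tmax'] = tmaxSeries

print(df.shape)
```

The remaining columns would be attached in the same way as tmax.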
Note: you can also use the to_frame method, which optionally accepts a name (useful if the Series doesn't have one):
df = prcpSeries.to_frame(name='prcp')
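A small sketch of to_frame on an unnamed Series follows. The values are made up for illustration. Note that to_frame is a Series method, so a plain NumPy array would first need to be wrapped in pd.Series:

```python
import pandas as pd

# An unnamed Series with hypothetical values
s = pd.Series([0.1, 0.2, 0.3])

# to_frame returns a one-column DataFrame; name sets the column label
df = s.to_frame(name='prcp')

print(list(df.columns))
```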
However, if the arrays have different lengths, this will lose some data (from any array longer than prcpSeries). An alternative is to create each one as a DataFrame and then perform an outer join (using concat):
df1 = pd.DataFrame(prcpSeries, columns=['prcp'])
df2 = pd.DataFrame(tmaxSeries, columns=['tmax'])
...
df = pd.concat([df1, df2, ...], join='outer', axis=1)
For example:
In [21]: dfA = pd.DataFrame([1,2], columns=['A'])
In [22]: dfB = pd.DataFrame([1], columns=['B'])
In [23]: pd.concat([dfA, dfB], join='outer', axis=1)
Out[23]:
   A    B
0  1    1
1  2  NaN
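On the question's last point about working chunk by chunk: the sketch below is not part of the accepted answer and swaps in a different technique. Instead of building one large DataFrame, it builds a small DataFrame per slice of rows and appends each slice to the CSV file. It assumes all arrays have the same length; the array contents, chunk size, and output path are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the equal-length arrays in the question
arrays = {
    'prcp': np.arange(10, dtype=np.float64),
    'tmax': np.arange(10, dtype=np.float64) + 100,
}
n_rows = 10
chunk_rows = 4  # assumed chunk size; tune to the available memory

# Hypothetical output path in a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'run1.out')

for start in range(0, n_rows, chunk_rows):
    stop = min(start + chunk_rows, n_rows)
    # Build a DataFrame for this slice only, keeping the global row index
    chunk = pd.DataFrame(
        {name: values[start:stop] for name, values in arrays.items()},
        index=range(start, stop),
    )
    # Append to the file; header=False matches the to_csv call in the question
    chunk.to_csv(out_path, mode='a', header=False)

# Read the file back to check that every row was written
written = pd.read_csv(out_path, header=None)
print(written.shape)
```

Only one slice is held as a DataFrame at any time, so peak memory is governed by chunk_rows rather than by the full array length.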

