Pandas read_stata() with large .dta files

Disclaimer: This page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, link to the original, and attribute it to the original authors (not me) on StackOverflow.

Original question: http://stackoverflow.com/questions/19744527/
Asked by Jonathan
I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively large. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:
%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')
and I am using IPython in Enthought's Canopy program. The reason for the '%time' is that I am interested in benchmarking this against R's read.dta().
My questions are:
- Is there something I am doing wrong that is resulting in Pandas having issues?
- Is there a workaround to get the data into a Pandas dataframe?
Answered by Abraham D Flaxman
Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:
import pandas as pd

def load_large_dta(fname):
    import sys

    # Read the .dta file in chunks and append each chunk to one DataFrame
    reader = pd.read_stata(fname, iterator=True)
    df = pd.DataFrame()
    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            df = df.append(chunk, ignore_index=True)
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass
    print('\nloaded {} rows'.format(len(df)))
    return df
I loaded an 11G Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.
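For context, here is a minimal usage sketch of the function above; the file name is just a placeholder for your own .dta path:

# placeholder path; point this at your own .dta file
df = load_large_dta('my_dta_file.dta')
print(df.shape)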
Answered by AZhao
For all the people who end up on this page, please upgrade Pandas to the latest version. I had this exact problem with a stalled computer during load (300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.
Currently, it's v0.16.2. There have been significant improvements to speed, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)
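If you want to check which version you are running before upgrading, a quick sketch (pip is just one of several ways to upgrade):

import pandas as pd

print(pd.__version__)
# then, from a shell: pip install --upgrade pandas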
Answered by Jinhua Wang
There is a simpler way to solve it using Pandas' built-in function read_stata.
Assume your large file is named large.dta.
import pandas as pd

# Read the .dta file in chunks and append each chunk to one DataFrame
reader = pd.read_stata("large.dta", chunksize=100000)
df = pd.DataFrame()
for itm in reader:
    df = df.append(itm)
df.to_csv("large.csv")
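A small variant of the same idea collects the chunks in a list and concatenates them once at the end; pd.concat avoids repeatedly copying the growing DataFrame, which can matter for files this size. A sketch using the same illustrative file names:

import pandas as pd

# Gather the chunks first, then build the full DataFrame in one step
chunks = []
for chunk in pd.read_stata("large.dta", chunksize=100000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
df.to_csv("large.csv")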
Answered by Roberto Ferrer
Question 1.
There's not much I can say about this.
Question 2.
Consider exporting your .dta file to .csv using the Stata command outsheet or export delimited and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R and compare with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.
Run help outsheet for details on exporting.
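Assuming the export step has been done in Stata (for example with export delimited), the pandas side can then read the file in manageable pieces rather than all at once. A sketch with an illustrative file name:

import pandas as pd

# Read the exported CSV in chunks so memory use stays bounded,
# then combine the pieces into a single DataFrame
pieces = pd.read_csv("my_dta_file.csv", chunksize=100000)
df = pd.concat(pieces, ignore_index=True)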
Answered by javier
You should not be reading a 3GB+ file into an in-memory data object, that's a recipe for disaster (and has nothing to do with pandas). The right way to do this is to mem-map the file and access the data as needed.
You should consider converting your file to a more appropriate format (csv or hdf) and then you can use the Dask wrapper around the pandas DataFrame for chunk-loading the data as needed:
from dask import dataframe as dd

# If you don't want to use all the columns, make a selection
# (usecols is forwarded to pandas.read_csv)
columns = ['column1', 'column2']
data = dd.read_csv('your_file.csv', usecols=columns)
This will transparently take care of chunk-loading, multicore data handling and all that stuff.
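A one-time conversion along the lines suggested above, written as a hedged sketch: stream the .dta file into an HDF5 store chunk by chunk so the full 3 GB+ never sits in memory at once. The file names and store key are illustrative, and PyTables must be installed:

import pandas as pd

# Stream the Stata file into an HDF5 store in appendable chunks;
# string-heavy columns may need min_itemsize tuning in practice
reader = pd.read_stata("my_dta_file.dta", chunksize=100000)
with pd.HDFStore("my_data.h5", mode="w") as store:
    for chunk in reader:
        store.append("data", chunk, index=False)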

