How to store a dataframe using Pandas
Disclaimer: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original address, and attribute it to the original authors (not me): Stack Overflow.
Original: http://stackoverflow.com/questions/17098654/
Asked by jeffstern
Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?
Accepted answer by Andy Hayden
The easiest way is to pickle it using to_pickle:
df.to_pickle(file_name) # where to save it, usually as a .pkl
Then you can load it back using:
df = pd.read_pickle(file_name)
Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).
Another popular choice is to use HDF5 (PyTables), which offers very fast access times for large datasets:
from pandas import HDFStore

store = HDFStore('store.h5')   # open (or create) the store file
store['df'] = df               # save it
df = store['df']               # load it
More advanced strategies are discussed in the cookbook.
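As a taste of what those strategies enable: storing in PyTables' "table" format makes the frame queryable on disk, so you can read back only the rows you need. A minimal sketch, assuming PyTables is installed; the key and column names are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randn(100)})

store = pd.HDFStore('store.h5')
store.put('df', df, format='table')               # 'table' format supports on-disk queries
subset = store.select('df', where='index < 10')   # read back only the first ten rows
store.close()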
Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).
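A minimal sketch of the msgpack route. Note this only applies to older pandas: to_msgpack/read_msgpack shipped from roughly 0.13 through 0.25 and were removed in 1.0:

import pandas as pd

# Only works on pandas versions that still ship the msgpack API.
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_msgpack('df.msg')          # save
df2 = pd.read_msgpack('df.msg')  # load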
Answer by Noah
If I understand correctly, you're already using pandas.read_csv() but would like to speed up the development process so that you don't have to load the file every time you edit your script, is that right? I have a few recommendations:
- You could load in only part of the CSV file using pandas.read_csv(..., nrows=1000) to read just the top of the table while you're doing the development (see the sketch after this list).
- Use IPython for an interactive session, so that you keep the pandas table in memory as you edit and reload your script.
- Convert the CSV to an HDF5 table.
- Updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format, which is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data).
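A minimal sketch of the first recommendation above; the file name is illustrative:

import pandas as pd

# While developing, read only the first 1000 rows so each run is fast;
# drop the nrows argument for the full dataset.
df = pd.read_csv('large_file.csv', nrows=1000)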
You might also be interested in this answer on Stack Overflow.
Answer by Anbarasu Ramachandran
Pickle works well!
import pandas as pd

df.to_pickle('123.pkl')          # save the dataframe df to 123.pkl
df1 = pd.read_pickle('123.pkl')  # load 123.pkl back into a dataframe df1
Answer by agold
Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.
They compare:
- pickle: original ASCII data format
- cPickle: a C implementation of pickle
- pickle-p2: uses the newer binary format
- json: the standard-library json module
- json-no-index: like json, but without index
- msgpack: a binary JSON alternative
- CSV
- hdfstore: HDF5 storage format
In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:
You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself.
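A minimal sketch of such a do-it-yourself benchmark, using only pandas and the standard library; the DataFrame is synthetic and the file names are illustrative:

import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': np.random.rand(100000),
                   'text': ['some text'] * 100000})

# Time each save method on your own data before picking a format.
for label, save in [('pickle', lambda: df.to_pickle('bench.pkl')),
                    ('csv', lambda: df.to_csv('bench.csv'))]:
    start = time.perf_counter()
    save()
    print(label, time.perf_counter() - start)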
The source code for the test which they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:
They also mention that with the conversion of text data to categorical data, the serialization is much faster. In their test, about 10 times as fast (also see the test code).
Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph, however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.
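To see the size difference for yourself, here is a minimal sketch comparing the ASCII protocol 0 with the binary protocol 2 on a small synthetic frame:

import pickle
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': np.random.rand(10000)})

ascii_bytes = pickle.dumps(df, protocol=0)    # printable ASCII representation
binary_bytes = pickle.dumps(df, protocol=2)   # newer binary format
print(len(ascii_bytes), len(binary_bytes))    # protocol 0 output is much larger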
Some other references:
- In the question Fastest Python library to read a CSV file there is a very detailed answer which compares different libraries for reading csv files with a benchmark. The result is that for reading csv files numpy.fromfile is the fastest.
- Another serialization test shows msgpack, ujson, and cPickle to be the quickest at serializing.
Answer by mgoldwasser
Pandas DataFrames have the to_pickle function, which is useful for saving a DataFrame:
import pandas as pd

a = pd.DataFrame({'A': [0, 1, 0, 1, 0], 'B': [True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False
Answer by mark jay
Numpy file formats are pretty fast for numerical data
I prefer to use numpy files since they're fast and easy to work with. Here's a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.
import numpy as np
import pandas as pd
num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)
Using IPython's %%timeit magic function:
%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)
The output is:
100 loops, best of 3: 5.97 ms per loop
To load the data back into a dataframe:
%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)
The output is:
100 loops, best of 3: 5.12 ms per loop
NOT BAD!
CONS
There's a problem if you save the numpy file using Python 2 and then try opening it using Python 3 (or vice versa).
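If the file contains pickled Python objects (for example an object-dtype column) written under Python 2, one commonly cited workaround when loading under Python 3 is the encoding argument of np.load; a sketch, not a guaranteed fix:

import numpy as np

# allow_pickle is required for object arrays; encoding='latin1' maps
# Python 2 byte strings so they can be unpickled under Python 3.
data = np.load('num.npy', allow_pickle=True, encoding='latin1')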
Answer by Huanyu Liao
You can use a feather format file. It is extremely fast.
df.to_feather('filename.ft')
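To read it back, a one-line sketch, assuming a pandas build with a feather backend (pyarrow or feather-format) installed:

import pandas as pd

df = pd.read_feather('filename.ft')

Note that feather does not store a custom index, so you may need df.reset_index(drop=True) before saving.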
Answer by Anirban Manna
import pickle

example_dict = {1: "6", 2: "2", 3: "g"}

with open("dict.pickle", "wb") as pickle_out:
    pickle.dump(example_dict, pickle_out)
The above code will save the pickle file.
with open("dict.pickle", "rb") as pickle_in:
    example_dict = pickle.load(pickle_in)
These two lines will load the saved pickle file.
Answer by Michael Dorner
As already mentioned, there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:
pickle is a potential security risk. From the Python documentation for pickle:
Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
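To make that warning concrete, here is a deliberately harmless sketch of the risk: unpickling can execute arbitrary code through an object's __reduce__ hook:

import os
import pickle

class Malicious:
    def __reduce__(self):
        # An attacker could return any command here; this one is harmless.
        return (os.system, ('echo pwned',))

payload = pickle.dumps(Malicious())
pickle.loads(payload)   # prints "pwned": code ran during unpickling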
Depending on your setup/usage this limitation may not apply, but I would not recommend pickle as the default persistence format for pandas data frames.
Answer by Gilco
https://docs.python.org/3/library/pickle.html
The pickle protocol formats:
- Protocol version 0 is the original "human-readable" protocol and is backwards compatible with earlier versions of Python.
- Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.
- Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.
- Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.
- Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
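To choose a protocol explicitly, the pickle module exposes the versions your interpreter supports, and (assuming pandas >= 0.21, where to_pickle grew a protocol argument) you can pass one when saving a frame; a minimal sketch:

import pickle
import pandas as pd

print(pickle.HIGHEST_PROTOCOL)          # highest protocol this interpreter supports

df = pd.DataFrame({'A': [1, 2, 3]})
df.to_pickle('df_v2.pkl', protocol=2)   # binary format readable by Python 2.3+ as well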


