Python: How to store a dataframe using Pandas

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/17098654/


How to store a dataframe using Pandas

python, pandas, dataframe

Asked by jeffstern

Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?

Accepted answer by Andy Hayden

The easiest way is to pickle it using to_pickle:

df.to_pickle(file_name)  # where to save it, usually as a .pkl

Then you can load it back using:

df = pd.read_pickle(file_name)

Note: before 0.11.1, save and load were the only way to do this (they are now deprecated in favor of to_pickle and read_pickle respectively).


Another popular choice is to use HDF5 (PyTables), which offers very fast access times for large datasets:

from pandas import HDFStore  # HDFStore is backed by PyTables (the 'tables' package)

store = HDFStore('store.h5')

store['df'] = df  # save it
store['df']  # load it
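
Equivalently, the same HDF5 storage is available through DataFrame.to_hdf and pd.read_hdf (a minimal sketch; 'store.h5' and the key 'df' are placeholders):

df.to_hdf('store.h5', key='df', mode='w')  # save
df = pd.read_hdf('store.h5', 'df')         # load it back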

More advanced strategies are discussed in the cookbook.


Since 0.13 there's also msgpack, which may be better for interoperability, as a faster alternative to JSON, or if you have python object/text-heavy data (see this question).

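A minimal round-trip sketch; note that the msgpack API was deprecated in pandas 0.25 and removed in 1.0, so this assumes an older pandas:

df.to_msgpack('df.msg')         # serialize (pandas 0.13 up to 0.25 only)
df = pd.read_msgpack('df.msg')  # read it back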

Answered by Noah

If I understand correctly, you're already using pandas.read_csv() but would like to speed up the development process so that you don't have to load the file in every time you edit your script, is that right? I have a few recommendations:

  1. you could load in only part of the CSV file, using pandas.read_csv(..., nrows=1000) to read just the top of the table while you're doing the development

  2. use ipython for an interactive session, so that you keep the pandas table in memory as you edit and reload your script.

  3. convert the csv to an HDF5 table

  4. updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format, which is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data); see the sketch after this list.

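A short sketch of items 1 and 4 above (file names are placeholders; to_feather requires the pyarrow or feather-format package and a default RangeIndex):

import pandas as pd

# 1. during development, read only the first 1000 rows of the big CSV
df = pd.read_csv('big.csv', nrows=1000)

# 4. feather round-trip
df.to_feather('big.feather')
df = pd.read_feather('big.feather')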

You might also be interested in this answer on stackoverflow.

Answered by Anbarasu Ramachandran

Pickle works well!

import pandas as pd
df.to_pickle('123.pkl')          # save the dataframe df to 123.pkl
df1 = pd.read_pickle('123.pkl')  # load 123.pkl back into the dataframe df1

Answered by agold

Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

They compare:

  • pickle: original ASCII data format
  • cPickle, a C library
  • pickle-p2: uses the newer binary format
  • json: standard-library json library
  • json-no-index: like json, but without index
  • msgpack: binary JSON alternative
  • CSV
  • hdfstore: HDF5 storage format

In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself

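In that spirit, a minimal sketch for timing a couple of formats on your own data (the format list and file names are illustrative):

import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'num': np.random.rand(1_000_000)})

# time each writer; extend the list with other formats as needed
for name, save in [('pickle', lambda: df.to_pickle('df.pkl')),
                   ('csv', lambda: df.to_csv('df.csv'))]:
    start = time.perf_counter()
    save()
    print(name, 'took', time.perf_counter() - start, 'seconds')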

The source code for the test which they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

(figure: time comparison results)

They also mention that with the conversion of text data to categorical data the serialization is much faster; in their test, about 10 times as fast (also see the test code).

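That conversion is a one-liner (the column name 'text' here is hypothetical):

# convert a text column to the categorical dtype before serializing
df['text'] = df['text'].astype('category')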

Edit: The higher times for pickle than CSV can be explained by the data format used. By default pickle uses a printable ASCII representation, which generates larger data sets. As can be seen from the graph, however, pickle using the newer binary data format (version 2, pickle-p2) has much lower load times.

Answered by mgoldwasser

Pandas DataFrames have the to_pickle function, which is useful for saving a DataFrame:

import pandas as pd

a = pd.DataFrame({'A':[0,1,0,1,0],'B':[True, True, False, False, False]})
print(a)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

a.to_pickle('my_file.pkl')

b = pd.read_pickle('my_file.pkl')
print(b)
#    A      B
# 0  0   True
# 1  1   True
# 2  0  False
# 3  1  False
# 4  0  False

Answered by mark jay

NumPy file formats are pretty fast for numerical data

I prefer to use numpy files since they're fast and easy to work with. Here's a simple benchmark for saving and loading a dataframe with 1 column of 1 million points.

import numpy as np
import pandas as pd

num_dict = {'voltage': np.random.rand(1000000)}
num_df = pd.DataFrame(num_dict)

using ipython's %%timeit magic function

%%timeit
with open('num.npy', 'wb') as np_file:
    np.save(np_file, num_df)

the output is

100 loops, best of 3: 5.97 ms per loop

to load the data back into a dataframe

%%timeit
with open('num.npy', 'rb') as np_file:
    data = np.load(np_file)

data_df = pd.DataFrame(data)

the output is

100 loops, best of 3: 5.12 ms per loop

NOT BAD!

CONS

There's a problem if you save the numpy file using python 2 and then try to open it using python 3 (or vice versa).

Answered by Huanyu Liao

You can use the feather file format. It is extremely fast.

df.to_feather('filename.ft')
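
Reading it back is symmetric (same placeholder file name; feather support requires pyarrow):

df = pd.read_feather('filename.ft')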

Answered by Anirban Manna

import pickle

example_dict = {1: "6", 2: "2", 3: "g"}

# write the dict out as a pickle file
with open("dict.pickle", "wb") as pickle_out:
    pickle.dump(example_dict, pickle_out)

The above code will save the pickle file

# read the saved pickle file back in
with open("dict.pickle", "rb") as pickle_in:
    example_dict = pickle.load(pickle_in)

These lines will open the saved pickle file.

Answered by Michael Dorner

As already mentioned, there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

  1. pickle is a potential security risk. From the Python documentation for pickle:

Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

  2. pickle is slow. Benchmarks can be found here and here.

Depending on your setup/usage, both limitations may not apply, but I would not recommend pickle as the default persistence format for pandas data frames.

Answered by Gilco

https://docs.python.org/3/library/pickle.html

The pickle protocol formats:

Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.

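To pin a protocol explicitly, here is a sketch (recent pandas versions of to_pickle also accept a protocol argument):

import pickle

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# plain pickle with an explicit protocol version
with open('df.pkl', 'wb') as f:
    pickle.dump(df, f, protocol=4)  # protocol 4 adds large-object support (Python 3.4+)

# the pandas shortcut, where supported
df.to_pickle('df_p4.pkl', protocol=4)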