Python Pandas Persistent Cache
Original question: http://stackoverflow.com/questions/51235360/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverFlow
Python pandas persistent cache
Asked by Luca C.
Is there an implementation for Python pandas that caches the data on disk, so I can avoid reproducing it every time?
In particular, is there a caching method for get_yahoo_data for financial data?
A big plus would be:
- very few lines of code to write
- the possibility to merge the persisted series with new data downloaded for the same source
Answered by nijm
There are many ways to achieve this, but probably the easiest way is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame to disk and pandas.read_pickle to read the stored DataFrame from disk.
An example for a pandas.DataFrame:
import pandas

# Store your DataFrame
df.to_pickle('cached_dataframe.pkl')  # will be stored in the current directory
# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl')  # read from the current directory
The same methods also work for pandas.Series:
# Store your Series
series.to_pickle('cached_series.pkl')  # will be stored in the current directory
# Read your Series
series = pandas.read_pickle('cached_series.pkl')  # read from the current directory
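To also cover the asker's second wish (folding newly downloaded data into the persisted series), a small helper could sit on top of these pickle calls. This is only a minimal sketch: cached_fetch, the cache path, and the fetch function passed in are hypothetical names, not part of pandas or of the answer above.

import os
import pandas

def cached_fetch(cache_path, fetch_func):
    """Download new data and merge it into a pickle cache on disk."""
    new_data = fetch_func()  # e.g. a DataFrame/Series indexed by date
    if os.path.exists(cache_path):
        cached = pandas.read_pickle(cache_path)
        # prefer freshly downloaded values, fall back to cached rows
        # for index entries that were not re-downloaded
        new_data = new_data.combine_first(cached)
    new_data.to_pickle(cache_path)
    return new_data

Calling it as cached_fetch('yahoo_cache.pkl', my_download_function) would then return the full merged history while keeping the on-disk copy up to date.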
Answered by YaOzI
Depending on your requirements, there are a dozen methods to do that, back and forth, in CSV, Excel, JSON, Python pickle format, HDF5 and even SQL with a database, etc.
In terms of lines of code, the to_*/read_* calls for many of these formats are just one line of code in each direction. Python and pandas already make the code as clean as possible, so you can worry less about that.
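For illustration, a few of those one-line round trips (the file names are placeholders, and the HDF5 variant needs the optional tables package installed):

import pandas as pd

df = pd.DataFrame({'close': [170.5, 171.2], 'volume': [1000, 1200]})

df.to_csv('cache.csv', index=False)      # human-readable text
df = pd.read_csv('cache.csv')

df.to_json('cache.json')                 # data interchange
df = pd.read_json('cache.json')

df.to_hdf('cache.h5', key='prices')      # binary, good for large numeric data
df = pd.read_hdf('cache.h5', key='prices')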
I think there is no single solution that fits all requirements; it really is case by case:
- for human readability of saved data: CSV, Excel
- for binary Python object serialization (use cases): Pickle
- for data interchange: JSON
- for long-term storage and incremental updates: SQL
- etc.
And if you want to update the stock prices daily and keep them for later use, I prefer pandas with SQL queries; of course this adds a few lines of code to set up the DB connection:
import pandas as pd
from sqlalchemy import create_engine

new_data = getting_daily_price()  # placeholder for your own download function

# You can also choose other db drivers instead of `sqlalchemy`
engine = create_engine('sqlite:///:memory:')

with engine.connect() as conn:
    new_data.to_sql('table_name', conn)         # to write
    df = pd.read_sql_table('table_name', conn)  # to read
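For the incremental-update part, a plausible pattern (the table name 'table_name' is still just a placeholder) is to append each day's new rows instead of recreating the table, then read the accumulated history back, or just the slice you need, with a query:

with engine.connect() as conn:
    # append the new rows to the existing table instead of recreating it
    new_data.to_sql('table_name', conn, if_exists='append')
    # read everything back (or filter it in SQL)
    df = pd.read_sql('SELECT * FROM table_name', conn)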