Pandas to_csv overwriting, prevent data loss
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/42409707/
Asked by user1506145
I have a script that is constantly updating a DataFrame and saving it to disk (overwriting the old CSV file). I found out that if I interrupt the program right at the saving call, df.to_csv("df.csv"), all data is lost, and df.csv is empty, containing only the column index.
I can perhaps work around this by temporarily saving the data to df.temp.csv and then replacing df.csv. But is there a short, Pythonic way to make the save atomic and prevent data loss? This is the stack trace I get when interrupting right at the saving call.
Traceback (most recent call last):
File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1531, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 938, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Users/user/test.py", line 49, in <module>
d.to_csv("out.csv", index=False)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1344, in to_csv
formatter.save()
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1551, in save
self._save()
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1652, in _save
self._save_chunk(start_i, end_i)
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1666, in _save_chunk
quoting=self.quoting)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 1443, in to_native_types
return formatter.get_result_as_array()
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2171, in get_result_as_array
formatted_values = format_values_with(float_format)
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2157, in format_values_with
for val in values.ravel()[imask]])
File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2108, in base_formatter
return str(v) if notnull(v) else self.na_rep
File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 250, in notnull
res = isnull(obj)
File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 73, in isnull
def isnull(obj):
File "_pydevd_bundle/pydevd_cython.pyx", line 937, in _pydevd_bundle.pydevd_cython.ThreadTracer.__call__ (_pydevd_bundle/pydevd_cython.c:15522)
File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/_pydev_bundle/pydev_is_thread_alive.py", line 14, in is_thread_alive
def is_thread_alive(t):
KeyboardInterrupt
Answered by Blckknght
You can create a context manager to handle your atomic overwriting:
import os
import contextlib

@contextlib.contextmanager
def atomic_overwrite(filename):
    temp = filename + '~'
    with open(temp, "w") as f:
        yield f
    os.rename(temp, filename) # this will only happen if no exception was raised
The to_csv method on a Pandas DataFrame will accept a file object instead of a path, so you can use:
with atomic_overwrite("df.csv") as f:
    df.to_csv(f)
The temporary filename I chose is the requested filename with a tilde at the end. You can of course change the code to use something else if you want. I'm also not exactly sure what mode the file should be opened with; you may need "wb" instead of just "w".
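A more defensive variant of the same idea might look like the sketch below. It is not part of the original answer: the name atomic_write and its parameters are made up for illustration, and it assumes Python 3. It creates the temporary file in the target's directory so the final move stays on one filesystem, uses os.replace (which also overwrites an existing file on Windows), and deletes the partial temporary file if the write is interrupted.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_write(path, mode="w", **open_kwargs):
    """Yield a temporary file; move it over `path` only if the block succeeds.

    Hypothetical helper for illustration, not a pandas or stdlib API.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, mode, **open_kwargs) as f:
            yield f
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swaps the new file in, overwriting the old one
    except BaseException:
        # Interrupted or failed: drop the partial temp file, keep the old target.
        with contextlib.suppress(OSError):
            os.remove(tmp_path)
        raise
```

With pandas this would be used as `with atomic_write("df.csv", newline="") as f: df.to_csv(f)`. On the question of file mode: in Python 3, text mode "w" should be the right choice, and the pandas documentation suggests opening file objects passed to to_csv with newline="". One caveat of this sketch is that mkstemp creates the file with owner-only permissions, so the replaced file may end up with different permissions than the old one.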
Answered by languitar
The best you can do is to implement a signal handler (signal module) that delays terminating the program until the last write operation has finished.
Something along these lines (pseudo-code):
import signal
import sys
import threading
import time
import pandas as pd

lock = threading.Lock()

def handler(signum, frame):
    # ensure that latest data is written
    sys.exit(1)

signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

while True:
    # might exit any time.
    df.to_csv(...)
    time.sleep(1)
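One way to fill in the pseudo-code is to record signals that arrive during a write and deliver them again once the write is done. The sketch below is an assumption about how that could be done, not the answerer's code: defer_signals is a hypothetical name, it uses flags in place of the lock, it must run in the main thread, and it relies on signal.raise_signal, which needs Python 3.8 or later.

```python
import contextlib
import signal


@contextlib.contextmanager
def defer_signals(signums=(signal.SIGINT, signal.SIGTERM)):
    """Hold back the given signals until the block has finished (main thread only).

    Hypothetical helper for illustration.
    """
    received = []
    previous = {}
    for signum in signums:
        # Remember the old handler; for now only record that the signal arrived.
        previous[signum] = signal.signal(signum, lambda s, frame: received.append(s))
    try:
        yield
    finally:
        for signum, old in previous.items():
            # signal.signal() returns None for handlers not installed from Python.
            signal.signal(signum, old if old is not None else signal.SIG_DFL)
        if received:
            # Deliver the first deferred signal again so the original handler runs.
            signal.raise_signal(received[0])
```

The save loop would then wrap the write as `with defer_signals(): df.to_csv("df.csv")`, so Ctrl+C takes effect only after the file is complete. This cannot help against SIGKILL or a power failure, which no handler can intercept, so writing to a temporary file and renaming it is still the safer approach.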