Pandas to_csv overwriting, prevent data loss

Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/42409707/

Time: 2020-09-14 03:02:54  Source: igfitidea

Tags: python, csv, pandas

Asked by user1506145

I have a script that constantly updates a DataFrame and saves it to disk (overwriting the old CSV file). I found that if I interrupt the program right at the saving call, df.to_csv("df.csv"), all data is lost, and df.csv is left empty except for the column header row.

I could work around this by first saving the data to df.temp.csv and then replacing df.csv with it. But is there a short, Pythonic way to make the save atomic and prevent data loss? This is the stack trace I get when interrupting right at the saving call:

Traceback (most recent call last):
  File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1531, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 938, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Users/user/test.py", line 49, in <module>
    d.to_csv("out.csv", index=False)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1344, in to_csv
    formatter.save()
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1551, in save
    self._save()
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1652, in _save
    self._save_chunk(start_i, end_i)
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 1666, in _save_chunk
    quoting=self.quoting)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 1443, in to_native_types
    return formatter.get_result_as_array()
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2171, in get_result_as_array
    formatted_values = format_values_with(float_format)
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2157, in format_values_with
    for val in values.ravel()[imask]])
  File "/usr/local/lib/python2.7/site-packages/pandas/formats/format.py", line 2108, in base_formatter
    return str(v) if notnull(v) else self.na_rep
  File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 250, in notnull
    res = isnull(obj)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/common.py", line 73, in isnull
    def isnull(obj):
  File "_pydevd_bundle/pydevd_cython.pyx", line 937, in _pydevd_bundle.pydevd_cython.ThreadTracer.__call__ (_pydevd_bundle/pydevd_cython.c:15522)
  File "/opt/homebrew-cask/Caskroom/pycharm/2016.1.3/PyCharm.app/Contents/helpers/pydev/_pydev_bundle/pydev_is_thread_alive.py", line 14, in is_thread_alive
    def is_thread_alive(t):
KeyboardInterrupt

Answered by Blckknght

You can create a context manager to handle your atomic overwriting:

import os
import contextlib

@contextlib.contextmanager
def atomic_overwrite(filename):
    temp = filename + '~'
    with open(temp, "w") as f:
        yield f
    os.rename(temp, filename) # this will only happen if no exception was raised

The to_csv method on a Pandas DataFrame will accept a file object instead of a path, so you can use:

with atomic_overwrite("df.csv") as f:
    df.to_csv(f)

The temporary filename I chose is the requested filename with a tilde at the end. You can of course change the code to use something else if you want. I'm also not exactly sure what mode the file should be opened with; you may need "wb" instead of just "w".
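A somewhat more defensive variant of the same idea, as a minimal sketch assuming Python 3.3+ (needed for os.replace). The name atomic_write and its parameters are hypothetical; they are not part of pandas or of the answer above. It creates the temporary file in the same directory as the target so the final rename stays on one filesystem, flushes the data to disk before renaming, and deletes the temporary file if the block raises. On the mode question: the pandas documentation says a non-binary file object passed to to_csv should be opened with newline="", so the usage comment passes that through.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def atomic_write(filename, mode="w", **open_kwargs):
    """Yield a temp file; replace `filename` with it only on success.

    Hypothetical helper for illustration, not a pandas API.
    """
    directory = os.path.dirname(os.path.abspath(filename))
    # Same directory as the target, so os.replace never crosses filesystems.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, mode, **open_kwargs) as f:
            yield f
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(temp_path, filename)  # overwrites an existing target
    except BaseException:
        # Includes KeyboardInterrupt: drop the partial file, keep the old one.
        with contextlib.suppress(FileNotFoundError):
            os.remove(temp_path)
        raise


# Usage with pandas (assumes `df` is an existing DataFrame):
# with atomic_write("df.csv", newline="") as f:
#     df.to_csv(f)
```

If the write is interrupted, df.csv keeps its previous contents and no stray temporary file is left behind. Note that mkstemp creates the file with restrictive permissions, so the replaced file may not keep the permissions of the old one.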

Answered by languitar

The best you can do is to implement a signal handler (signal module) that delays terminating the program until the last write operation has finished.

Something along these lines (pseudo-code):

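import signal
import sys
import time
import pandas as pd

stop_requested = False

def handler(signum, frame):
    # Do not exit here: the handler may run in the middle of to_csv.
    # Only record the request; the main loop exits once the write is done.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handler)
signal.signal(signal.SIGINT, handler)

df = pd.DataFrame({"a": [1, 2, 3]})  # placeholder data for illustration

while not stop_requested:
    # A signal arriving during this call only sets the flag,
    # so the write always runs to completion.
    df.to_csv("df.csv")
    time.sleep(1)

sys.exit(1)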
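
The same "wait until the write has finished" idea can also be packaged as a context manager that holds signals back only while the write is in progress. This is a minimal sketch, assuming Python 3.8+ (for signal.raise_signal) and that it runs in the main thread; the name defer_signals is hypothetical. While the block runs, SIGINT and SIGTERM are only recorded; afterwards the original handlers are restored and the first recorded signal is delivered again.

```python
import contextlib
import signal


@contextlib.contextmanager
def defer_signals(signums=(signal.SIGINT, signal.SIGTERM)):
    """Hold back the given signals until the with-block has finished.

    Hypothetical helper for illustration; call it from the main thread only.
    """
    received = []  # signal numbers seen while the block was running
    previous = {}
    for signum in signums:
        # Temporarily install a handler that only records the signal.
        previous[signum] = signal.signal(
            signum, lambda s, frame: received.append(s)
        )
    try:
        yield
    finally:
        for signum, old_handler in previous.items():
            signal.signal(signum, old_handler)  # restore original handlers
        if received:
            # Deliver the first deferred signal to the restored handler.
            signal.raise_signal(received[0])


# Usage with pandas (assumes `df` is an existing DataFrame):
# with defer_signals():
#     df.to_csv("df.csv")
```

With the default SIGINT handler, pressing Ctrl-C during the write lets to_csv finish and then raises KeyboardInterrupt as the block exits. This does not protect against SIGKILL or a power failure, so it complements the temporary-file approach rather than replacing it.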