Streaming data for pandas df

Warning: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/32594137/
Asked by Leb
I'm attempting to simulate the use of pandas to access a constantly changing file.
I have one file reading a csv file, adding a line to it then sleeping for a random time to simulate bulk input.
import pandas as pd
from time import sleep
import random

df2 = pd.DataFrame(data=[['test', 'trial']], index=None)

while True:
    df = pd.read_csv('data.csv', header=None)
    df = df.append(df2)  # append returns a new DataFrame; reassign or the row is lost
    df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))
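As an aside, `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; a version-proof equivalent of the appending step above uses `pd.concat`:

```python
import pandas as pd

# pd.concat is the modern replacement for DataFrame.append;
# ignore_index renumbers the combined rows 0..n-1.
df = pd.DataFrame([['a', 'b']])
df2 = pd.DataFrame([['test', 'trial']])
df = pd.concat([df, df2], ignore_index=True)
print(df.shape)  # (2, 2)
```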
The second file is checking for change in data by outputting the shape of the dataframe:
import pandas as pd

while True:
    df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
    print(df.shape)
The problem is that while I'm usually getting the correct shape of the DF, there are certain times when it outputs (0x2).
i.e.:
...
(10x2)
(10x2)
...
(10x2)
(0x2)
(11x2)
(11x2)
...
This does occur at some, but not between each, change in shape (the file adding to the dataframe).
Knowing this happens when the first script has the file open to add data and the second script is unable to access it, hence (0x2), could this cause any data loss?
I cannot directly access the stream, only the output file. Or are there any other possible solutions?
Edit
The purpose of this is to load the new data only (I have code that does that) and do analysis "on the fly". Some of the analysis will include output/sec, graphing (similar to a stream plot), and a few other numerical calculations.
The biggest issue is that I have access to the csv file only, and I need to be able to analyze the data as it comes without loss or delay.
Answered by Joshua Goldberg
One of the scripts is reading the file while the other is trying to write to the file. Both scripts cannot access the file at the same time. Like Padraic Cunningham says in the comments you can implement a lock file to solve this problem.
There is a Python package that will do just that, called lockfile, with documentation here.
Here is your first script with the lockfile package implemented:
import pandas as pd
from time import sleep
import random
from lockfile import FileLock

df2 = pd.DataFrame(data=[['test', 'trial']], index=None)
lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None)
        df = df.append(df2)  # reassign; append does not modify in place
        df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))  # sleep outside the lock so the reader can run
Here is your second script with the lockfile package implemented:
import pandas as pd
from time import sleep
from lockfile import FileLock

lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
        print(df.shape)
    sleep(0.100)
I added a wait of 100ms so that I could slow down the output to the console.
These scripts create a file called "data.lock" before accessing "data.csv" and delete it when they are done. In either script, if "data.lock" exists, the script waits until it no longer exists.
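The lock-file mechanism itself can be sketched in a few lines. The helper names below (acquire_lock, release_lock) are illustrative, not part of the lockfile API; the key idea is that creating a file with O_CREAT | O_EXCL is atomic, so it doubles as a test-and-set:

```python
import os
import time

def acquire_lock(path, poll=0.01, timeout=5.0):
    """Spin until we can atomically create the lock file, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails with FileExistsError if the file
            # already exists, so creation is an atomic "test and set".
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll)

def release_lock(path):
    os.remove(path)

if __name__ == "__main__":
    acquire_lock("demo.lock")
    try:
        print("lock held:", os.path.exists("demo.lock"))
    finally:
        release_lock("demo.lock")
```

A real package adds stale-lock detection and per-process ownership on top of this, which is why using lockfile is preferable to rolling your own.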
Answered by Joshua Goldberg
Your simulation script reads and writes to the data.csv file. You can read and write concurrently if one script opens the file as write only and the other opens the file as read only.
With this in mind, I changed your simulation script for writing the file to the following:
from time import sleep

while True:
    with open('data.csv', 'a') as fp:
        fp.write(','.join(['0', '1']))
        fp.write('\n')
    sleep(0.010)
In Python, opening a file with 'a' means append, write-only. Using 'a+' appends with read and write access. You must make sure that the code writing the file only opens it as write-only, and your script that is reading the file must never attempt to write to it. Otherwise, you will need to implement another solution.
Now you should be able to read using your second script without the issue that you mention.
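Since the writer only ever appends, the reader can also pick up just the new rows by remembering the byte offset where its previous read stopped, which matches the "load the new data only" goal from the edit. A sketch, where the function name and the Name/DATE columns are illustrative assumptions:

```python
import io
import pandas as pd

def read_new_rows(path, offset):
    """Return (DataFrame of rows appended since `offset`, new offset)."""
    with open(path, 'r') as fp:
        fp.seek(offset)          # skip everything already processed
        chunk = fp.read()        # read only the newly appended text
        new_offset = fp.tell()   # remember where to resume next time
    if not chunk.strip():
        return pd.DataFrame(columns=['Name', 'DATE']), new_offset
    df = pd.read_csv(io.StringIO(chunk), header=None, names=['Name', 'DATE'])
    return df, new_offset
```

A polling loop would call read_new_rows repeatedly, carrying the returned offset forward, and feed each non-empty batch into the on-the-fly analysis.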

