Streaming data for pandas df

Warning: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow

Original question: http://stackoverflow.com/questions/32594137/
Asked by Leb
I'm attempting to simulate the use of pandas to access a constantly changing file.
I have one file reading a csv file, adding a line to it then sleeping for a random time to simulate bulk input.
import pandas as pd
from time import sleep
import random

df2 = pd.DataFrame(data=[['test', 'trial']], index=None)

while True:
    df = pd.read_csv('data.csv', header=None)
    df = df.append(df2)  # append returns a new DataFrame; reassign or the row is lost
    df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))
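As an aside, `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0; a version-proof equivalent of the appending step above uses `pd.concat`:

```python
import pandas as pd

# pd.concat is the modern replacement for DataFrame.append;
# ignore_index renumbers the combined rows 0..n-1.
df = pd.DataFrame([['a', 'b']])
df2 = pd.DataFrame([['test', 'trial']])
df = pd.concat([df, df2], ignore_index=True)
print(df.shape)  # (2, 2)
```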
The second file is checking for change in data by outputting the shape of the dataframe:
import pandas as pd

while True:
    df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
    print(df.shape)
The problem is that while I'm usually getting the correct shape of the DF, there are certain times when it outputs (0x2).
i.e.:
...
(10x2)
(10x2)
...
(10x2)
(0x2)
(11x2)
(11x2)
...
This does occur at some, but not between each, change in shape (the file adding to the dataframe).
Knowing this happens when the first script has the file open to add data and the second script is unable to access it, hence (0x2), could this cause any data loss?
I cannot directly access the stream, only the output file. Or are there any other possible solutions?
Edit
The purpose of this is to load the new data only (I have code that does that) and do analysis "on the fly". Some of the analysis will include output/sec, graphing (similar to a stream plot), and a few other numerical calculations.
The biggest issue is that I have access to the csv file only, and I need to be able to analyze the data as it comes without loss or delay.
Answered by Joshua Goldberg
One of the scripts is reading the file while the other is trying to write to the file. Both scripts cannot access the file at the same time. Like Padraic Cunningham says in the comments you can implement a lock file to solve this problem.
There is a Python package that will do just that, called lockfile, with documentation here.
Here is your first script with the lockfile package implemented:
import pandas as pd
from time import sleep
import random
from lockfile import FileLock

df2 = pd.DataFrame(data=[['test', 'trial']], index=None)
lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None)
        df = df.append(df2)  # reassign; append does not modify in place
        df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))  # sleep outside the lock so the reader can run
Here is your second script with the lockfile package implemented:
import pandas as pd
from time import sleep
from lockfile import FileLock

lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
        print(df.shape)
    sleep(0.100)
I added a wait of 100ms so that I could slow down the output to the console.
These scripts create a file called "data.lock" before accessing "data.csv" and delete it when they are done. In either script, if "data.lock" exists, the script waits until it no longer exists.
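The lock-file mechanism itself can be sketched in a few lines. The helper names below (acquire_lock, release_lock) are illustrative, not part of the lockfile API; the key idea is that creating a file with O_CREAT | O_EXCL is atomic, so it doubles as a test-and-set:

```python
import os
import time

def acquire_lock(path, poll=0.01, timeout=5.0):
    """Spin until we can atomically create the lock file, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails with FileExistsError if the file
            # already exists, so creation is an atomic "test and set".
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll)

def release_lock(path):
    os.remove(path)

if __name__ == "__main__":
    acquire_lock("demo.lock")
    try:
        print("lock held:", os.path.exists("demo.lock"))
    finally:
        release_lock("demo.lock")
```

A real package adds stale-lock detection and per-process ownership on top of this, which is why using lockfile is preferable to rolling your own.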
Answered by Joshua Goldberg
Your simulation script reads and writes to the data.csv file. You can read and write concurrently if one script opens the file as write only and the other opens the file as read only.
With this in mind, I changed your simulation script for writing the file to the following:
from time import sleep

while True:
    with open('data.csv', 'a') as fp:
        fp.write(','.join(['0', '1']))
        fp.write('\n')
    sleep(0.010)
In Python, opening a file with 'a' means append, write-only. Using 'a+' appends with read and write access. You must make sure that the code writing the file only opens it as write-only, and your script that is reading the file must never attempt to write to it. Otherwise, you will need to implement another solution.
Now you should be able to read using your second script without the issue that you mention.
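Since the writer only ever appends, the reader can also pick up just the new rows by remembering the byte offset where its previous read stopped, which matches the "load the new data only" goal from the edit. A sketch, where the function name and the Name/DATE columns are illustrative assumptions:

```python
import io
import pandas as pd

def read_new_rows(path, offset):
    """Return (DataFrame of rows appended since `offset`, new offset)."""
    with open(path, 'r') as fp:
        fp.seek(offset)          # skip everything already processed
        chunk = fp.read()        # read only the newly appended text
        new_offset = fp.tell()   # remember where to resume next time
    if not chunk.strip():
        return pd.DataFrame(columns=['Name', 'DATE']), new_offset
    df = pd.read_csv(io.StringIO(chunk), header=None, names=['Name', 'DATE'])
    return df, new_offset
```

A polling loop would call read_new_rows repeatedly, carrying the returned offset forward, and feed each non-empty batch into the on-the-fly analysis.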

