Read the last N lines of a CSV file in Python with numpy / pandas

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same CC BY-SA license, link the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/38704949/

Date: 2020-09-14 01:43:19  Source: igfitidea


Tags: python, csv, pandas, numpy

Asked by Yuxiang Wang

Is there a quick way to read the last N lines of a CSV file in Python, using numpy or pandas?


  1. I cannot use skip_header in numpy or skiprows in pandas because the length of the file varies, and I always need the last N rows.

  2. I know I can use pure Python to read line by line from the last row of the file, but that would be very slow. I can do that if I have to, but a more efficient way with numpy or pandas (which are essentially C underneath) would be really appreciated.


Answered by hpaulj

With a small 10-line test file I tried two approaches: parse the whole file and select the last N lines, versus load all lines but parse only the last N:


In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 μs per loop

In [1026]: %%timeit 
      ...: with open('stack38704949.txt','rb') as f:
      ...:      lines = f.readlines()
      ...: np.genfromtxt(lines[-5:],delimiter=',')

1000 loops, best of 3: 378 μs per loop

This was tagged as a duplicate of Efficiently Read last 'n' rows of CSV into DataFrame. The accepted answer there used


from collections import deque

and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes input from anything that gives it lines, so a list of lines is just fine.

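As a quick check of that claim (a minimal sketch with inline data, not from the original answer), genfromtxt happily parses a plain list of strings:

```python
import numpy as np

# a stand-in for lines read from a file
lines = ["1,2,3", "4,5,6", "7,8,9"]

# genfromtxt accepts any iterable of lines, not just a filename
arr = np.genfromtxt(lines, delimiter=',')
print(arr.shape)  # (3, 3)
```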

In [1031]: %%timeit 
      ...: with open('stack38704949.txt','rb') as f:
      ...:      lines = deque(f,5)
      ...: np.genfromtxt(lines,delimiter=',') 

1000 loops, best of 3: 382 μs per loop

Basically the same time as readlines and slice.


deque may have an advantage when the file is very large and it gets costly to hold on to all the lines. I don't think it saves any file reading time, though; lines still have to be read one by one.

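To illustrate the bounded-memory behavior (a sketch using an in-memory file as a stand-in; the row count is made up), a deque with maxlen only ever holds the N most recent lines, however long the file is:

```python
import io
from collections import deque

# simulate a large file of 10,000 CSV rows
f = io.StringIO("".join(f"{i},{i * 2}\n" for i in range(10000)))

# the deque discards older lines as new ones arrive,
# so at most 5 lines are in memory at once
last5 = deque(f, maxlen=5)
print(list(last5))
```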

Timings for the row_count followed by skip_header approach are slower; it requires reading the file twice. skip_header still has to read the lines.


In [1046]: %%timeit 
      ...: with open('stack38704949.txt',"r") as f:
      ...:     reader = csv.reader(f,delimiter = ",")
      ...:     data = list(reader)
      ...:     row_count = len(data)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')

The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 μs per loop

For purposes of counting lines we don't need to use csv.reader, though it doesn't appear to cost much extra time.


In [1048]: %%timeit 
      ...: with open('stack38704949.txt',"r") as f:
      ...:    lines = f.readlines()
      ...:    row_count = len(lines)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')

1000 loops, best of 3: 736 μs per loop

Answered by Israel Unterman

Option 1


You can read the entire file with numpy.genfromtxt, get it as a numpy array, and take the last N rows:


a = np.genfromtxt('filename', delimiter=',')
lastN = a[-N:]

Option 2


You can do a similar thing with the usual file reading:


with open('filename') as f:
    lastN = list(f)[-N:]

but this time you will get the list of last N lines, as strings.

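Those strings still need parsing; one way (a sketch with inline data, assuming comma-separated fields) is to hand the tail to csv.reader:

```python
import csv
import io

# stand-in for open('filename')
f = io.StringIO("a,1\nb,2\nc,3\nd,4\n")

N = 2
lastN = list(f)[-N:]            # last N lines, as strings
rows = list(csv.reader(lastN))  # parsed into lists of fields
print(rows)  # [['c', '3'], ['d', '4']]
```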

Option 3 - without reading the entire file to memory


We use a list of at most N items that, on each iteration, holds the last N lines seen so far:


lines = []
N = 10
with open('csv01.txt') as f:
    for line in f:
        lines.append(line)
        if len(lines) > N:
            lines.pop(0)

A real csv requires a minor change:


import csv

lines = []
N = 10
with open('csv01.txt') as f:
    for row in csv.reader(f):
        lines.append(row)
        if len(lines) > N:
            lines.pop(0)
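The same sliding window can be written more compactly with collections.deque, which drops old items automatically once maxlen is reached (a sketch with an in-memory file, not part of the original answer):

```python
import csv
import io
from collections import deque

N = 10
# stand-in for open('csv01.txt')
f = io.StringIO("".join(f"row{i},{i}\n" for i in range(100)))

# deque keeps only the N most recent parsed rows
last_rows = deque(csv.reader(f), maxlen=N)
print(len(last_rows))  # 10
```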

Answered by Jason Brown

Use the skiprows parameter of pandas read_csv(); the tougher part is finding the number of lines in the csv. Here's a possible solution:


import csv
import pandas as pd

N = 5  # number of rows to keep

with open('filename',"r") as f:
    reader = csv.reader(f,delimiter = ",")
    data = list(reader)
    row_count = len(data)

df = pd.read_csv('filename', skiprows = row_count - N)
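A self-contained sketch of this approach (using an in-memory, headerless CSV; the data is made up). Note that with an integer skiprows the first line is skipped along with the rest, so header=None is passed here to stop pandas from treating the first surviving row as a header:

```python
import io
import pandas as pd

data = "".join(f"{i},{i * 10}\n" for i in range(100))  # headerless CSV

# count rows with a cheap pass over the file
row_count = sum(1 for _ in io.StringIO(data))

N = 5
df = pd.read_csv(io.StringIO(data), skiprows=row_count - N, header=None)
print(len(df))  # 5
```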