Read the last N lines of a CSV file in Python with numpy / pandas
Source: http://stackoverflow.com/questions/38704949/
Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors on StackOverflow.
Asked by Yuxiang Wang
Is there a quick way to read the last N lines of a CSV file in Python, using numpy or pandas?
I cannot use skip_header in numpy or skiprows in pandas, because the length of the file varies and I always need the last N rows. I know I can use pure Python to read line by line from the last row of the file, but that would be very slow. I can do that if I have to, but a more efficient way with numpy or pandas (which essentially use C) would be really appreciated.
Answer by hpaulj
With a small 10-line test file I tried two approaches: parse the whole file and select the last N lines, versus load all lines but parse only the last N:
In [1025]: timeit np.genfromtxt('stack38704949.txt',delimiter=',')[-5:]
1000 loops, best of 3: 741 μs per loop
In [1026]: %%timeit
      ...: with open('stack38704949.txt','rb') as f:
      ...:     lines = f.readlines()
      ...: np.genfromtxt(lines[-5:],delimiter=',')
1000 loops, best of 3: 378 μs per loop
This was tagged as a duplicate of "Efficiently Read last 'n' rows of CSV into DataFrame". The accepted answer there used
from collections import deque
and collected the last N lines in that structure. It also used StringIO to feed the lines to the parser, which is an unnecessary complication. genfromtxt takes input from anything that gives it lines, so a list of lines is just fine.
In [1031]: %%timeit
      ...: with open('stack38704949.txt','rb') as f:
      ...:     lines = deque(f,5)
      ...: np.genfromtxt(lines,delimiter=',')
1000 loops, best of 3: 382 μs per loop
Basically the same time as readlines and slice.
deque may have an advantage when the file is very large and it gets costly to hang onto all the lines. I don't think it saves any file reading time, though: lines still have to be read one by one.
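As a rough illustration of the memory point, here is a minimal sketch that wraps the deque idea in a helper. The function name tail_lines and the throwaway test data are assumptions made for this example, not part of the original answer.

```python
from collections import deque
import os
import tempfile


def tail_lines(path, n):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, "r") as f:
        # A deque with maxlen=n drops its oldest item once it is full,
        # so at most n lines are held in memory at any time. Every line
        # of the file is still read once.
        return list(deque(f, maxlen=n))


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i},{i * 10}\n")

last3 = tail_lines(tmp.name, 3)
os.remove(tmp.name)
print(last3)
```

The returned list of lines can then be handed to np.genfromtxt(last3, delimiter=','), in the same way the timing examples pass a list of lines.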
Timings for the row_count followed by skip_header approach are slower; it requires reading the file twice. skip_header still has to read lines.
In [1046]: %%timeit
      ...: with open('stack38704949.txt',"r") as f:
      ...:     reader = csv.reader(f,delimiter = ",")
      ...:     data = list(reader)
      ...:     row_count = len(data)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
The slowest run took 5.96 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 760 μs per loop
For purposes of counting lines we don't need to use csv.reader, though it doesn't appear to cost much extra time.
In [1048]: %%timeit
      ...: with open('stack38704949.txt',"r") as f:
      ...:     lines=f.readlines()
      ...:     row_count = len(lines)
      ...: np.genfromtxt('stack38704949.txt',skip_header=row_count-5,delimiter=',')
1000 loops, best of 3: 736 μs per loop
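All of the approaches timed above still read every line of the file. For very large files, a different technique (not one of the approaches timed in this answer) is to seek to the end of the file and read backwards in blocks until enough newlines have been seen. The following is only a sketch under that assumption; the name tail_bytes_lines and the block_size parameter are made up for illustration, and it assumes lines are separated by "\n" with no newlines embedded in quoted fields.

```python
import os
import tempfile


def tail_bytes_lines(path, n, block_size=4096):
    """Return the last n lines of a file as bytes, reading backwards in blocks."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep prepending blocks until more than n newlines are seen or the
        # start of the file is reached. The extra newline guarantees that
        # the first returned line is complete.
        while position > 0 and data.count(b"\n") <= n:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            data = f.read(read_size) + data
    return data.splitlines()[-n:]


# Small demonstration on a throwaway file (hypothetical data); a tiny
# block size is used so that several backward reads are exercised.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    for i in range(10):
        tmp.write(f"{i},{i * 10}\n".encode("ascii"))

last3 = tail_bytes_lines(tmp.name, 3, block_size=8)
os.remove(tmp.name)
print(last3)
```

The resulting lines can be decoded and passed to a parser. Whether this is actually faster than the deque approach depends on the file size and would need to be timed.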
Answer by Israel Unterman
Option 1
You can read the entire file with numpy.genfromtxt, get it as a numpy array, and take the last N rows:
a = np.genfromtxt('filename', delimiter=',')
lastN = a[-N:]
Option 2
You can do a similar thing with the usual file reading:
with open('filename') as f:
    lastN = list(f)[-N:]
but this time you will get the list of last N lines, as strings.
Option 3 - without reading the entire file to memory
We use a list of at most N items which, at each iteration, holds the last N lines read so far:
lines = []
N = 10
with open('csv01.txt') as f:
    for line in f:
        lines.append(line)
        # once more than N lines are held, drop the oldest one
        if len(lines) > N:
            lines.pop(0)
A real csv requires a minor change:
import csv
...
with ...
    for line in csv.reader(f):
        ...
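One possible way to flesh out that outline is sketched below. It is not necessarily what the answer's author had in mind: it swaps the list and pop(0) for a collections.deque, and the helper name tail_csv_rows and the sample data are assumptions for this example.

```python
import csv
import os
import tempfile
from collections import deque


def tail_csv_rows(path, n):
    """Return the last n parsed rows of a CSV file (each row a list of str)."""
    # newline="" is what the csv module documentation recommends for
    # files that are passed to csv.reader.
    with open(path, newline="") as f:
        return list(deque(csv.reader(f), maxlen=n))


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    for i in range(10):
        writer.writerow([i, f"value, {i}"])  # field containing a comma gets quoted

last2 = tail_csv_rows(tmp.name, 2)
os.remove(tmp.name)
print(last2)
```

Going through csv.reader means quoted fields that contain commas are split correctly, which a plain split on "," would not do.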
Answer by Jason Brown
Use the skiprows parameter of pandas read_csv(); the tougher part is finding the number of lines in the csv. Here's a possible solution:
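import csv
import pandas as pd

with open('filename',"r") as f:
    reader = csv.reader(f,delimiter = ",")
    data = list(reader)
    row_count = len(data)
# header=None: the skipped lines include any header row, so every remaining line is data
df = pd.read_csv('filename', skiprows = row_count - N, header=None)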
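Note that passing an integer to skiprows skips lines from the top of the file, including any header line, so the first remaining line is used as column names unless header=None is given. If the file has a header row that should be kept, one option is to skip only the unwanted data lines. The sketch below assumes a single header line and no newlines inside quoted fields; the function name read_last_rows and the sample data are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def read_last_rows(path, n):
    """Read the last n data rows of a CSV file, keeping its header row."""
    # Count physical lines; assumes no quoted field contains a newline.
    with open(path) as f:
        row_count = sum(1 for _ in f)
    # Line 0 is the header; skip data lines 1 .. row_count - n - 1.
    # If the file has n or fewer data rows, the range is empty and
    # nothing is skipped.
    return pd.read_csv(path, skiprows=range(1, row_count - n))


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n")
    for i in range(10):
        tmp.write(f"{i},{i * 10}\n")

df = read_last_rows(tmp.name, 3)
os.remove(tmp.name)
print(df)
```

As with the other row-counting approaches, this reads the file twice: once to count lines and once to parse.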