pandas: Efficiently read the last 'n' rows of a CSV into a DataFrame
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/17108250/
Efficiently Read last 'n' rows of CSV into DataFrame
Asked by Nipun Batra
A few methods to do this:
- Read the entire CSV and then use df.tail
- Somehow reverse the file (what's the best way to do this for large files?) and then use the nrows argument to read
- Somehow find the number of rows in the CSV, then use skiprows and read the required number of rows.
- Maybe do a chunked read, discarding the initial chunks (though not sure how this would work)
Can it be done in some easier way? If not, which of these should be preferred, and why?
Possibly related:
- Efficiently finding the last line in a text file
- Reading parts of ~13000 row CSV file with pandas read_csv and nrows
Not directly related:
Answered by Andy Hayden
I don't think pandas offers a way to do this in read_csv.
Perhaps the neatest (in one pass) is to use collections.deque:
from collections import deque
from io import StringIO  # on Python 2 this was: from StringIO import StringIO

import pandas as pd

with open(fname, 'r') as f:
    q = deque(f, 2)  # replace 2 with n (number of lines to keep from the end)

In [12]: q
Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)
# these are the last two lines of my csv

In [13]: pd.read_csv(StringIO(''.join(q)), header=None)
Another option worth trying is to get the number of lines in a first pass and then read the file again, skipping that number of rows (minus n) using read_csv...
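The two-pass idea could look roughly like the sketch below. The function name read_last_rows and its header flag are made up for illustration, and the line count only matches the row count if the CSV has no blank lines and no quoted fields containing newlines.

```python
import pandas as pd


def read_last_rows(path, n, header=True):
    """Two-pass sketch: count lines, then skip all but the last n data rows.

    Hypothetical helper, not part of pandas. Assumes one CSV row per
    physical line (no blank lines, no newlines inside quoted fields).
    """
    # First pass: count physical lines without loading the file into memory.
    with open(path, "rb") as f:
        total_lines = sum(1 for _ in f)

    first_data_line = 1 if header else 0
    data_rows = total_lines - first_data_line
    to_skip = max(data_rows - n, 0)

    # Second pass: skip the unwanted data lines (0-indexed line numbers),
    # leaving the header line in place when there is one.
    skip = range(first_data_line, first_data_line + to_skip)
    return pd.read_csv(path, skiprows=skip, header=0 if header else None)
```

Unlike the deque approach, this reads the file twice, but the rows that are skipped are never handed to the parser as text held in memory.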
Answered by chepner
Files are simply streams of bytes. Lines do not exist as separate entities; they are an artifact of treating certain bytes as newline characters. As such, you must read from the beginning of the file to identify lines in order.
If the file doesn't change (often) and this is an operation you need to perform often (say, with different values of n), you can store the byte offsets of the newline characters in a second file. You can use this much smaller file and the seek command to quickly jump to a given line in the first file and read from there.
(Some operating systems provide record-oriented files that have more complex internal structure than the common flat file. The above does not apply to them.)
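A minimal sketch of the offset-index idea, under some assumptions: the helper names build_line_index and read_last_rows_indexed are hypothetical, the index stores the byte offset at which each line starts (the newline offset plus one) as a fixed-width 64-bit integer, and the first line of the CSV is taken to be a header. The index must be rebuilt whenever the CSV changes.

```python
import io
import os
import struct

import pandas as pd

# One little-endian unsigned 64-bit integer per line (an arbitrary choice).
OFFSET = struct.Struct("<Q")


def build_line_index(csv_path, index_path):
    """Write the starting byte offset of every line to a side file."""
    position = 0
    with open(csv_path, "rb") as src, open(index_path, "wb") as out:
        for line in src:
            out.write(OFFSET.pack(position))
            position += len(line)


def read_last_rows_indexed(csv_path, index_path, n):
    """Seek directly to the first of the last n data rows and parse from there."""
    total_lines = os.path.getsize(index_path) // OFFSET.size
    data_rows = total_lines - 1  # line 0 is assumed to be the header
    start_line = 1 + max(data_rows - n, 0)

    with open(csv_path, "rb") as src:
        header = src.readline()
        tail = b""
        if start_line < total_lines:
            # Fixed-width entries let us jump to the right index record.
            with open(index_path, "rb") as idx:
                idx.seek(start_line * OFFSET.size)
                (start_offset,) = OFFSET.unpack(idx.read(OFFSET.size))
            src.seek(start_offset)
            tail = src.read()
    return pd.read_csv(io.BytesIO(header + tail))
```

Building the index still costs one full read of the file; the saving comes from later calls with different values of n, each of which reads only the header and the requested tail.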
Answered by Parikshit Bhinde
Here's a handy way to do it. It works well for what I need:
import io

import pandas as pd
import tailer as tl

# Read the last 15 lines; change 15 to any value.
with open(fname) as file:
    lastLines = tl.tail(file, 15)

# tail() returns the lines without trailing newlines, so join them back.
df = pd.read_csv(io.StringIO('\n'.join(lastLines)), header=None)
Answered by Yi Wu
Since you are considering reversing the file, I assume it's OK to create new files.
- Create a new file with the last n lines (replace N with the number of lines you want):
tail -n N original.csv > temp.csv
- Add the header line to the temp file and generate the new file:
head -1 original.csv | cat - temp.csv > newfile.csv && rm -f temp.csv
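If creating a temporary file is not desirable, the same header-plus-tail result can probably be had in Python by combining the first line with a bounded deque. This is only a sketch: tail_with_header is a hypothetical helper name, and it assumes the first line is the header and that no quoted field spans multiple lines.

```python
from collections import deque
from io import StringIO

import pandas as pd


def tail_with_header(path, n):
    """Read the header line plus the last n data lines in a single pass."""
    with open(path, "r") as f:
        header = f.readline()            # first line, assumed to be the header
        last_lines = deque(f, maxlen=n)  # keeps only the final n remaining lines
    return pd.read_csv(StringIO(header + "".join(last_lines)))
```

Only the header and at most n lines are held in memory at once, although the whole file is still scanned from start to end.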
Answered by HelloWorld123
You can use the nrows= argument in pd.read_csv. The code below will give you the first 100 rows.
pd.read_csv('file.csv', nrows=100)

