pandas: efficiently read the last "n" rows of a CSV into a DataFrame

Question

Asked by Nipun Batra

A few methods to do this:

1. Read the entire CSV and then use df.tail.
2. Somehow reverse the file (what's the best way to do this for large files?) and then use the nrows argument to read.
3. Somehow find the number of rows in the CSV, then use skiprows and read the required number of rows.
4. Maybe do a chunked read, discarding the initial chunks (though I'm not sure how this would work).

Can it be done in some easier way? If not, which of these should be preferred, and why?

Possibly related:

Not directly related:

How to get the last n row of pandas dataframe?

Answer 1

Answered by Andy Hayden

I don't think pandas offers a way to do this in read_csv.

Perhaps the neatest (in one pass) is to use collections.deque:

from collections import deque
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

import pandas as pd

with open(fname, 'r') as f:
    q = deque(f, 2)  # replace 2 with n (the number of lines to keep from the end)

In [12]: q
Out[12]: deque(['7,8,9\n', '10,11,12'], maxlen=2)
         # these are the last two lines of my csv

In [13]: pd.read_csv(StringIO(''.join(q)), header=None)
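
If the CSV has a header row, the same deque idea can keep the column names. The following is only a sketch under that assumption; `tail_csv` is a hypothetical helper name, and it assumes the first line of the file is the header and that no quoted field contains an embedded newline.

```python
from collections import deque
from io import StringIO

import pandas as pd


def tail_csv(path, n):
    """Read the last n data rows of a CSV whose first line is a header.

    Hypothetical helper: assumes one record per physical line.
    """
    with open(path, "r", newline="") as f:
        header = f.readline()             # consume the header line first
        last_lines = deque(f, maxlen=n)   # single pass; keeps only the final n lines
    # Re-attach the header so pandas can name the columns.
    return pd.read_csv(StringIO(header + "".join(last_lines)))
```

Calling `tail_csv(fname, 2)` would then return a two-row DataFrame with the original column names instead of integer column labels.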

Another option worth trying is to count the lines in a first pass, then read the file again with read_csv, skipping that number of rows (minus n).
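
A rough sketch of this two-pass idea, not a tuned implementation: `read_last_rows` is a hypothetical name, and the code assumes the first line is a header, that the file has no blank lines, and that no quoted field spans several lines (otherwise the physical line count and the record count disagree).

```python
import pandas as pd


def read_last_rows(path, n):
    """Two-pass read: count lines, then skip all but the last n data rows.

    Hypothetical helper: assumes line 0 is the header and one record per line.
    """
    with open(path, "rb") as f:
        total_lines = sum(1 for _ in f)   # first pass: count physical lines

    # Data lines are 1 .. total_lines - 1; keep only the last n of them.
    first_kept = max(1, total_lines - n)
    # Second pass: skip data lines 1 .. first_kept - 1, but keep the header.
    return pd.read_csv(path, skiprows=range(1, first_kept))
```

Unlike the deque approach, this never holds the raw tail in memory as strings, at the cost of reading the file twice.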

Answer 2

Answered by chepner

Files are simply streams of bytes. Lines do not exist as separate entities; they are an artifact of treating certain bytes as newline characters. As such, you must read from the beginning of the file to identify lines in order.

If the file doesn't change (often) and this is an operation you need to perform often (say, with different values of n), you can store the byte offsets of the newline characters in a second file. You can then use this much smaller file and the seek command to jump quickly to a given line in the first file and read from there.

(Some operating systems provide record-oriented files that have more complex internal structure than the common flat file. The above does not apply to them.)
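
The offset-index idea described above could look roughly like the sketch below. The function names and the index format are assumptions made for illustration: the index stores the byte offset at which each line starts (a small variation on storing the newline positions themselves) as unsigned 64-bit integers, and the index must be rebuilt whenever the CSV changes.

```python
import array


def build_line_index(csv_path, index_path):
    """Write the starting byte offset of every line in csv_path to index_path."""
    offsets = array.array("Q")            # unsigned 64-bit integers
    pos = 0
    with open(csv_path, "rb") as f:
        for line in f:                    # binary mode, so len(line) is in bytes
            offsets.append(pos)
            pos += len(line)
    with open(index_path, "wb") as idx:
        offsets.tofile(idx)


def read_tail_bytes(csv_path, index_path, n):
    """Return the raw bytes of the last n lines, using the saved offsets."""
    offsets = array.array("Q")
    with open(index_path, "rb") as idx:
        offsets.frombytes(idx.read())
    if n <= 0 or len(offsets) == 0:
        return b""
    start = offsets[max(0, len(offsets) - n)]
    with open(csv_path, "rb") as f:
        f.seek(start)                     # jump straight to the first wanted line
        return f.read()
```

The returned bytes can be wrapped in `io.BytesIO` and passed to `pd.read_csv(..., header=None)`. Building the index still requires one full pass over the file, so this only pays off when the tail is read repeatedly.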

Answer 3

Answered by Parikshit Bhinde

Here's a handy way to do it. It works well for my use case:

import io

import pandas as pd
import tailer as tl

with open(fname) as file:
    last_lines = tl.tail(file, 15)  # read the last 15 lines; change 15 to any value

df = pd.read_csv(io.StringIO('\n'.join(last_lines)), header=None)

Answer 4

Answered by Yi Wu

Since you are considering reversing the file, I assume it's OK to create new files.

Create a new file containing the last n lines (replace N with the number of lines you want): tail -n N original.csv > temp.csv
Prepend the header line to the temp file to generate the new file: head -1 original.csv | cat - temp.csv > newfile.csv && rm -f temp.csv

Answer 5

Answered by HelloWorld123

You can use the nrows= argument of pd.read_csv. The code below gives you the first 100 rows.

pd.read_csv('file.csv', nrows=100)
