Python: reading only the first few rows into a pandas DataFrame

Question

Asked by beardc

Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally I only want the first, say, 20 lines to get a sample of it (and I would prefer not to load the full thing and take the head of it).

If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg. My current solution is to manually grab the first n lines with Python and pass them to pandas through StringIO:

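import pandas as pd
from io import StringIO          # Python 3 location of StringIO
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # readlines(n) treats n as a size hint, not a line count,
    # so islice is used to take exactly the first n lines (header included)
    head = ''.join(islice(f, n))

df = pd.read_csv(StringIO(head))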

It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?

Answer 1

Accepted answer by DSM

I think you can use the nrows parameter. From the docs:

nrows : int, default None

    Number of rows of file to read. Useful for reading pieces of large files

which seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):

In [1]: import pandas as pd

In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

In [3]: len(z)
Out[3]: 20

In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
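
For anyone who wants to try this without a large file at hand, here is a minimal sketch of the same nrows call against a small generated CSV. The file name, column names, and row count are made up for illustration and are not from the original post; only read_csv and its nrows keyword come from the answer above.

```python
import os
import tempfile

import pandas as pd

# Build a small stand-in CSV (hypothetical data) so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "big_file.csv")
pd.DataFrame(
    {"id": range(1000), "value": [i * 0.5 for i in range(1000)]}
).to_csv(csv_path, index=False)

n = 20
# nrows counts data rows; the header line is parsed in addition to them.
sample = pd.read_csv(csv_path, nrows=n)

print(len(sample))           # 20
print(list(sample.columns))  # ['id', 'value']
```

Note that nrows=20 yields 20 data rows plus the header, whereas grabbing the first 20 lines of the file by hand yields the header and 19 data rows.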
