Original URL: http://stackoverflow.com/questions/15008970/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Way to read first few lines for pandas dataframe
Asked by beardc
Is there a built-in way to use read_csv to read only the first n lines of a file without knowing the length of the lines ahead of time? I have a large file that takes a long time to read, and occasionally I only want to use the first, say, 20 lines to get a sample of it (and would prefer not to load the full thing and take the head of it).
If I knew the total number of lines I could do something like footer_lines = total_lines - n and pass this to the skipfooter keyword arg. My current solution is to manually grab the first n lines with Python and pass them to pandas through StringIO:
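import pandas as pd
from io import StringIO  # Python 3 location; the original imported from the Python 2 StringIO module
from itertools import islice

n = 20
with open('big_file.csv', 'r') as f:
    # f.readlines(n) treats n as a size hint in bytes, not a line count,
    # so islice is used here to take exactly the first n lines
    head = ''.join(islice(f, n))
df = pd.read_csv(StringIO(head))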
It's not that bad, but is there a more concise, 'pandasic' (?) way to do it with keywords or something?
Accepted answer by DSM
I think you can use the nrows parameter. From the docs:
nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files
which seems to work. Using one of the standard large test files (988504479 bytes, 5344499 lines):
In [1]: import pandas as pd
In [2]: time z = pd.read_csv("P00000001-ALL.csv", nrows=20)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
In [3]: len(z)
Out[3]: 20
In [4]: time z = pd.read_csv("P00000001-ALL.csv")
CPU times: user 27.63 s, sys: 1.92 s, total: 29.55 s
Wall time: 30.23 s
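For readers without that test file, here is a minimal sketch of the same nrows usage on a small in-memory CSV. The data, the column names (id, value) and the variable names are made up for illustration and are not part of the original answer; only the nrows keyword of read_csv comes from the answer above.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file: a header line plus 1,000 data rows,
# built in memory so the example runs without any file on disk.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

n = 20

# nrows counts data rows; the header line is parsed in addition to these rows.
sample = pd.read_csv(io.StringIO(csv_text), nrows=n)

print(sample.shape)
print(sample.tail(3))
```

With a real file, the path would take the place of the StringIO object, as in the session above.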

