open selected rows with pandas using "chunksize" and/or "iterator"
Disclaimer: this page is a translation of a popular StackOverflow question and answer, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/39053628/
Asked by Stefano Fedele
I have a large csv file and I open it with pd.read_csv as follows:
df = pd.read_csv('path/fileName.csv', sep=' ', header=None)
As the file is really large, I would like to be able to open it in blocks of rows:
from 0 to 511
from 512 to 1023
from 1024 to 1535
...
from 512*n to 512*(n+1) - 1
where n = 1, 2, 3 ...
If I add chunksize = 512 to the arguments of read_csv
df = pd.read_csv('path/fileName.csv', sep=' ', header=None, chunksize=512)
and I type
df.get_chunk(5)
Then I am able to read rows 0 to 5, or I can split the file into chunks of 512 rows using a for loop:
data = []
for chunk in df:
    data = data + [chunk]
But this is quite useless, as the whole file still has to be read, which takes time. How can I read only rows from 512*n to 512*(n+1)?
Looking around, I often saw that "chunksize" is used together with "iterator", as follows:
df = pd.read_csv('path/fileName.csv', sep=' ', header=None, iterator=True, chunksize=512)
But after many attempts I still don't understand what benefit this boolean parameter provides. Could you explain it to me, please?
Answered by MaxU
How can I read only rows from 512*n to 512*(n+1)?
df = pd.read_csv(fn, header=None, skiprows=512*n, nrows=512)
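A minimal, runnable sketch of that one-liner, assuming a small throwaway file and a block size of 5 in place of 512 (the file name, column values, and sizes here are made up for illustration):

```python
import pandas as pd

# Build a small sample file: 20 rows, space-separated, no header.
fn = 'sample_blocks.csv'
pd.DataFrame({'x': range(20), 'y': range(100, 120)}).to_csv(
    fn, sep=' ', header=False, index=False)

block = 5   # stand-in for 512
n = 2       # read the third block: rows 10..14

# skiprows jumps over the first block*n rows; nrows stops after one block,
# so only the requested rows are parsed.
df = pd.read_csv(fn, sep=' ', header=None, skiprows=block * n, nrows=block)
print(df[0].tolist())  # [10, 11, 12, 13, 14]
```

Because header=None, the skipped rows are pure data rows; with a header row you would keep row 0 or skip a range such as range(1, block * n + 1) instead.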
You can do it this way (and it's pretty useful):
for chunk in pd.read_csv(fn, sep=' ', header=None, chunksize=512):
    # process your chunk here
Demo:
In [61]: fn = 'd:/temp/a.csv'
In [62]: pd.DataFrame(np.random.randn(30, 3), columns=list('abc')).to_csv(fn, index=False)
In [63]: for chunk in pd.read_csv(fn, chunksize=10):
....: print(chunk)
....:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648
3 -0.009388 -1.549381 0.913128
4 -0.256654 -0.073549 -0.171606
5 0.849934 0.305337 2.360101
6 -1.472184 0.641512 -1.301492
7 -2.302152 0.417787 0.485958
8 0.492314 0.603309 0.890524
9 -0.730400 0.835873 1.313114
a b c
0 1.393865 -1.115267 1.194747
1 3.038719 -0.343875 -1.410834
2 -1.510598 0.664154 -0.996762
3 -0.528211 1.269363 0.506728
4 0.043785 -0.786499 -1.073502
5 1.096647 -1.127002 0.918172
6 -0.792251 -0.652996 -1.000921
7 1.582166 -0.819374 0.247077
8 -1.022418 -0.577469 0.097406
9 -0.274233 -0.244890 -0.352108
a b c
0 -0.317418 0.774854 -0.203939
1 0.205443 0.820302 -2.637387
2 0.332696 -0.655431 -0.089120
3 -0.884916 0.274854 1.074991
4 0.412295 -1.561943 -0.850376
5 -1.933529 -1.346236 -1.789500
6 1.652446 -0.800644 -0.126594
7 0.520916 -0.825257 -0.475727
8 -2.261692 2.827894 -0.439698
9 -0.424714 1.862145 1.103926
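The chunk loop demonstrated above pays off when you want a streaming computation rather than a single slice. A sketch of that pattern, where the file name and column are made up for illustration:

```python
import pandas as pd

# Throwaway file: one column 'a' with the values 0..99.
fn = 'stream_demo.csv'
pd.DataFrame({'a': range(100)}).to_csv(fn, index=False)

# Stream the file 10 rows at a time and accumulate a running total,
# so only one chunk is ever held in memory.
total = 0
for chunk in pd.read_csv(fn, chunksize=10):
    total += chunk['a'].sum()

print(total)  # 4950 == sum(range(100))
```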
In which case can "iterator" be useful?
When using chunksize, all chunks will have the same length. Using the iterator parameter, you can define how much data (get_chunk(nrows)) you want to read in each iteration:
In [66]: reader = pd.read_csv(fn, iterator=True)
let's read the first 3 rows:
In [67]: reader.get_chunk(3)
Out[67]:
a b c
0 2.229657 -1.040086 1.295774
1 0.358098 -1.080557 -0.396338
2 0.731741 -0.690453 0.126648
now we'll read the next 5 rows:
In [68]: reader.get_chunk(5)
Out[68]:
a b c
0 -0.009388 -1.549381 0.913128
1 -0.256654 -0.073549 -0.171606
2 0.849934 0.305337 2.360101
3 -1.472184 0.641512 -1.301492
4 -2.302152 0.417787 0.485958
next 7 rows:
In [69]: reader.get_chunk(7)
Out[69]:
a b c
0 0.492314 0.603309 0.890524
1 -0.730400 0.835873 1.313114
2 1.393865 -1.115267 1.194747
3 3.038719 -0.343875 -1.410834
4 -1.510598 0.664154 -0.996762
5 -0.528211 1.269363 0.506728
6 0.043785 -0.786499 -1.073502
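Putting the two ideas together: with iterator=True you can also reach the asker's n-th fixed-size block by reading and discarding everything before it with one oversized get_chunk. A sketch, using a toy block size of 4 and a made-up file:

```python
import pandas as pd

# Throwaway file: one column 'v' with the values 0..19.
fn = 'iterator_demo.csv'
pd.DataFrame({'v': range(20)}).to_csv(fn, index=False)

block = 4  # stand-in for 512
n = 3      # target block: rows 12..15

reader = pd.read_csv(fn, iterator=True)
if n > 0:
    reader.get_chunk(block * n)   # read and discard the rows before the block
part = reader.get_chunk(block)    # rows block*n .. block*(n+1) - 1
print(part['v'].tolist())  # [12, 13, 14, 15]
```

Note that this still parses the discarded rows, so for a single block the skiprows/nrows call shown earlier is the cheaper option; the iterator shines when you read several variable-sized pieces in sequence.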