Using pandas to efficiently read a large CSV file without crashing

Note: this page mirrors a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must keep the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/45870220/


Using pandas to efficiently read in a large CSV file without crashing

python, pandas, csv, dataframe, jupyter-notebook

Asked by Developer

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4MB on my computer.

This is what I am writing in a Jupyter notebook:

import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')

The problem is that the kernel breaks or dies and asks me to restart, and it keeps repeating the same thing. There is no error at all. Can you please suggest an alternative way of solving this? It is as if my computer is not capable of running it.

This works, but it keeps overwriting:

chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()

Only the last chunk is kept; the others are discarded.

Answer by cs95

You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object that you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
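
(With chunksize set, iterator=True is optional; read_csv already returns a TextFileReader, which is an iterable of DataFrames that pd.concat can consume directly.)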


If you just want to process each chunk individually, use:

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', 
                         chunksize=chunksize, 
                         iterator=True):
    do_something_with_chunk(chunk)
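
For example, here is a minimal sketch of what do_something_with_chunk might look like; the function body and the rating filter are illustrative assumptions, not part of the original answer. The idea is to filter each chunk, accumulate the kept rows in a list, and concatenate once at the end:

import pandas as pd

def do_something_with_chunk(chunk):
    # Illustrative only: keep rows from this chunk with a rating of 4.0 or higher
    return chunk[chunk['rating'] >= 4.0]

chunksize = 20000
kept = []  # collect per-chunk results instead of overwriting them
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    kept.append(do_something_with_chunk(chunk))

high_ratings = pd.concat(kept, ignore_index=True)
print(high_ratings.shape)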

Answer by Yury Wallet

Try it like this: 1) load with dask, then 2) convert to pandas.

import pandas as pd
import dask.dataframe as dd
import time

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv')  # lazy: nothing is read yet
df_train = df_train.compute()                # materializes a pandas DataFrame
print("load train:", time.perf_counter() - t)