Using pandas to efficiently read a large CSV file without crashing

Note: this page mirrors a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must keep the same license and attribute the original authors (not me). Original: http://stackoverflow.com/questions/45870220/


Using pandas to efficiently read in a large CSV file without crashing

python, pandas, csv, dataframe, jupyter-notebook

Asked by Developer

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4MB on my computer.

This is what I am writing in a Jupyter notebook:

import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')

The problem is that the kernel breaks or dies and asks me to restart, and it keeps repeating the same thing. There is no error at all. Can you please suggest an alternative way of solving this? It is as if my computer is not capable of running it.

This works, but it keeps overwriting:

chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()

Only the last chunk is kept; the others are discarded.

Answer by cs95

You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object that you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
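
(With chunksize set, iterator=True is optional; read_csv already returns a TextFileReader, which is an iterable of DataFrames that pd.concat can consume directly.)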


If you just want to process each chunk individually, use:

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', 
                         chunksize=chunksize, 
                         iterator=True):
    do_something_with_chunk(chunk)
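
For example, here is a minimal sketch of what do_something_with_chunk might look like; the function body and the rating filter are illustrative assumptions, not part of the original answer. The idea is to filter each chunk, accumulate the kept rows in a list, and concatenate once at the end:

import pandas as pd

def do_something_with_chunk(chunk):
    # Illustrative only: keep rows from this chunk with a rating of 4.0 or higher
    return chunk[chunk['rating'] >= 4.0]

chunksize = 20000
kept = []  # collect per-chunk results instead of overwriting them
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    kept.append(do_something_with_chunk(chunk))

high_ratings = pd.concat(kept, ignore_index=True)
print(high_ratings.shape)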

Answer by Yury Wallet

Try it like this: 1) load with dask, then 2) convert to pandas.

import pandas as pd
import dask.dataframe as dd
import time

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv')  # lazy: nothing is read yet
df_train = df_train.compute()                # materializes a pandas DataFrame
print("load train:", time.perf_counter() - t)