How do I read a large CSV file with pandas in Python?
Disclaimer: This page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA terms and attribute it to the original authors (not me) on StackOverflow, citing the original link and author information.
Original link: http://stackoverflow.com/questions/25962114/
How do I read a large csv file with pandas?
Asked by Rajkumar Kumawat
I am trying to read a large CSV file (approx. 6 GB) with pandas and I am getting a memory error:
MemoryError Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
Any help on this?
Answered by unutbu
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
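As a small illustration of the same pattern (not from the original answer; the 'value' column and the filter are hypothetical), each chunk can be reduced before it accumulates in memory and the pieces combined at the end:

import pandas as pd

filtered_parts = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=10 ** 6):
    # keep only the rows of interest so the full 6 GB never sits in memory at once
    filtered_parts.append(chunk[chunk['value'] > 0])   # 'value' is a hypothetical column

result = pd.concat(filtered_parts, ignore_index=True)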
Answered by PlagTag
The above answer already addresses the question. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it, but it is missing a lot of pandas features.
Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.
Answered by Rajkumar Kumawat
I proceeded like this:
# read the file in chunks of one million rows
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'], index_col='slno',
                       header=None, parse_dates=['date'])

# aggregate each chunk and concatenate the per-chunk results
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
Answered by nunodsousa
You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.
Answered by Tyrion W
The functions read_csv and read_table are almost the same; the difference is that read_table defaults to a tab delimiter, so you must pass the "," delimiter explicitly when you use read_table on a CSV file.
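For example, these two calls read the same comma-delimited file (a small sketch with a made-up filename, not part of the original answer):

import pandas as pd

df1 = pd.read_csv('data.csv')               # comma is the default delimiter
df2 = pd.read_table('data.csv', sep=',')    # read_table defaults to tab, so sep must be given

The answerer's code below uses the iterator/get_chunk interface instead: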
import pandas as pd

def get_from_action_data(fname, chunk_size=100000):
    # iterator=True returns a reader that hands out chunk_size rows at a time
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")
    # stitch the chunks back together into a single DataFrame
    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
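A hypothetical call (the file name is made up; the "user_id" and "type" columns come from the answerer's own dataset):

df = get_from_action_data('actions.csv', chunk_size=100000)
print(df.shape)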
Answered by paulg
If you use pandas to read a large file into chunks and then yield them one at a time, here is what I have done:
import pandas as pd

def chunck_generator(filename, header=False, chunk_size=10 ** 5):
    # lazily yield one chunk (a DataFrame) at a time
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    chunk = chunck_generator(filename, header=False, chunk_size=chunk_size)
    for row in chunk:
        yield row

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    for chunk in generator:   # iterate instead of calling next() until it raises
        print(chunk)
Answered by jpp
Chunking shouldn't always be the first port of call for this problem.
1. Is the file large due to repeated non-numeric data or unwanted columns?
   If so, you can sometimes see massive memory savings by reading in columns as categories and selecting only the required columns via the usecols parameter of pd.read_csv (see the sketch after this list).
2. Does your workflow require slicing, manipulating, exporting?
   If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
3. If all else fails, read line by line via chunks.
   Chunk via pandas or via the csv library as a last resort.
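A minimal sketch of the first suggestion; the file and column names here are hypothetical:

import pandas as pd

df = pd.read_csv(
    'big.csv',
    usecols=['station', 'date', 'rainfall'],  # read only the columns that are actually needed
    dtype={'station': 'category'},            # repeated strings are stored once per distinct value
    parse_dates=['date'],
)
print(df.memory_usage(deep=True))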
Answered by Simbarashe Timothy Motsi
For large data I recommend you use the library "dask", e.g.:
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
You can read more in the dask documentation.
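Operations on a dask dataframe are lazy and only execute, chunk by chunk, when .compute() is called. A rough sketch (the file pattern and column name are hypothetical):

import dask.dataframe as dd

df = dd.read_csv('data-*.csv')                        # lazy: nothing is read yet
counts = df.groupby('account_id').size().compute()    # the chunked work happens here
print(counts)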
Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.
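Modin's documented usage is to swap the import so that existing pandas code keeps working; a sketch with a made-up filename:

import modin.pandas as pd   # drop-in replacement for the pandas namespace

df = pd.read_csv('big.csv') # the read is parallelised across the available cores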
Answered by citynorman
In addition to the answers above, for those who want to process a CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
import glob
import d6tstack.combine_csv

def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply, sep=',', chunksize=1e6)
# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'), apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename')  # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')  # slow but flexible
Answered by Humpe
You can read in the data as chunks and save each chunk as a pickle.
import pandas as pd
import pickle

in_path = ""         # path where the large file is
out_path = ""        # path to save the pickle files to
chunk_size = 400000  # size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
In the next step you read the pickles back in and append each pickle to your desired dataframe.
import glob
import pandas as pd

pickle_path = ""  # same path as out_path, i.e. where the pickle files are

data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]), ignore_index=True)
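Note that DataFrame.append has been deprecated and removed in recent pandas releases; an equivalent sketch using pd.concat (same hypothetical paths as above):

import glob
import pandas as pd

pickle_path = ""  # where the pickle files are
df = pd.concat((pd.read_pickle(name) for name in glob.glob(pickle_path + "/data_*.pkl")),
               ignore_index=True)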

