Reading a huge .csv file in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17444679/
Reading a huge .csv file
Asked by Charles Dillon
I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:
def getdata(filename, criteria):
    data = []
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data = []
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            if row[3] == "column header":
                data.append(row)
            elif len(data) < 2 and row[3] != criterion:
                pass
            elif row[3] == criterion:
                data.append(row)
            else:
                return data
The reason for the else clause in the getstuff function is that all the elements which fit the criterion are listed together in the csv file, so I leave the loop once I get past them to save time.
My questions are:
How can I manage to get this to work with the bigger files?
Is there any way I can make it faster?
My computer has 8 GB of RAM, running 64-bit Windows 7, and the processor is 3.40 GHz (not certain what other information you need).
Accepted answer by Martijn Pieters
You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # done when having read a consecutive series of rows
                return
I also simplified your filter test; the logic is the same but more concise.
Because you are only matching a single sequence of rows matching the criterion, you could also use:
import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # first matching row, plus any subsequent rows that match, then stop
        # reading altogether.
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return
You can now loop over getstuff() directly. Do the same in getdata():
def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
Now loop directly over getdata() in your code:
for row in getdata(somefilename, sequence_of_criteria):
    # process row
You now only hold one row in memory, instead of your thousands of lines per criterion.
yield makes a function a generator function, which means it won't do any work until you start looping over it.
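To illustrate that laziness, here is a minimal, self-contained sketch (my addition, not part of the original answer) showing that a generator function does nothing until it is iterated:

def lazy_numbers(n):
    print("generator started")   # runs only once iteration actually begins
    for i in range(n):
        yield i

gen = lazy_numbers(3)            # no output yet: the function body has not run
for value in gen:                # the first next() call prints "generator started"
    print(value)                 # then 0, 1 and 2 are printed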
Answered by Steve
I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(). And the genfromtxt() function is 3 times faster than numpy.loadtxt(). It seems that you need pandas for large data sets.
I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.
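As a rough illustration of how such a comparison could be timed (this is not the original poster's benchmark; the file name and column layout are assumptions), a sketch like this works:

import time

import numpy as np
import pandas as pd

FILENAME = "vibration_data.csv"  # hypothetical file of comma-separated numeric columns

def time_it(label, func):
    start = time.perf_counter()
    func()
    print("{}: {:.2f} s".format(label, time.perf_counter() - start))

# pandas uses a compiled C parser and is typically the fastest of the three
time_it("pandas.read_csv", lambda: pd.read_csv(FILENAME))
# the numpy text readers are generally much slower on large csv files
time_it("numpy.genfromtxt", lambda: np.genfromtxt(FILENAME, delimiter=",", skip_header=1))
time_it("numpy.loadtxt", lambda: np.loadtxt(FILENAME, delimiter=",", skiprows=1))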
Answered by mmann1123
Although Martijn's answer is probably the best, here is a more intuitive way to process large csv files for beginners. This allows you to process groups of rows, or chunks, at a time.
import pandas as pd

chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
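process() above is a placeholder. As a minimal sketch of what such a per-chunk function might look like (the file name, column name and filtering condition are my assumptions, not from the original answer), filtered chunks can be collected and concatenated at the end:

import pandas as pd

chunksize = 10 ** 5          # rows per chunk; tune to the memory you have available
filtered_parts = []

def process(chunk):
    # keep only the rows of interest, so just the filtered subset
    # of each chunk stays in memory
    filtered_parts.append(chunk[chunk["value"] > 0])

for chunk in pd.read_csv("huge_file.csv", chunksize=chunksize):
    process(chunk)

result = pd.concat(filtered_parts, ignore_index=True)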
Answered by Yury Wallet
What worked for me, and is super fast, is:
import pandas as pd
import dask.dataframe as dd
import time

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])
df_train = df_train.compute()
print("load train: ", time.perf_counter() - t)
Another working solution is:
import pandas as pd
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000
traintypes = {
    'col1': 'category',
    'col2': 'str'}

cols = list(traintypes.keys())

df_list = []  # list to hold the batch dataframes

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(), fit()

    # Alternatively, append the chunk to the list and merge all at the end
    df_list.append(df_chunk)

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk
Answered by Rishabh Agrahari
Here's another solution for Python 3:
import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] in ("column header", criterion):
            doSomething(row)
            count += 1
        elif count > 2:
            break
Here datareader is an iterator (the reader object returned by csv.reader), so rows are read lazily one at a time rather than being loaded into memory all at once.
Answered by ewalel
For anyone who lands on this question: using pandas with 'chunksize' and 'usecols' helped me read a huge zipped file faster than the other proposed options.
import pandas as pd

sample_cols_to_keep = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']

# First set up the dataframe iterator; the 'usecols' parameter filters the
# columns, and 'chunksize' sets the number of rows per chunk in the csv
# (you can change these parameters as you wish)
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip',
                      chunksize=20000, usecols=sample_cols_to_keep)

# this list will store the filtered dataframes for later concatenation
df_lst = []

# Iterate over the file based on the criteria and append to the list
for df_ in df_iter:
    tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
              # filter e.g. rows where the 'col_1' value is greater than zero
              .pipe(lambda x: x[x.col_1 > 0]))
    df_lst += [tmp_df.copy()]

# And finally combine the filtered df_lst into the final, larger 'df_final' dataframe
df_final = pd.concat(df_lst)
Answered by Mike T
If you are using pandas and have lots of RAM (enough to read the whole file into memory), try using pd.read_csv with low_memory=False, e.g.:
import pandas as pd
data = pd.read_csv('file.csv', low_memory=False)
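As a follow-up (my addition, not part of the original answer): low_memory=False tells pandas to infer column types over the whole file at once instead of chunk by chunk. If memory is tight, declaring the dtypes up front, as the chunked answer above does with its dtype mapping, is a common alternative; the column names and dtypes below are assumptions for illustration only.

import pandas as pd

# declaring dtypes avoids pandas having to guess them while reading,
# and keeps the resulting dataframe smaller in memory
dtypes = {"id": "int64", "label": "category", "value": "float64"}
data = pd.read_csv("file.csv", dtype=dtypes, usecols=list(dtypes))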