Reading a huge .csv file in Python
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/17444679/
Reading a huge .csv file
Asked by Charles Dillon
I'm currently trying to read data from .csv files in Python 2.7 with up to 1 million rows and 200 columns (files range from 100 MB to 1.6 GB). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:
def getdata(filename, criteria):
    data = []
    for criterion in criteria:
        data.append(getstuff(filename, criterion))
    return data

def getstuff(filename, criterion):
    import csv
    data = []
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        for row in datareader:
            if row[3] == "column header":
                data.append(row)
            elif len(data) < 2 and row[3] != criterion:
                pass
            elif row[3] == criterion:
                data.append(row)
            else:
                return data
The reason for the else clause in the getstuff function is that all the elements which fit the criterion are listed together in the csv file, so I leave the loop once I get past them to save time.
My questions are:
How can I manage to get this to work with the bigger files?
Is there any way I can make it faster?
My computer has 8 GB of RAM, running 64-bit Windows 7, and the processor is 3.40 GHz (not certain what other information you need).
Accepted answer by Martijn Pieters
You are reading all rows into a list, then processing that list. Don't do that.
Process your rows as you produce them. If you need to filter the data first, use a generator function:
import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        count = 0
        for row in datareader:
            if row[3] == criterion:
                yield row
                count += 1
            elif count:
                # done when having read a consecutive series of rows
                return
I also simplified your filter test; the logic is the same but more concise.
Because you are only matching a single sequence of rows matching the criterion, you could also use:
import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # first matching row, plus any subsequent rows that match, then stop
        # reading altogether.
        # Python 2: use `for row in takewhile(...): yield row`
        # instead of `yield from takewhile(...)`.
        yield from takewhile(
            lambda r: r[3] == criterion,
            dropwhile(lambda r: r[3] != criterion, datareader))
        return
You can now loop over getstuff() directly. Do the same in getdata():
def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
Now loop directly over getdata() in your code:
for row in getdata(somefilename, sequence_of_criteria):
    # process row
You now only hold one row in memory, instead of your thousands of lines per criterion.
yield makes a function a generator function, which means it won't do any work until you start looping over it.
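To illustrate that laziness, here is a minimal, self-contained sketch (my addition, not part of the original answer) showing that a generator function does nothing until it is iterated:

def lazy_numbers(n):
    print("generator started")   # runs only once iteration actually begins
    for i in range(n):
        yield i

gen = lazy_numbers(3)            # no output yet: the function body has not run
for value in gen:                # the first next() call prints "generator started"
    print(value)                 # then 0, 1 and 2 are printed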
Answered by Steve
I do a fair amount of vibration analysis and look at large data sets (tens and hundreds of millions of points). My testing showed the pandas.read_csv() function to be 20 times faster than numpy.genfromtxt(). And the genfromtxt() function is 3 times faster than numpy.loadtxt(). It seems that you need pandas for large data sets.
I posted the code and data sets I used in this testing on a blog discussing MATLAB vs Python for vibration analysis.
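As a rough illustration of how such a comparison could be timed (this is not the original poster's benchmark; the file name and column layout are assumptions), a sketch like this works:

import time

import numpy as np
import pandas as pd

FILENAME = "vibration_data.csv"  # hypothetical file of comma-separated numeric columns

def time_it(label, func):
    start = time.perf_counter()
    func()
    print("{}: {:.2f} s".format(label, time.perf_counter() - start))

# pandas uses a compiled C parser and is typically the fastest of the three
time_it("pandas.read_csv", lambda: pd.read_csv(FILENAME))
# the numpy text readers are generally much slower on large csv files
time_it("numpy.genfromtxt", lambda: np.genfromtxt(FILENAME, delimiter=",", skip_header=1))
time_it("numpy.loadtxt", lambda: np.loadtxt(FILENAME, delimiter=",", skiprows=1))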
Answered by mmann1123
Although Martijn's answer is probably the best, here is a more intuitive way to process large csv files for beginners. This allows you to process groups of rows, or chunks, at a time.
import pandas as pd

chunksize = 10 ** 8
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
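process() above is a placeholder. As a minimal sketch of what such a per-chunk function might look like (the file name, column name and filtering condition are my assumptions, not from the original answer), filtered chunks can be collected and concatenated at the end:

import pandas as pd

chunksize = 10 ** 5          # rows per chunk; tune to the memory you have available
filtered_parts = []

def process(chunk):
    # keep only the rows of interest, so just the filtered subset
    # of each chunk stays in memory
    filtered_parts.append(chunk[chunk["value"] > 0])

for chunk in pd.read_csv("huge_file.csv", chunksize=chunksize):
    process(chunk)

result = pd.concat(filtered_parts, ignore_index=True)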
Answered by Yury Wallet
What worked for me, and is super fast, is:
import pandas as pd
import dask.dataframe as dd
import time

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv', usecols=['col1', 'col2'])
df_train = df_train.compute()
print("load train: ", time.perf_counter() - t)
Another working solution is:
import pandas as pd
from tqdm import tqdm

PATH = '../data/train.csv'
chunksize = 500000
traintypes = {
    'col1': 'category',
    'col2': 'str'}

cols = list(traintypes.keys())

df_list = []  # list to hold the batch dataframes

for df_chunk in tqdm(pd.read_csv(PATH, usecols=cols, dtype=traintypes, chunksize=chunksize)):
    # Can process each chunk of dataframe here
    # clean_data(), feature_engineer(), fit()

    # Alternatively, append the chunk to the list and merge all at the end
    df_list.append(df_chunk)

# Merge all dataframes into one dataframe
X = pd.concat(df_list)

# Delete the dataframe list to release memory
del df_list
del df_chunk
Answered by Rishabh Agrahari
Here's another solution for Python 3:
import csv

with open(filename, "r") as csvfile:
    datareader = csv.reader(csvfile)
    count = 0
    for row in datareader:
        if row[3] in ("column header", criterion):
            doSomething(row)
            count += 1
        elif count > 2:
            break
Here datareader is an iterator (the reader object returned by csv.reader), so rows are read lazily one at a time rather than being loaded into memory all at once.
Answered by ewalel
For anyone who lands on this question: using pandas with 'chunksize' and 'usecols' helped me read a huge zipped file faster than the other proposed options.
import pandas as pd

sample_cols_to_keep = ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']

# First set up the dataframe iterator; the 'usecols' parameter filters the
# columns, and 'chunksize' sets the number of rows per chunk in the csv
# (you can change these parameters as you wish)
df_iter = pd.read_csv('../data/huge_csv_file.csv.gz', compression='gzip',
                      chunksize=20000, usecols=sample_cols_to_keep)

# this list will store the filtered dataframes for later concatenation
df_lst = []

# Iterate over the file based on the criteria and append to the list
for df_ in df_iter:
    tmp_df = (df_.rename(columns={col: col.lower() for col in df_.columns})
              # filter e.g. rows where the 'col_1' value is greater than zero
              .pipe(lambda x: x[x.col_1 > 0]))
    df_lst += [tmp_df.copy()]

# And finally combine the filtered df_lst into the final, larger 'df_final' dataframe
df_final = pd.concat(df_lst)
Answered by Mike T
If you are using pandas and have lots of RAM (enough to read the whole file into memory), try using pd.read_csv with low_memory=False, e.g.:
import pandas as pd
data = pd.read_csv('file.csv', low_memory=False)
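As a follow-up (my addition, not part of the original answer): low_memory=False tells pandas to infer column types over the whole file at once instead of chunk by chunk. If memory is tight, declaring the dtypes up front, as the chunked answer above does with its dtype mapping, is a common alternative; the column names and dtypes below are assumptions for illustration only.

import pandas as pd

# declaring dtypes avoids pandas having to guess them while reading,
# and keeps the resulting dataframe smaller in memory
dtypes = {"id": "int64", "label": "category", "value": "float64"}
data = pd.read_csv("file.csv", dtype=dtypes, usecols=list(dtypes))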