pandas: How can I speed up reading multiple files and putting the data into a dataframe?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/42157944/


How can I speed up reading multiple files and putting the data into a dataframe?

python, regex, performance, parsing, pandas

Asked by bluprince13

I have a number of text files, say 50, that I need to read into a massive dataframe. At the moment, I am using the following steps.


  1. Read every file and check what the labels are. The information I need is often contained in the first few lines. The same labels just repeat for the rest of the file, with different types of data listed against them each time.
  2. Create a dataframe with those labels.
  3. Read the file again and fill the dataframe with values.
  4. Concatenate that dataframe with a master dataframe.

This works pretty well for files around 100 KB in size (a few minutes each), but for 50 MB files it takes hours, which is not practical.


How can I optimise my code? In particular -


  1. How can I identify which functions are taking the most time and need to be optimised? Is it the reading of the file? Is it the writing to the dataframe? Where is my program spending its time?
  2. Should I consider multithreading or multiprocessing?
  3. Can I improve the algorithm?
    • Perhaps read the entire file in one go into a list, rather than line by line,
    • Parse data in chunks/entire file, rather than line by line,
    • Assign data to the dataframe in chunks/one go, rather than row by row.
  4. Is there anything else that I can do to make my code execute faster?

Here is an example code. My own code is a little more complex, as the text files are more complex: I have to use about 10 regular expressions and multiple while loops to read the data in and allocate it to the right location in the right array. To keep the MWE simple, I haven't used repeating labels in the input files either, so it looks like I'm reading the file twice for no reason. I hope that makes sense!


import re
import pandas as pd

df = pd.DataFrame()
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
reg_ex = re.compile('^(.+) (.+)\n')
# read all files to determine what indices are available
for path in paths:
    file_obj = open(path, 'r')
    print file_obj.readlines()

['a 1\n', 'b 2\n', 'end']
['c 3\n', 'd 4\n', 'end']

indices = []
for path in paths:
    index = []
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                index += match.group(1)
            except AttributeError:
                pass
    indices.append(index)
# read files again and put data into a master dataframe
for path, index in zip(paths, indices):
    subset_df = pd.DataFrame(index=index, columns=["Number"])
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                subset_df.loc[[match.group(1)]] = match.group(2)
            except AttributeError:
                pass
    df = pd.concat([df, subset_df]).sort_index()
print df

  Number
a      1
b      2
c      3
d      4

My input files:


test1.txt


a 1
b 2
end

test2.txt


c 3
d 4
end

Accepted answer by bluprince13

It turns out that creating a blank DataFrame first, searching the index to find the right place for a row of data, and then updating just that one row of the DataFrame is a stupidly time expensive process.


A much faster way of doing this is to read the contents of the input file into a primitive data structure such as a list of lists or a list of dicts, and then convert that into a DataFrame.


Use lists when all of the data that you're reading in are in the same columns. Otherwise, use dicts to explicitly say which column each bit of data should go to.

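A minimal sketch of that approach, reusing the paths and the simple "label value" line format from the MWE above (an illustration of the idea, not the exact code from my project):

import re
import pandas as pd

reg_ex = re.compile(r'^(.+) (.+)$')
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]

records = []  # a list of dicts; pandas builds the whole frame in one go
for path in paths:
    with open(path, 'r') as file_obj:
        for line in file_obj:
            match = reg_ex.match(line.rstrip('\n'))
            if match:  # lines like 'end' simply don't match
                records.append({"label": match.group(1),
                                "Number": int(match.group(2))})

df = pd.DataFrame(records).set_index("label").sort_index()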

Update Jan 18: This is linked to How to parse complex text files using Python? I also wrote a blog article explaining to beginners how to parse complex files.


Answered by clocker

Before pulling out the multiprocessing hammer, your first step should be to do some profiling. Use cProfile to quickly identify which functions are taking a long time. Unfortunately, if your lines are all in a single function call, they will show up as library calls. line_profiler is better, but takes a little more setup time.

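A minimal way to do that with cProfile from inside the script, assuming your work is wrapped in a main() function (that wrapper is an assumption about your code):

import cProfile
import pstats

cProfile.run('main()', 'profile_stats')          # profile the top-level function
stats = pstats.Stats('profile_stats')
stats.sort_stats('cumulative').print_stats(20)   # show the 20 biggest cumulative-time entries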

NOTE: If using IPython, you can use %timeit (the magic command for the timeit module) and %prun (the magic command for the profile module) to time both your statements and your functions. A Google search will show some guides.


Pandas is a wonderful library, but I've been an occasional victim of using it poorly, with atrocious results. In particular, be wary of append()/concat() operations. They might be your bottleneck, but you should profile to be sure. Usually, the numpy.vstack() and numpy.hstack() operations are faster if you don't need to perform index/column alignment. In your case it looks like you might be able to get by with Series or 1-D numpy ndarrays, which can save time.


BTW, a try block in Python is much slower (often 10x or more) than checking for an invalid condition, so be sure you absolutely need it before sticking it into a loop that runs for every single line. This is probably the other big time sink; I imagine you stuck the try block in to catch the AttributeError raised when match.group(1) fails. I would check for a valid match first.

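Applied to the reading loop from the question, that check might look like this (same regex and path variables, just no exception handling):

index = []
with open(path, 'r') as file_obj:
    for line in file_obj:
        match = reg_ex.match(line)
        if match:  # lines such as 'end' don't match, so they are simply skipped
            index.append(match.group(1))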

Even these small modifications should be enough for your program to run significantly faster before trying anything drastic like multiprocessing. Those Python libraries are awesome but bring a fresh set of challenges to deal with.


Answered by Некто

I've used this many times, as it's a particularly easy implementation of multiprocessing.


import pandas as pd
from multiprocessing import Pool

def reader(filename):
    return pd.read_excel(filename)

def main():
    pool = Pool(4) # number of cores you want to use
    file_list = ["file1.xlsx", "file2.xlsx", "file3.xlsx"]  # ... add your files here
    df_list = pool.map(reader, file_list) #creates a list of the loaded df's
    df = pd.concat(df_list) # concatenates all the df's into a single df

if __name__ == '__main__':
    main()

Using this, you should be able to substantially increase the speed of your program without much work at all. If you don't know how many processors you have, you can check by pulling up your shell and typing:


echo %NUMBER_OF_PROCESSORS%
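
That environment variable is Windows-specific; a portable alternative from within Python is:

import multiprocessing
print(multiprocessing.cpu_count())  # number of logical CPUs visible to the OS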

EDIT: To make this run even faster, consider changing your files to CSVs and using the pandas function pandas.read_csv.


Answered by Dmitry Rubanovich

First of all, if you are reading the file in multiple times, it seems like that would be the bottleneck. Try reading the file into one string object and then using cStringIO on it multiple times.

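A sketch of that idea, using io.StringIO (the Python 3 equivalent of cStringIO); the two passes below are stand-ins for the label scan and the value scan described in the question:

import io

with open(path, 'r') as file_obj:
    content = file_obj.read()          # the disk is touched only once

# first pass over the in-memory copy: collect the labels
labels = [line.split()[0] for line in io.StringIO(content) if ' ' in line]

# second pass over the same in-memory copy: collect the values
values = [line.split()[1] for line in io.StringIO(content) if ' ' in line]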

Second, you haven't really shown any reason to build the indices before reading in all the files. Even if you do, why are you using Pandas for IO? It seems like you can build it up in regular Python data structures (maybe using __slots__) and then put it in the master dataframe. If you don't need file X's index before you read file Y (as your 2nd loop seems to suggest), you only need to loop over the files once.


Third, you can either use a simple split/strip on the strings to pull out space-separated tokens, or, if it's more complicated (there are string quotes and such), use the csv module from Python's standard library. Until you show how you actually build up your data, it's hard to suggest a fix related to that.


What you have shown so far can be done fairly quickly with something as simple as:


for path in paths:
    data = []
    with open(path, 'r') as file_obj:
        for line in file_obj:
            try:
                d1, d2 = line.strip().split()
            except ValueError:
                continue  # skip lines (such as 'end') that don't split into two fields
            data.append((d1, int(d2)))
    index, values = zip(*data)
    subset_df = pd.DataFrame({"Number": pd.Series(values, index=index)})

Here's the difference in timings when I run on a virtual machine with the disk space not pre-allocated (the generated files are roughly 24MB in size):


import pandas as pd
from random import randint
from itertools import combinations
from posix import fsync


outfile = "indexValueInput"

for suffix in ('1', '2'):
    with open(outfile+"_" + suffix, 'w') as f:
        for i, label in enumerate(combinations([chr(i) for i in range(ord('a'), ord('z')+1)], 8)) :
            val = randint(1, 1000000)
            print >>f, "%s %d" % (''.join(label), val)
            if i > 3999999:
                break
        print >>f, "end"
        fsync(f.fileno())

def readWithPandas():
    data = []
    with open(outfile + "_2", 'r') as file_obj:
        for line in file_obj:
            try:
                d1, d2 = str.split(line.strip())
            except ValueError:
                pass
            data.append((d1, int(d2)))
    index, values = zip(*data)
    subset_df = pd.DataFrame({"Numbers": pd.Series(values, index=index)})

def readWithoutPandas():
    data = []
    with open(outfile+"_1", 'r') as file_obj:
        for line in file_obj:
            try:
                d1, d2 = str.split(line.strip())
            except ValueError:
                pass
            data.append((d1, int(d2)))
    index, values = zip(*data)

def time_func(func, *args):
    import time
    print "timing function", str(func.func_name)
    tStart = time.clock()
    func(*args)
    tEnd = time.clock()
    print "%f seconds " % (tEnd - tStart)

time_func(readWithoutPandas)
time_func(readWithPandas)

The resulting times are:


timing function readWithoutPandas
4.616853 seconds 
timing function readWithPandas
4.931765 seconds 

You can try these functions with your index buildup and see what the difference in time would be. It is almost certain that the slowdown comes from the multiple disk reads. And since Pandas takes almost no time to build up your dataframe from a dictionary, you are better off figuring out how to build up your index in pure Python before passing the data to Pandas. But do both the data read and the index build-up in one disk read.


I guess one other caveat is that if you print from inside your code, expect that to take a huge amount of time. The time it takes to write plain text to a tty dwarfs the time it takes to read/write to disk.


Answered by cgte

General Python considerations:


First of all, for time measurement you may use a snippet like this:


from time import time, sleep


class Timer(object):
    def __init__(self):
        self.last = time()


    def __call__(self):
        old = self.last
        self.last = time()
        return self.last - old

    @property
    def elapsed(self):
        return time() - self.last



timer = Timer()

sleep(2)
print timer.elapsed
print timer()
sleep(1)
print timer()

Then you can benchmark your code by running it many times and checking the difference.


Regarding your code, I have commented inline:


with open(path, 'r') as file_obj:
    line = True
    while line:  # iterate over the file object (or readlines()) instead
        try:
            line = file_obj.readline()
            match = reg_ex.match(line)
            index += match.group(1)
            #if match:
            #    index.extend(match.group(1)) # or extend

        except AttributeError:
            pass

Your previous code was not really pythonic; if you do want to use try/except, wrap only the minimum possible lines in the try block.


The same remarks apply to the second block of code.


If you need to read the same files multiple times, you could store them in RAM using StringIO, or, more simply, keep a {path: content} dict that you fill by reading each file only once.

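A minimal sketch of the {path: content} cache, assuming the paths list from the question:

contents = {}
for path in paths:
    with open(path, 'r') as file_obj:
        contents[path] = file_obj.read()   # each file hits the disk exactly once

# every later pass works on the cached string instead of reopening the file
lines = contents[paths[0]].splitlines()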

Python regexes are known to be slow, and your data seems pretty simple, so you may consider using the split and strip methods on your input lines.


stripped = [l.split() for l in [c.strip() for c in file_desc.readlines()] if l]

I recommend you read this: https://gist.github.com/JeffPaine/6213790 ; the corresponding video is here: https://www.youtube.com/watch?v=OSGv2VnC0go


Answered by stovfl

Your code doesn't do what you describe.


Question: 1. Read every file and check what the labels are. The information I need is often contained in the first few lines.


But you read the whole file, not only a few lines. This results in reading the files twice!


Question: 2. Read the file again and fill the dataframe with values.


You overwrite df['a'|'b'|'c'|'d'] in the loop again and again, which is useless. I believe this is not what you want. This works for the data given in the question, but not if you have to deal with n values.

Proposal with a different logic:


data = {}
for path in paths:
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                if match.group(1) not in data:
                    data[ match.group(1) ] = []

                data[match.group(1)].append( match.group(2) )
            except AttributeError:
                pass

print('data=%s' % data)
df = pd.DataFrame.from_dict(data, orient='index').sort_index()
df.rename(index=str, columns={0: "Number"}, inplace=True)  

Output:


data={'b': ['2'], 'a': ['1'], 'd': ['4'], 'c': ['3']}
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 1 columns):
Number    4 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
  Number
a      1
b      2
c      3
d      4  

Time Table:


             Code from Q:   to_dict_from_dict
    4 values 0:00:00.033071 0:00:00.022146
 1000 values 0:00:08.267750 0:00:05.536500
10000 values 0:01:22.677500 0:00:55.365000

Tested with Python:3.4.2 - pandas:0.19.2 - re:2.2.1


Answered by Ron Distante

You can import the multiprocessing module and use a pool of worker processes to open multiple files as file objects concurrently, speeding up the loading portion of your code. To measure the time, import the datetime module and use the following code:


import datetime
start=datetime.datetime.now()

#part of your code goes here

execTime1=datetime.datetime.now()
print(execTime1-start)

#the next part of your code goes here

execTime2=datetime.datetime.now()
print(execTime2-execTime1)

As for reading each file only once, consider using another multiprocessing script to build a list of lines in each file, so you can check for a match without a file I/O operation.

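A sketch of that idea with a worker pool; the paths come from the question's MWE and the pool size of 4 is arbitrary:

from multiprocessing import Pool

def read_lines(path):
    with open(path, 'r') as file_obj:
        return file_obj.readlines()

if __name__ == '__main__':
    paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
    pool = Pool(4)
    lines_by_file = dict(zip(paths, pool.map(read_lines, paths)))
    pool.close()
    pool.join()
    # lines_by_file[path] now holds every line of that file; matching needs no further I/O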

Answered by EngineeredBrain

First, use a profiler on your script (see this question). Analyse exactly which part is consuming the most time, and see if you can optimise it.


Second, I feel that the I/O operation (file reading) is most likely the bottleneck. It can be optimised using a concurrent approach: I would suggest reading the files concurrently and creating a data frame for each one. Each thread can push its newly created data frame to a queue, and a main thread monitoring the queue can pick up the data frames and merge them with the master data frame.

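A sketch of that pattern using concurrent.futures, which takes care of the thread pool and result-collection bookkeeping; parse_one_file is a hypothetical stand-in for whatever per-file parsing you already have, written here for the MWE's "label value" format:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def parse_one_file(path):
    # hypothetical placeholder: read one file and return a small dataframe
    rows = []
    with open(path, 'r') as file_obj:
        for line in file_obj:
            parts = line.split()
            if len(parts) == 2:
                rows.append((parts[0], int(parts[1])))
    return pd.DataFrame(rows, columns=['label', 'Number']).set_index('label')

paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
with ThreadPoolExecutor(max_workers=4) as executor:
    frames = list(executor.map(parse_one_file, paths))

master_df = pd.concat(frames).sort_index()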

Hope this helps.


Answered by quester

1. Create one output template for the files (e.g. the result data frame should have columns A, B, C).


2. Read every file, transform it into the output template established in step 1, and save it as a file such as temp_idxx.csv; this can be done in parallel :)


3. Concatenate these temp_idxx.csv files into one massive file and delete the temporary files.


The pros of this procedure are that it can be run in parallel and that it will not eat up all the memory; the cons are having to create an output format and stick to it, plus the extra disk space usage. A sketch of steps 2 and 3 follows.

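A rough sketch of steps 2 and 3, with hypothetical temp-file names and a placeholder transform_to_template() function standing in for the per-file work (step 2's loop is the part that could be parallelised; paths is the file list from the question):

import glob
import os
import pandas as pd

# step 2: normalise each input file to the common template and write a temp csv
for i, path in enumerate(paths):
    template_df = transform_to_template(path)   # hypothetical per-file transform returning a DataFrame
    template_df.to_csv('temp_idx%02d.csv' % i, index=False)

# step 3: concatenate the temp csvs into one frame, then delete the temporary files
temp_files = sorted(glob.glob('temp_idx*.csv'))
master = pd.concat([pd.read_csv(f) for f in temp_files], ignore_index=True)
for f in temp_files:
    os.remove(f)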

Answered by blindChicken

Read the files directly into a pandas dataframe using pd.read_csv to create your subset_df. Use parameters such as skipfooter to skip the lines at the end of the file that you know you won't need. There are many more parameters available that may replace some of the regex loop functions you are using, such as error_bad_lines and skip_blank_lines.


Then use tools provided by pandas to clean out the data that is not needed.

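For input shaped like the MWE's files, the direct read might look something like this (the separator, column names and the single 'end' footer line are assumptions about the real data; skipfooter requires the python engine):

import pandas as pd

paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
frames = [pd.read_csv(path,
                      sep=' ',              # assumed single-space delimiter
                      names=['label', 'Number'],
                      index_col='label',
                      skipfooter=1,         # drop the trailing 'end' line
                      engine='python')      # skipfooter needs the python engine
          for path in paths]
df = pd.concat(frames).sort_index()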

This will allow you to open and read each file only once.
