Original URL: http://stackoverflow.com/questions/31860671/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas append performance: concat/append using "larger" DataFrames
Asked by MMCM_
The problem: I have data stored in csv files with the columns date/id/value. I have 15 files, each containing around 10-20 million rows. Each csv file covers a distinct period, so the time indexes are non-overlapping, but the columns partially overlap (new ids enter from time to time, old ones disappear). What I originally did was run the script without the pivot call, but then I ran into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemed at first a nice way out (roughly 2/3 less data), but now performance kicks in. If I run the following script, the concat function will run "forever" (so far I have always interrupted it manually after some time (>2h)). Do concat/append have limitations in terms of size (I end up with roughly 10000-20000 columns), or am I missing something here? Any suggestions?
import pandas as pd

path = 'D:\\'
data = pd.DataFrame()
# loop through list of raw file names
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    data = pd.concat([data, data_tmp])
    del data_tmp
EDIT I: To clarify, each csv file has about 10-20 million rows and three columns; after the pivot is applied, this reduces to about 2000 rows but leads to about 10000 columns.
I can solve the memory issue by simply splitting the full set of ids into subsets and running the needed calculations on each subset, as they are independent for each id. I know it makes me reload the same files n times, where n is the number of subsets used, but this is still reasonably fast. I still wonder why append is not performing.
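A minimal sketch of that subset-based workaround (my own illustration, not the original code): raw_files and path are the variables from the script above, while all_ids, n_subsets and run_calculations are hypothetical placeholders.

import pandas as pd

n_subsets = 4
# split the full universe of ids into independent chunks
id_chunks = [all_ids[i::n_subsets] for i in range(n_subsets)]

for ids in id_chunks:
    frames = []
    for file in raw_files:  # reload every file once per subset
        tmp = pd.read_csv(path + file, compression='gzip',
                          usecols=['date', 'Value', 'ID'])
        tmp = tmp[tmp['ID'].isin(ids)]  # keep only this subset of ids
        frames.append(tmp.pivot(index='date', columns='ID', values='Value'))
    subset_data = pd.concat(frames)  # far fewer columns per concat
    run_calculations(subset_data)  # hypothetical per-subset processing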
EDIT II: I have tried to recreate the file structure with a simulation that is as close as possible to the actual data structure. I hope it is clear; I didn't spend too much time minimizing the simulation time, but it runs reasonably fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math
# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------
# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
    id_list.append(''.join(
        random.choice(string.ascii_uppercase + string.digits) for _
        in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]
time_index = pd.bdate_range(start_date,end_date,freq='D')
chunk_size = math.ceil(len(time_index)/num_files)
data = []
# Simulate files
for file in range(0, run_to_file):
    tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
    # TODO not all cases cover, make sure ints are obtained
    tmp_ids = id_list[file * id_interval:
                      start_ids + (file + 1) * id_interval]
    tmp_data = pd.DataFrame(np.random.standard_normal(
        (len(tmp_time), len(tmp_ids))), index=tmp_time,
        columns=tmp_ids)
    tmp_file = tmp_data.stack().sortlevel(1).reset_index()
    # final simulated data structure of the parsed csv file
    tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
                                        'ID', 0: 'Value'})
    # comment/uncomment if pivot takes place on aggregate level or not
    tmp_file = tmp_file.pivot(index='Date', columns='ID',
                              values='Value')
    data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
Answered by joris
Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier approach when you have to deal with a full list of dataframes.
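For a whole list of pivoted frames (such as the data list built in the simulation above, before the final concat), the same reindex-to-the-union idea might look like this sketch:

import pandas as pd

# build the union of all column names across the list of pivoted frames
common_columns = pd.Index([])
for df in data:
    common_columns = common_columns.union(df.columns)

# reindex each frame to the common columns, then concatenate once
aligned = [df.reindex(columns=common_columns) for df in data]
result = pd.concat(aligned)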
The reason that pd.concat is slower is that it does more. E.g. when the column names are not equal, it checks for every column whether the dtype has to be upcasted or not to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure all dtypes are the same, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
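A tiny illustration of the upcasting joris describes (my own example, not from the answer): concatenating two integer frames with different column names forces both columns to float, because the introduced NaNs cannot be held in an integer dtype.

import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, dtype='int64')
b = pd.DataFrame({'y': [3, 4]}, dtype='int64')

out = pd.concat([a, b])
print(out.dtypes)  # both x and y become float64 because of the NaNs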
Answered by MMCM_
Summary: three key performance drivers, depending on the set-up:
1) Make sure the datatypes are the same when concatenating two dataframes
2) Use integer-based column names if possible (see the sketch after this list)
3) When using string-based columns, make sure to use the align method before concat is called, as suggested by joris
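A possible illustration of point 2 (not code from the answer): build one global mapping from the string IDs to integers up front, so that the pivoted frames carry integer column labels. Here all_ids is assumed to be known in advance, and raw_files/path are the variables from the question.

import pandas as pd

# hypothetical global ID -> integer mapping, consistent across all files
id_to_code = {id_: i for i, id_ in enumerate(sorted(all_ids))}

frames = []
for file in raw_files:
    tmp = pd.read_csv(path + file, compression='gzip',
                      usecols=['date', 'Value', 'ID'])
    tmp['ID'] = tmp['ID'].map(id_to_code)  # integer column labels after the pivot
    frames.append(tmp.pivot(index='date', columns='ID', values='Value'))

data = pd.concat(frames)  # columns are now integers, which the answer reports as faster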
Answered by Alexander
As @joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
    data_tmp = pd.read_csv(path + file, engine='c',
                           compression='gzip',
                           low_memory=False,
                           usecols=['date', 'Value', 'ID'])
    data_tmp = data_tmp.pivot(index='date', columns='ID',
                              values='Value')
    dfs.append(data_tmp)
    del data_tmp

data = pd.concat(dfs)

