Python: Returning multiple columns from pandas apply()

Question

Asked by PaulMest

I have a pandas DataFrame, df_test. It contains a column 'size' which represents size in bytes. I've calculated KB, MB, and GB using the following code:

import locale

import pandas as pd

# The thousands separators in the output below assume a locale with grouping (e.g. en_US) is active.
df_test = pd.DataFrame([
    {'dir': '/Users/uname1', 'size': 994933},
    {'dir': '/Users/uname2', 'size': 109338711},
])

# locale.format_string() replaces locale.format(), which was removed in Python 3.12.
df_test['size_kb'] = df_test['size'].astype(int).apply(lambda x: locale.format_string("%.1f", x / 1024.0, grouping=True) + ' KB')
df_test['size_mb'] = df_test['size'].astype(int).apply(lambda x: locale.format_string("%.1f", x / 1024.0 ** 2, grouping=True) + ' MB')
df_test['size_gb'] = df_test['size'].astype(int).apply(lambda x: locale.format_string("%.1f", x / 1024.0 ** 3, grouping=True) + ' GB')

df_test


             dir       size       size_kb   size_mb size_gb
0  /Users/uname1     994933      971.6 KB    0.9 MB  0.0 GB
1  /Users/uname2  109338711  106,776.1 KB  104.3 MB  0.1 GB

[2 rows x 5 columns]

I ran this over 120,000 rows, and according to %timeit it takes about 2.97 seconds per column, so roughly 9 seconds for all three.

Is there any way I can make this faster? For example, instead of returning one column at a time from apply and running it three times, can I return all three columns in one pass and insert them back into the original DataFrame?

The other questions I've found all want to take multiple values and return a single value. I want to take a single value and return multiple columns.

Answer 1

Accepted answer by Nelz11

This is an old question, but for completeness: you can return a Series containing the new data from the applied function, which avoids iterating three times. Passing axis=1 to apply applies the function sizes to each row of the DataFrame, and the returned Series are assembled into a new DataFrame. Each Series, s, contains the new values as well as the original data.

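import locale

def sizes(s):
    # s is one row of the DataFrame (a Series); the new fields are added to it
    s['size_kb'] = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s

# rows_list is not defined in this answer, and DataFrame.append() was removed in pandas 2.0
# df_test = df_test.append(rows_list)
df_test = df_test.apply(sizes, axis=1)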

Answer 2

Answered by FooBar

Generally, to return multiple values, this is what I do:

import numpy as np
import pandas as pd

def gimmeMultiple(group):
    x1 = 1
    x2 = 2
    return np.array([[x1, x2]])

def gimmeMultipleDf(group):
    x1 = 1
    x2 = 2
    return pd.DataFrame(np.array([[x1, x2]]), columns=['x1', 'x2'])

# Inspect what apply() returns in each case
df_test['size'].astype(int).apply(gimmeMultiple)
df_test['size'].astype(int).apply(gimmeMultipleDf)

Returning a DataFrame definitely has its perks, but it is not always required. You can look at what apply() returns and experiment a bit with the functions ;)

Answer 3

Answered by Jesse

Using apply with zip is about 3 times faster than the Series approach.

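import locale

def sizes(s):
    # s is a single value from the 'size' column, not a row
    return locale.format_string("%.1f", s / 1024.0, grouping=True) + ' KB', \
        locale.format_string("%.1f", s / 1024.0 ** 2, grouping=True) + ' MB', \
        locale.format_string("%.1f", s / 1024.0 ** 3, grouping=True) + ' GB'
# zip(*...) regroups the per-row tuples into one sequence per new column
df_test['size_kb'], df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes))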

The test results are:

Separate df.apply(): 

    100 loops, best of 3: 1.43 ms per loop

Return Series: 

    100 loops, best of 3: 2.61 ms per loop

Return tuple:

    1000 loops, best of 3: 819 μs per loop
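
To make the zip(*...) step in this answer more concrete, here is a minimal sketch of how it turns one tuple per row into one sequence per column. The helper sizes_plain and the frame df_demo are hypothetical names, and the sketch formats with Python f-strings rather than locale so that the result does not depend on the active locale.

```python
import pandas as pd

def sizes_plain(size):
    # Hypothetical stand-in for sizes(): f-string grouping instead of locale
    kb = f"{size / 1024.0:,.1f} KB"
    mb = f"{size / 1024.0 ** 2:,.1f} MB"
    gb = f"{size / 1024.0 ** 3:,.1f} GB"
    return kb, mb, gb

df_demo = pd.DataFrame({'size': [994933, 109338711]})

# Series.apply yields one 3-tuple per row
per_row = df_demo['size'].apply(sizes_plain)

# zip(*per_row) regroups them: one tuple per output column
kb_col, mb_col, gb_col = zip(*per_row)

df_demo['size_kb'], df_demo['size_mb'], df_demo['size_gb'] = kb_col, mb_col, gb_col
print(df_demo)
```

Because the function is applied to the 'size' column only, it is called with plain values and apply runs a single pass over the data.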

Answer 4

Answered by jaumebonet

Some of the current replies work fine, but I want to offer another, maybe more "pandifyed" option. This works for me with pandas 0.23 (I am not sure whether it works in earlier versions):

import locale

import pandas as pd

df_test = pd.DataFrame([
  {'dir': '/Users/uname1', 'size': 994933},
  {'dir': '/Users/uname2', 'size': 109338711},
])

def sizes(s):
  # s is one row of the DataFrame; the three strings are returned as a tuple
  a = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
  b = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
  c = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
  return a, b, c

df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1, result_type="expand")

Note that the trick is the result_type parameter of apply, which expands the result into a DataFrame that can be assigned directly to new or existing columns.

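As a rough illustration of what result_type="expand" produces, the sketch below applies a tuple-returning function row by row and keeps the intermediate results around for inspection before assigning them. The function kb_mb_gb and the frame df_demo are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({'dir': ['/Users/uname1', '/Users/uname2'],
                        'size': [994933, 109338711]})

def kb_mb_gb(row):
    # Hypothetical helper: returns three numbers per row as a tuple
    size = row['size']
    return size / 1024.0, size / 1024.0 ** 2, size / 1024.0 ** 3

# Without result_type, the tuples stay packed in a single Series
packed = df_demo.apply(kb_mb_gb, axis=1)

# With result_type="expand", each tuple element becomes its own column
expanded = df_demo.apply(kb_mb_gb, axis=1, result_type="expand")

# The expanded columns are labelled 0, 1, 2, so the names come from the assignment
df_demo[['size_kb', 'size_mb', 'size_gb']] = expanded
print(df_demo)
```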

Answer 5

Answered by alvaro nortes

Just another readable way. This code adds three new columns and their values by returning a Series from the applied function, without passing any extra parameters to apply.

import locale

import pandas as pd

def sizes(s):
    val_kb = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    val_mb = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    val_gb = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    # The Series index supplies the names of the new columns
    return pd.Series([val_kb, val_mb, val_gb], index=['size_kb', 'size_mb', 'size_gb'])

df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes, axis=1)

A general example from: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html

df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)

#   foo  bar
#0    1    2
#1    1    2
#2    1    2
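
When the applied function returns a named Series, the resulting DataFrame already carries the column names, so it could also be attached with DataFrame.join rather than by column assignment. This is only a sketch of that variation, not part of the original answer; split_size and df_demo are invented names.

```python
import pandas as pd

df_demo = pd.DataFrame({'dir': ['/Users/uname1', '/Users/uname2'],
                        'size': [994933, 109338711]})

def split_size(row):
    # Hypothetical helper: the Series index supplies the new column names
    return pd.Series([row['size'] / 1024.0, row['size'] / 1024.0 ** 2],
                     index=['size_kb', 'size_mb'])

# apply() assembles the returned Series into a DataFrame with named columns
new_cols = df_demo.apply(split_size, axis=1)

# join() aligns on the row index and leaves df_demo itself unchanged
result = df_demo.join(new_cols)
print(result)
```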

Answer 6

Answered by famaral42

Really cool answers! Thanks Jesse and jaumebonet! Just an observation regarding:

zip(* ...
... result_type="expand")

Although expand is arguably more elegant ("pandifyed"), zip is at least 2x faster. In the simple example below, I got a 4x speedup.

import pandas as pd

dat = [ [i, 10*i] for i in range(1000)]

df = pd.DataFrame(dat, columns = ["a","b"])

def add_and_sub(row):
    add = row["a"] + row["b"]
    sub = row["a"] - row["b"]
    return add, sub

df[["add", "sub"]] = df.apply(add_and_sub, axis=1, result_type="expand")
# versus
df["add"], df["sub"] = zip(*df.apply(add_and_sub, axis=1))
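
For readers who want to reproduce the comparison outside IPython, the following is one possible way to time both variants with the standard timeit module. The wrapper functions with_expand and with_zip and the frame df_bench are assumptions made for this sketch, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

df_bench = pd.DataFrame([[i, 10 * i] for i in range(1000)], columns=["a", "b"])

def add_and_sub(row):
    return row["a"] + row["b"], row["a"] - row["b"]

def with_expand():
    # Work on a copy so every timed run starts from the same frame
    out = df_bench.copy()
    out[["add", "sub"]] = out.apply(add_and_sub, axis=1, result_type="expand")
    return out

def with_zip():
    out = df_bench.copy()
    out["add"], out["sub"] = zip(*out.apply(add_and_sub, axis=1))
    return out

# Both variants should produce the same values
assert with_expand()["add"].tolist() == with_zip()["add"].tolist()

# number=5 keeps the run short; increase it for more stable measurements
t_expand = timeit.timeit(with_expand, number=5)
t_zip = timeit.timeit(with_zip, number=5)
print(f"expand: {t_expand:.3f}s  zip: {t_zip:.3f}s")
```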

Answer 7

Answered by Waldeyr Mendes da Silva

This gives a new DataFrame containing two columns from the original one.

import pandas as pd
df = ...
df_with_two_columns = df.apply(lambda row:pd.Series([row['column_1'], row['column_2']], index=['column_1', 'column_2']),axis = 1)

Answer 8

Answered by Rocky K

The performance of the top answers varies significantly. Jesse and famaral42 have already discussed this, but it is worth sharing a fair comparison between the top answers and elaborating on a subtle but important detail of Jesse's answer: the argument passed in to the function also affects performance.

(Python 3.7.4, Pandas 1.0.3)

import pandas as pd
import locale
import timeit


def create_new_df_test():
    df_test = pd.DataFrame([
      {'dir': '/Users/uname1', 'size': 994933},
      {'dir': '/Users/uname2', 'size': 109338711},
    ])
    return df_test


def sizes_pass_series_return_series(series):
    series['size_kb'] = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    series['size_mb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    series['size_gb'] = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return series


def sizes_pass_series_return_tuple(series):
    a = locale.format_string("%.1f", series['size'] / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", series['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", series['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c


def sizes_pass_value_return_tuple(value):
    a = locale.format_string("%.1f", value / 1024.0, grouping=True) + ' KB'
    b = locale.format_string("%.1f", value / 1024.0 ** 2, grouping=True) + ' MB'
    c = locale.format_string("%.1f", value / 1024.0 ** 3, grouping=True) + ' GB'
    return a, b, c

Here are the results:

# 1 - Accepted (Nelz11 Answer) - (pass series, return series):
9.82 ms ± 377 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2 - Pandafied (jaumebonet Answer) - (pass series, return tuple):
2.34 ms ± 48.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3 - Tuples (pass series, return tuple then zip):
1.36 ms ± 62.8 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4 - Tuples (Jesse Answer) - (pass value, return tuple then zip):
752 μs ± 18.5 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice that returning tuples is the fastest method, but what is passed in as an argument also affects performance. The difference in the code is subtle, but the performance improvement is significant.

Test #4 (passing in a single value) is twice as fast as test #3 (passing in a series), even though the operation performed is ostensibly identical.

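One plausible reason for this gap (an assumption, not something measured in this answer) is the kind of object each call style hands to the function. The sketch below simply records the argument types in both cases; df_demo and the two lists are hypothetical names.

```python
import pandas as pd

df_demo = pd.DataFrame({'dir': ['/Users/uname1', '/Users/uname2'],
                        'size': [994933, 109338711]})

seen_from_series = []
seen_from_frame = []

# Series.apply hands the function one scalar value at a time
df_demo['size'].apply(lambda value: seen_from_series.append(type(value)))

# DataFrame.apply(axis=1) hands the function a Series built for each row
df_demo.apply(lambda row: seen_from_frame.append(type(row)), axis=1)

print(seen_from_series)
print(seen_from_frame)
```

Building a Series per row and then looking up row['size'] inside the function is extra work that the single-column version does not need.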

But there's more...

# 1a - Accepted (Nelz11 Answer) - (pass series, return series, new columns exist):
3.23 ms ± 141 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 2a - Pandafied (jaumebonet Answer) - (pass series, return tuple, new columns exist):
2.31 ms ± 39.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# 3a - Tuples (pass series, return tuple then zip, new columns exist):
1.36 ms ± 58.4 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# 4a - Tuples (Jesse Answer) - (pass value, return tuple then zip, new columns exist):
694 μs ± 3.9 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In some cases (#1a and #4a), applying the function to a DataFrame in which the output columns already exist is faster than creating them from the function.

Here is the code for running the tests:

# Paste and run the following in ipython console. It will not work if you run it from a .py file.
print('\nAccepted Answer (pass series, return series, new columns dont exist):')
df_test = create_new_df_test()
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)
print('Accepted Answer (pass series, return series, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit result = df_test.apply(sizes_pass_series_return_series, axis=1)

print('\nPandafied (pass series, return tuple, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")
print('Pandafied (pass series, return tuple, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test[['size_kb', 'size_mb', 'size_gb']] = df_test.apply(sizes_pass_series_return_tuple, axis=1, result_type="expand")

print('\nTuples (pass series, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))
print('Tuples (pass series, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test.apply(sizes_pass_series_return_tuple, axis=1))

print('\nTuples (pass value, return tuple then zip, new columns dont exist):')
df_test = create_new_df_test()
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
print('Tuples (pass value, return tuple then zip, new columns exist):')
df_test = create_new_df_test()
df_test = pd.concat([df_test, pd.DataFrame(columns=['size_kb', 'size_mb', 'size_gb'])])
%timeit df_test['size_kb'],  df_test['size_mb'], df_test['size_gb'] = zip(*df_test['size'].apply(sizes_pass_value_return_tuple))
