Disclaimer: this page mirrors a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not the translator): StackOverflow.
Original question: http://stackoverflow.com/questions/16236684/
Apply pandas function to column to create multiple new columns?
Asked by smci
How to do this in pandas:
I have a function extract_text_features that operates on a single text column and returns multiple output columns. Specifically, the function returns 6 values.
The function works, however there doesn't seem to be any proper return type (pandas DataFrame / numpy array / Python list) such that the output can be correctly assigned with df.ix[:, 10:16] = df.textcol.map(extract_text_features)
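For concreteness, a minimal sketch of the shape of the problem (the function body and sample data here are hypothetical stand-ins, not the original extract_text_features):

import pandas as pd

def extract_text_features(text):
    # illustrative only: return 6 values derived from one text field
    return (len(text), text.count(' '), text[0], text[-1],
            text.lower(), text.upper())

df = pd.DataFrame({'textcol': ['foo bar', 'baz qux']})
# The assignment below is the part that fails: map() yields a Series of
# tuples, which cannot be spread across six columns in one step.
# df.ix[:, 10:16] = df.textcol.map(extract_text_features)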
So I think I need to drop back to iterating with df.iterrows(), as per this?
UPDATE:
Iterating with df.iterrows() is at least 20x slower, so I surrendered and split out the function into six distinct .map(lambda ...) calls.
UPDATE 2: this question was asked back around pandas v0.11.0, so much of the question and the answers are no longer very relevant.
Accepted answer by Zelazny7
Building off of user1827356's answer, you can do the assignment in one pass using df.merge:
df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})),
         left_index=True, right_index=True)
textcol feature1 feature2
0 0.772692 1.772692 -0.227308
1 0.857210 1.857210 -0.142790
2 0.065639 1.065639 -0.934361
3 0.819160 1.819160 -0.180840
4 0.088212 1.088212 -0.911788
EDIT: Please be aware of the huge memory consumption and low speed of this approach: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
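As a side note (not part of the original answer): because both frames share the same index, the merge above can also be written with df.join, which defaults to an index-on-index join; the cost of the apply call is unchanged. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'textcol': np.random.rand(5)})
# join aligns on the index, equivalent to merge(..., left_index=True, right_index=True)
df = df.join(df.textcol.apply(lambda s: pd.Series({'feature1': s + 1, 'feature2': s - 1})))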
Answer by user1827356
This is what I've done in the past:
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df
textcol
0 0.626524
1 0.119967
2 0.803650
3 0.100880
4 0.017859
df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))
feature1 feature2
0 1.626524 -0.373476
1 1.119967 -0.880033
2 1.803650 -0.196350
3 1.100880 -0.899120
4 1.017859 -0.982141
Editing for completeness
pd.concat([df, df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))], axis=1)
textcol feature1 feature2
0 0.626524 1.626524 -0.373476
1 0.119967 1.119967 -0.880033
2 0.803650 1.803650 -0.196350
3 0.100880 1.100880 -0.899120
4 0.017859 1.017859 -0.982141
Answer by ostrokach
I usually do this using zip:
>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
>>> def powers(x):
...     return x, x**2, x**3, x**4, x**5, x**6
>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
...     zip(*df['num'].map(powers))
>>> df
num p1 p2 p3 p4 p5 p6
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
2 2 2 4 8 16 32 64
3 3 3 9 27 81 243 729
4 4 4 16 64 256 1024 4096
5 5 5 25 125 625 3125 15625
6 6 6 36 216 1296 7776 46656
7 7 7 49 343 2401 16807 117649
8 8 8 64 512 4096 32768 262144
9 9 9 81 729 6561 59049 531441
Answer by RFox
I've looked at several ways of doing this, and the method shown here (returning a pandas Series) doesn't seem to be the most efficient.
If we start with a largeish dataframe of random data:
# Setup a dataframe of random numbers and create a column of colon-joined strings
df = pd.DataFrame(np.random.randn(10000,3),columns=list('ABC'))
df['D'] = df.apply(lambda r: ':'.join(map(str, (r.A, r.B, r.C))), axis=1)
columns = 'new_a', 'new_b', 'new_c'
The example shown here:
# Create the dataframe by returning a series
def method_b(v):
    return pd.Series({k: v for k, v in zip(columns, v.split(':'))})
%timeit -n10 -r3 df.D.apply(method_b)
10 loops, best of 3: 2.77 s per loop
An alternative method:
# Create a dataframe from a series of tuples
def method_a(v):
    return v.split(':')
%timeit -n10 -r3 pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns)
10 loops, best of 3: 8.85 ms per loop
By my reckoning it's far more efficient to take a series of tuples and then convert that to a DataFrame. I'd be interested to hear people's thinking, though, if there's an error in my working.
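One detail the timing above leaves out: to attach the new columns back to the original frame, it helps to carry the original index along so the rows stay matched. A sketch building on the variables defined above:

# keep df's index so the new columns align with the original rows
new_cols = pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns, index=df.index)
df = df.join(new_cols)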
Answer by Michael David Watson
This is the correct and easiest way to accomplish this for 95% of use cases:
>>> df = pd.DataFrame(zip(*[range(10)]), columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
>>> def example(x):
...     x['p1'] = x['num']**2
...     x['p2'] = x['num']**3
...     x['p3'] = x['num']**4
...     return x
>>> df = df.apply(example, axis=1)
>>> df
num p1 p2 p3
0 0 0 0 0
1 1 1 1 1
2 2 4 8 16
3 3 9 27 81
4 4 16 64 256
Answer by Evan W.
Summary: If you only want to create a few columns, use df[['new_col1','new_col2']] = df[['data1','data2']].apply(function_of_your_choosing, axis=1)
For this solution, the number of new columns you are creating must be equal to the number of columns you use as input to the .apply() function. If you want to do something else, have a look at the other answers.
Details: Let's say you have a two-column dataframe. The first column is a person's height when they are 10; the second is that person's height when they are 20.
Suppose you need to calculate both the mean and the sum of each person's heights. That's two values per row.
You could do this via the following, soon-to-be-applied function:
def mean_and_sum(x):
    """
    Calculates the mean and sum of two heights.
    Parameters:
    :x -- the values in the row this function is applied to. Could also work on a list or a tuple.
    """
    sum = x[0] + x[1]
    mean = sum / 2
    return [mean, sum]
You might use this function like so:
df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
(To be clear: this apply function takes in the values from each row in the subsetted dataframe and returns a list.)
However, if you do this:
df['Mean_&_Sum'] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
you'll create 1 new column that contains the [mean,sum] lists, which you'd presumably want to avoid, because that would require another Lambda/Apply.
Instead, you want to break out each value into its own column. To do this, you can create two columns at once:
df[['Mean','Sum']] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
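A caveat not in the original answer: in recent pandas versions an apply whose function returns a plain list produces a single Series of lists, so the two-column assignment above may not expand as shown. On pandas 0.23+ the expansion can be made explicit with result_type='expand'; a sketch under that assumption:

# list results are expanded into two positional columns, which we rename and join back
expanded = df[['height_at_age_10','height_at_age_20']].apply(
    mean_and_sum, axis=1, result_type='expand')
expanded.columns = ['Mean', 'Sum']
df = df.join(expanded)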
Answer by Ted Petrou
The accepted solution is going to be extremely slow for lots of data. The solution with the greatest number of upvotes is a little difficult to read and also slow with numeric data. If each new column can be calculated independently of the others, I would just assign each of them directly without using apply.
Example with fake character data
Create 100,000 strings in a DataFrame
df = pd.DataFrame(np.random.choice(['he jumped', 'she ran', 'they hiked'],
                                   size=100000, replace=True),
                  columns=['words'])
df.head()
words
0 she ran
1 she ran
2 they hiked
3 they hiked
4 they hiked
Let's say we wanted to extract some text features as done in the original question. For instance, let's extract the first character, count the occurrence of the letter 'e' and capitalize the phrase.
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
df.head()
words first count_e cap
0 she ran s 1 She ran
1 she ran s 1 She ran
2 they hiked t 2 They hiked
3 they hiked t 2 They hiked
4 they hiked t 2 They hiked
Timings
%%timeit
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
127 ms ± 585 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()
%timeit df['first'], df['count_e'], df['cap'] = zip(*df['words'].apply(extract_text_features))
101 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Surprisingly, you can get better performance by looping through each value
%%timeit
a,b,c = [], [], []
for s in df['words']:
    a.append(s[0]), b.append(s.count('e')), c.append(s.capitalize())
df['first'] = a
df['count_e'] = b
df['cap'] = c
79.1 ms ± 294 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another example with fake numeric data
Create 1 million random numbers and test the powers function from above.
df = pd.DataFrame(np.random.rand(1000000), columns=['num'])
def powers(x):
    return x, x**2, x**3, x**4, x**5, x**6
%%timeit
df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
    zip(*df['num'].map(powers))
1.35 s ± 83.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Assigning each column is 25x faster and very readable:
%%timeit
df['p1'] = df['num'] ** 1
df['p2'] = df['num'] ** 2
df['p3'] = df['num'] ** 3
df['p4'] = df['num'] ** 4
df['p5'] = df['num'] ** 5
df['p6'] = df['num'] ** 6
51.6 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I made a similar response with more details here on why apply is typically not the way to go.
Answer by Saket Bajaj
You can return the entire row instead of individual values:
df = df.apply(extract_text_features, axis=1)
where the function returns the modified row:
def extract_text_features(row):
    row['new_col1'] = value1
    row['new_col2'] = value2
    return row
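For illustration only (the original answer leaves value1 and value2 abstract), a filled-in version assuming an existing 'textcol' column might look like:

def extract_text_features(row):
    # hypothetical feature computations based on an assumed 'textcol' column
    row['new_col1'] = len(row['textcol'])
    row['new_col2'] = row['textcol'].upper()
    return row

df = df.apply(extract_text_features, axis=1)

Note that building and returning the whole row calls the Python-level function once per row, so it shares the performance caveats of the other apply-based answers.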
Answer by Ben
In 2018, I use apply() with the argument result_type='expand':
>>> applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
>>> df = pd.concat([df, applied_df], axis='columns')
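A self-contained sketch of the same idea (fn remains a hypothetical stand-in, as in the original answer; result_type='expand' assumes pandas 0.23 or newer):

import pandas as pd

def fn(text):
    # hypothetical: derive two new values from one text field
    return pd.Series({'n_words': len(text.split()), 'upper': text.upper()})

df = pd.DataFrame({'text': ['he jumped', 'she ran']})
applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')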
Answer by Dmytro Bugayev
I have posted the same answer in two other similar questions. The way I prefer to do this is to wrap up the return values of the function in a Series:
def f(x):
    return pd.Series([x**2, x**3])
And then use apply as follows to create separate columns:
df[['x**2','x**3']] = df.apply(lambda row: f(row['x']), axis=1)
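As a quick usage sketch (the column name 'x' and the sample data are assumptions for illustration, with f defined as above):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df[['x**2', 'x**3']] = df.apply(lambda row: f(row['x']), axis=1)

Because f returns a pd.Series of length two, the apply call yields a two-column frame whose values fill the two new columns.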

