Disclaimer: this page mirrors a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not the translator): StackOverflow.
Original question: http://stackoverflow.com/questions/16236684/
Apply pandas function to column to create multiple new columns?
Asked by smci
How to do this in pandas:
I have a function extract_text_features that operates on a single text column and returns multiple output columns. Specifically, the function returns 6 values.
The function works, however there doesn't seem to be any proper return type (pandas DataFrame / numpy array / Python list) such that the output can be correctly assigned with df.ix[:, 10:16] = df.textcol.map(extract_text_features)
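For concreteness, a minimal sketch of the shape of the problem (the function body and sample data here are hypothetical stand-ins, not the original extract_text_features):

import pandas as pd

def extract_text_features(text):
    # illustrative only: return 6 values derived from one text field
    return (len(text), text.count(' '), text[0], text[-1],
            text.lower(), text.upper())

df = pd.DataFrame({'textcol': ['foo bar', 'baz qux']})
# The assignment below is the part that fails: map() yields a Series of
# tuples, which cannot be spread across six columns in one step.
# df.ix[:, 10:16] = df.textcol.map(extract_text_features)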
So I think I need to drop back to iterating with df.iterrows(), as per this?
UPDATE:
Iterating with df.iterrows() is at least 20x slower, so I surrendered and split out the function into six distinct .map(lambda ...) calls.
UPDATE 2: this question was asked back around pandas v0.11.0, so much of the question and the answers are no longer very relevant.
Accepted answer by Zelazny7
Building off of user1827356's answer, you can do the assignment in one pass using df.merge:
df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})),
         left_index=True, right_index=True)
textcol feature1 feature2
0 0.772692 1.772692 -0.227308
1 0.857210 1.857210 -0.142790
2 0.065639 1.065639 -0.934361
3 0.819160 1.819160 -0.180840
4 0.088212 1.088212 -0.911788
EDIT: Please be aware of the huge memory consumption and low speed of this approach: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
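As a side note (not part of the original answer): because both frames share the same index, the merge above can also be written with df.join, which defaults to an index-on-index join; the cost of the apply call is unchanged. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'textcol': np.random.rand(5)})
# join aligns on the index, equivalent to merge(..., left_index=True, right_index=True)
df = df.join(df.textcol.apply(lambda s: pd.Series({'feature1': s + 1, 'feature2': s - 1})))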
Answer by user1827356
This is what I've done in the past:
df = pd.DataFrame({'textcol' : np.random.rand(5)})
df
textcol
0 0.626524
1 0.119967
2 0.803650
3 0.100880
4 0.017859
df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))
feature1 feature2
0 1.626524 -0.373476
1 1.119967 -0.880033
2 1.803650 -0.196350
3 1.100880 -0.899120
4 1.017859 -0.982141
Editing for completeness
pd.concat([df, df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))], axis=1)
textcol feature1 feature2
0 0.626524 1.626524 -0.373476
1 0.119967 1.119967 -0.880033
2 0.803650 1.803650 -0.196350
3 0.100880 1.100880 -0.899120
4 0.017859 1.017859 -0.982141
Answer by ostrokach
I usually do this using zip:
>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
>>> def powers(x):
...     return x, x**2, x**3, x**4, x**5, x**6
>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
...     zip(*df['num'].map(powers))
>>> df
num p1 p2 p3 p4 p5 p6
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
2 2 2 4 8 16 32 64
3 3 3 9 27 81 243 729
4 4 4 16 64 256 1024 4096
5 5 5 25 125 625 3125 15625
6 6 6 36 216 1296 7776 46656
7 7 7 49 343 2401 16807 117649
8 8 8 64 512 4096 32768 262144
9 9 9 81 729 6561 59049 531441
Answer by RFox
I've looked at several ways of doing this, and the method shown here (returning a pandas Series) doesn't seem to be the most efficient.
If we start with a largeish dataframe of random data:
# Setup a dataframe of random numbers and create a column of colon-joined strings
df = pd.DataFrame(np.random.randn(10000,3),columns=list('ABC'))
df['D'] = df.apply(lambda r: ':'.join(map(str, (r.A, r.B, r.C))), axis=1)
columns = 'new_a', 'new_b', 'new_c'
The example shown here:
# Create the dataframe by returning a series
def method_b(v):
    return pd.Series({k: v for k, v in zip(columns, v.split(':'))})
%timeit -n10 -r3 df.D.apply(method_b)
10 loops, best of 3: 2.77 s per loop
An alternative method:
# Create a dataframe from a series of tuples
def method_a(v):
    return v.split(':')
%timeit -n10 -r3 pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns)
10 loops, best of 3: 8.85 ms per loop
By my reckoning it's far more efficient to take a series of tuples and then convert that to a DataFrame. I'd be interested to hear people's thinking, though, if there's an error in my working.
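One detail the timing above leaves out: to attach the new columns back to the original frame, it helps to carry the original index along so the rows stay matched. A sketch building on the variables defined above:

# keep df's index so the new columns align with the original rows
new_cols = pd.DataFrame(df.D.apply(method_a).tolist(), columns=columns, index=df.index)
df = df.join(new_cols)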
Answer by Michael David Watson
This is the correct and easiest way to accomplish this for 95% of use cases:
>>> df = pd.DataFrame(zip(*[range(10)]), columns=['num'])
>>> df
num
0 0
1 1
2 2
3 3
4 4
5 5
>>> def example(x):
...     x['p1'] = x['num']**2
...     x['p2'] = x['num']**3
...     x['p3'] = x['num']**4
...     return x
>>> df = df.apply(example, axis=1)
>>> df
num p1 p2 p3
0 0 0 0 0
1 1 1 1 1
2 2 4 8 16
3 3 9 27 81
4 4 16 64 256
Answer by Evan W.
Summary: If you only want to create a few columns, use df[['new_col1','new_col2']] = df[['data1','data2']].apply(function_of_your_choosing, axis=1)
For this solution, the number of new columns you are creating must be equal to the number of columns you use as input to the .apply() function. If you want to do something else, have a look at the other answers.
Details: Let's say you have a two-column dataframe. The first column is a person's height when they are 10; the second is that person's height when they are 20.
Suppose you need to calculate both the mean and the sum of each person's heights. That's two values per row.
You could do this via the following, soon-to-be-applied function:
def mean_and_sum(x):
    """
    Calculates the mean and sum of two heights.
    Parameters:
    :x -- the values in the row this function is applied to. Could also work on a list or a tuple.
    """
    sum = x[0] + x[1]
    mean = sum / 2
    return [mean, sum]
You might use this function like so:
df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
(To be clear: this apply function takes in the values from each row in the subsetted dataframe and returns a list.)
However, if you do this:
df['Mean_&_Sum'] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
you'll create 1 new column that contains the [mean,sum] lists, which you'd presumably want to avoid, because that would require another Lambda/Apply.
Instead, you want to break out each value into its own column. To do this, you can create two columns at once:
df[['Mean','Sum']] = df[['height_at_age_10','height_at_age_20']].apply(mean_and_sum, axis=1)
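A caveat not in the original answer: in recent pandas versions an apply whose function returns a plain list produces a single Series of lists, so the two-column assignment above may not expand as shown. On pandas 0.23+ the expansion can be made explicit with result_type='expand'; a sketch under that assumption:

# list results are expanded into two positional columns, which we rename and join back
expanded = df[['height_at_age_10','height_at_age_20']].apply(
    mean_and_sum, axis=1, result_type='expand')
expanded.columns = ['Mean', 'Sum']
df = df.join(expanded)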
Answer by Ted Petrou
The accepted solution is going to be extremely slow for lots of data. The solution with the greatest number of upvotes is a little difficult to read and also slow with numeric data. If each new column can be calculated independently of the others, I would just assign each of them directly without using apply.
Example with fake character data
Create 100,000 strings in a DataFrame
df = pd.DataFrame(np.random.choice(['he jumped', 'she ran', 'they hiked'],
                                   size=100000, replace=True),
                  columns=['words'])
df.head()
words
0 she ran
1 she ran
2 they hiked
3 they hiked
4 they hiked
Let's say we wanted to extract some text features as done in the original question. For instance, let's extract the first character, count the occurrence of the letter 'e' and capitalize the phrase.
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
df.head()
words first count_e cap
0 she ran s 1 She ran
1 she ran s 1 She ran
2 they hiked t 2 They hiked
3 they hiked t 2 They hiked
4 they hiked t 2 They hiked
Timings
%%timeit
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
127 ms ± 585 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()
%timeit df['first'], df['count_e'], df['cap'] = zip(*df['words'].apply(extract_text_features))
101 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Surprisingly, you can get better performance by looping through each value
%%timeit
a,b,c = [], [], []
for s in df['words']:
    a.append(s[0]), b.append(s.count('e')), c.append(s.capitalize())
df['first'] = a
df['count_e'] = b
df['cap'] = c
79.1 ms ± 294 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another example with fake numeric data
Create 1 million random numbers and test the powers function from above.
df = pd.DataFrame(np.random.rand(1000000), columns=['num'])
def powers(x):
    return x, x**2, x**3, x**4, x**5, x**6
%%timeit
df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
    zip(*df['num'].map(powers))
1.35 s ± 83.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Assigning each column is 25x faster and very readable:
%%timeit
df['p1'] = df['num'] ** 1
df['p2'] = df['num'] ** 2
df['p3'] = df['num'] ** 3
df['p4'] = df['num'] ** 4
df['p5'] = df['num'] ** 5
df['p6'] = df['num'] ** 6
51.6 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I made a similar response with more details here on why apply is typically not the way to go.
Answer by Saket Bajaj
You can return the entire row instead of individual values:
df = df.apply(extract_text_features, axis=1)
where the function returns the modified row:
def extract_text_features(row):
    row['new_col1'] = value1
    row['new_col2'] = value2
    return row
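For illustration only (the original answer leaves value1 and value2 abstract), a filled-in version assuming an existing 'textcol' column might look like:

def extract_text_features(row):
    # hypothetical feature computations based on an assumed 'textcol' column
    row['new_col1'] = len(row['textcol'])
    row['new_col2'] = row['textcol'].upper()
    return row

df = df.apply(extract_text_features, axis=1)

Note that building and returning the whole row calls the Python-level function once per row, so it shares the performance caveats of the other apply-based answers.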
Answer by Ben
In 2018, I use apply() with the argument result_type='expand':
>>> applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
>>> df = pd.concat([df, applied_df], axis='columns')
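A self-contained sketch of the same idea (fn remains a hypothetical stand-in, as in the original answer; result_type='expand' assumes pandas 0.23 or newer):

import pandas as pd

def fn(text):
    # hypothetical: derive two new values from one text field
    return pd.Series({'n_words': len(text.split()), 'upper': text.upper()})

df = pd.DataFrame({'text': ['he jumped', 'she ran']})
applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')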
Answer by Dmytro Bugayev
I have posted the same answer in two other similar questions. The way I prefer to do this is to wrap up the return values of the function in a Series:
def f(x):
    return pd.Series([x**2, x**3])
And then use apply as follows to create separate columns:
df[['x**2','x**3']] = df.apply(lambda row: f(row['x']), axis=1)
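As a quick usage sketch (the column name 'x' and the sample data are assumptions for illustration, with f defined as above):

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
df[['x**2', 'x**3']] = df.apply(lambda row: f(row['x']), axis=1)

Because f returns a pd.Series of length two, the apply call yields a two-column frame whose values fill the two new columns.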

