Original question: http://stackoverflow.com/questions/15118111/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Apply function to each row of pandas dataframe to create two new columns
Asked by robintw
I have a pandas DataFrame, st, containing multiple columns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss) 53732 non-null values
Julian_Day 53732 non-null values
AOT_1020 53716 non-null values
AOT_870 53732 non-null values
AOT_675 53188 non-null values
AOT_500 51687 non-null values
AOT_440 53727 non-null values
AOT_380 51864 non-null values
AOT_340 52852 non-null values
Water(cm) 51687 non-null values
%TripletVar_1020 53710 non-null values
%TripletVar_870 53726 non-null values
%TripletVar_675 53182 non-null values
%TripletVar_500 51683 non-null values
%TripletVar_440 53721 non-null values
%TripletVar_380 51860 non-null values
%TripletVar_340 52846 non-null values
440-870Angstrom 53732 non-null values
380-500Angstrom 52253 non-null values
440-675Angstrom 53732 non-null values
500-870Angstrom 53732 non-null values
340-440Angstrom 53277 non-null values
Last_Processing_Date(dd/mm/yyyy) 53732 non-null values
Solar_Zenith_Angle 53732 non-null values
dtypes: datetime64[ns](1), float64(22), object(1)
I want to create two new columns for this dataframe based on applying a function to each row of the dataframe. I don't want to have to call the function multiple times (e.g. by doing two separate apply calls) as it is rather computationally intensive. I have tried doing this in two ways, and neither of them works:
Using apply:
I have written a function which takes a Series and returns a tuple of the values I want:
def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return (a, b)
Trying to apply this to the DataFrame gives an error:
st.apply(calculate, axis=1)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-248-acb7a44054a7> in <module>()
----> 1 st.apply(calculate, axis=1)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
4191 return self._apply_raw(f, axis)
4192 else:
-> 4193 return self._apply_standard(f, axis)
4194 else:
4195 return self._apply_broadcast(f, axis)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures)
4274 index = None
4275
-> 4276 result = self._constructor(data=results, index=index)
4277 result.rename(columns=dict(zip(range(len(res_index)), res_index)),
4278 inplace=True)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
390 mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
391 elif isinstance(data, dict):
--> 392 mgr = self._init_dict(data, index, columns, dtype=dtype)
393 elif isinstance(data, ma.MaskedArray):
394 mask = ma.getmaskarray(data)
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
521
522 return _arrays_to_mgr(arrays, data_names, index, columns,
--> 523 dtype=dtype)
524
525 def _init_ndarray(self, values, index, columns, dtype=None,
C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5411
5412 # consolidate for now
-> 5413 mgr = BlockManager(blocks, axes)
5414 return mgr.consolidate()
5415
C:\Python27\lib\site-packages\pandas\core\internals.pyc in __init__(self, blocks, axes, do_integrity_check)
802
803 if do_integrity_check:
--> 804 self._verify_integrity()
805
806 self._consolidate_check()
C:\Python27\lib\site-packages\pandas\core\internals.pyc in _verify_integrity(self)
892 "items")
893 if block.values.shape[1:] != mgr_shape[1:]:
--> 894 raise AssertionError('Block shape incompatible with manager')
895 tot_items = sum(len(x.items) for x in self.blocks)
896 if len(self.items) != tot_items:
AssertionError: Block shape incompatible with manager
I was then going to assign the values returned from apply to two new columns using the method shown in this question. However, I can't even get to this point! This all works fine if I just return one value.
Using a loop:
I first created two new columns of the dataframe and set them to None:
st['a'] = None
st['b'] = None
Then I looped over all of the indices and tried to modify the None values that I'd put in there, but the modifications didn't seem to work. That is, no error was generated, but the DataFrame didn't seem to be modified.
for i in st.index:
    # do calc here
    st.ix[i]['a'] = a
    st.ix[i]['b'] = b
I thought that both of these methods would work, but neither of them did. So, what am I doing wrong here? And what is the best, most 'pythonic' and 'pandaonic' way to do this?
Accepted answer by Garrett
To make the first approach work, try returning a Series instead of a tuple (apply is throwing an exception because it doesn't know how to glue the rows back together as the number of columns doesn't match the original frame).
import pandas as pd

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series(dict(col1=a, col2=b))
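As a rough sketch of how the Series-returning function might be used end to end: the small frame below is made-up illustration data (only the 'path' and 'row' column names come from the question), and joining the result back with join is one possible choice, not necessarily what the answerer had in mind.

```python
import pandas as pd

def calculate(s):
    a = s['path'] + 2 * s['row']  # simple calc for example
    b = s['path'] * 0.153
    # Returning a Series lets apply build one output column per key
    return pd.Series(dict(col1=a, col2=b))

# Hypothetical stand-in for the real `st` frame
st = pd.DataFrame({'path': [1, 2], 'row': [10, 20]})

# calculate runs once per row; the result is a DataFrame with col1 and col2
new_cols = st.apply(calculate, axis=1)

# Attach both new columns in one step, aligned on the index
st = st.join(new_cols)
```

The function is only called once per row here, which is the constraint stated in the question.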
The second approach should work if you replace:
st.ix[i]['a'] = a
with:
st.ix[i, 'a'] = a
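The chained form st.ix[i]['a'] most likely failed because st.ix[i] returns a copy of the row, so the assignment never reaches the original frame. Note also that .ix was deprecated and then removed in pandas 1.0. A minimal sketch of the same loop using .loc, which accepts the same (row label, column label) pair, could look like this; the sample frame is hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the real `st` frame
st = pd.DataFrame({'path': [1, 2, 3], 'row': [10, 20, 30]})
st['a'] = None
st['b'] = None

for i in st.index:
    # do calc here (same simple example calculation as in the question)
    a = st.loc[i, 'path'] + 2 * st.loc[i, 'row']
    b = st.loc[i, 'path'] * 0.153
    # A single indexing operation writes into the frame itself, not a copy
    st.loc[i, 'a'] = a
    st.loc[i, 'b'] = b
```

A Python-level loop like this is usually much slower than apply or vectorised column arithmetic on a frame of this size.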
Answer by SebastianNeubauer
This was solved here: Apply pandas function to column to create multiple new columns?
Applied to your question this should work:
def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series({'col1': a, 'col2': b})
st = st.merge(st.apply(calculate, axis=1), left_index=True, right_index=True)
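If you would rather keep the original tuple-returning function, newer pandas versions (0.23 and later, as far as I know) accept result_type='expand' in apply, which spreads each returned tuple across columns. This is a sketch under that assumption, with a made-up sample frame and with 'a' and 'b' chosen as column names for illustration:

```python
import pandas as pd

def calculate(s):
    a = s['path'] + 2 * s['row']  # simple calc for example
    b = s['path'] * 0.153
    return (a, b)  # plain tuple, as in the question

# Hypothetical stand-in for the real `st` frame
st = pd.DataFrame({'path': [1, 2], 'row': [10, 20]})

# result_type='expand' turns each tuple into one row of a new DataFrame
new_cols = st.apply(calculate, axis=1, result_type='expand')
new_cols.columns = ['a', 'b']  # expanded columns are numbered 0, 1 by default

st = st.join(new_cols)
```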
Answer by Russell_A
I always use lambdas and the built-in map() function to create new columns by combining other columns:
# list() is needed on Python 3, where map() returns an iterator
st['a'] = list(map(lambda path, row: path + 2 * row, st['path'], st['row']))
It might be slightly more complicated than necessary for doing linear combinations of numerical columns. On the other hand, I feel it's a good convention to adopt, as it can be used with more complicated combinations of columns (e.g. working with strings) or for filling missing data in a column using functions of the other columns.
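The same map() idea can be stretched to fill two columns from a single pass, which matches the question's requirement of not calling the function twice. The sketch below assumes a hypothetical calculate(path, row) that takes scalars rather than a row Series, and uses made-up sample data:

```python
import pandas as pd

def calculate(path, row):
    # Hypothetical scalar version of the question's example calculation
    return path + 2 * row, path * 0.153

# Hypothetical stand-in for the real `st` frame
st = pd.DataFrame({'path': [1, 2], 'row': [10, 20]})

# map() calls calculate once per row; zip(*...) transposes the resulting
# (a, b) pairs into one tuple of a-values and one tuple of b-values
st['a'], st['b'] = zip(*map(calculate, st['path'], st['row']))
```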
For example, let's say you have a table with columns gender and title, and some of the titles are missing. You can fill them with a function as follows:
title_dict = {'male': 'mr.', 'female': 'ms.'}
# Keep existing titles; look up a default by gender where the title is None
table['title'] = list(map(
    lambda title, gender: title if title is not None else title_dict[gender],
    table['title'], table['gender']))
Answer by Dra?ko Koki?
Yet another solution based on Assigning New Columns in Method Chains:
st.assign(a = st['path'] + 2*st['row'], b = st['path'] * 0.153)
Be aware that assign always returns a copy of the data, leaving the original DataFrame untouched.
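A short sketch of what that means in practice, using a made-up sample frame: the new columns exist only on the returned object, so the result has to be bound to a name (or assigned back to st) to be kept.

```python
import pandas as pd

# Hypothetical stand-in for the real `st` frame
st = pd.DataFrame({'path': [1, 2], 'row': [10, 20]})

# assign builds a new DataFrame; `st` itself is not modified
result = st.assign(a=st['path'] + 2 * st['row'], b=st['path'] * 0.153)

original_columns = list(st.columns)    # still just path and row
result_columns = list(result.columns)  # path, row, a, b
```

Unlike the apply-based answers, this computes each column with vectorised arithmetic, so it only fits when the calculation can be expressed on whole columns.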

