Python: apply a function to each row of a pandas DataFrame to create two new columns

Question

Asked by robintw

I have a pandas DataFrame, st, containing multiple columns:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss)       53732  non-null values
Julian_Day                          53732  non-null values
AOT_1020                            53716  non-null values
AOT_870                             53732  non-null values
AOT_675                             53188  non-null values
AOT_500                             51687  non-null values
AOT_440                             53727  non-null values
AOT_380                             51864  non-null values
AOT_340                             52852  non-null values
Water(cm)                           51687  non-null values
%TripletVar_1020                    53710  non-null values
%TripletVar_870                     53726  non-null values
%TripletVar_675                     53182  non-null values
%TripletVar_500                     51683  non-null values
%TripletVar_440                     53721  non-null values
%TripletVar_380                     51860  non-null values
%TripletVar_340                     52846  non-null values
440-870Angstrom                     53732  non-null values
380-500Angstrom                     52253  non-null values
440-675Angstrom                     53732  non-null values
500-870Angstrom                     53732  non-null values
340-440Angstrom                     53277  non-null values
Last_Processing_Date(dd/mm/yyyy)    53732  non-null values
Solar_Zenith_Angle                  53732  non-null values
dtypes: datetime64[ns](1), float64(22), object(1)

I want to create two new columns for this DataFrame by applying a function to each row. I don't want to call the function more than once per row (e.g. by making two separate apply calls), as it is rather computationally intensive. I have tried two approaches, and neither of them works:

Using apply:

I have written a function which takes a Series and returns a tuple of the values I want:

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return (a, b)

Trying to apply this to the DataFrame gives an error:

st.apply(calculate, axis=1)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-248-acb7a44054a7> in <module>()
----> 1 st.apply(calculate, axis=1)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
   4191                     return self._apply_raw(f, axis)
   4192                 else:
-> 4193                     return self._apply_standard(f, axis)
   4194             else:
   4195                 return self._apply_broadcast(f, axis)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures)
   4274                 index = None
   4275 
-> 4276             result = self._constructor(data=results, index=index)
   4277             result.rename(columns=dict(zip(range(len(res_index)), res_index)),
   4278                           inplace=True)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
    390             mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
    391         elif isinstance(data, dict):
--> 392             mgr = self._init_dict(data, index, columns, dtype=dtype)
    393         elif isinstance(data, ma.MaskedArray):
    394             mask = ma.getmaskarray(data)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
    521 
    522         return _arrays_to_mgr(arrays, data_names, index, columns,
--> 523                               dtype=dtype)
    524 
    525     def _init_ndarray(self, values, index, columns, dtype=None,

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5411 
   5412     # consolidate for now
-> 5413     mgr = BlockManager(blocks, axes)
   5414     return mgr.consolidate()
   5415 

C:\Python27\lib\site-packages\pandas\core\internals.pyc in __init__(self, blocks, axes, do_integrity_check)
    802 
    803         if do_integrity_check:
--> 804             self._verify_integrity()
    805 
    806         self._consolidate_check()

C:\Python27\lib\site-packages\pandas\core\internals.pyc in _verify_integrity(self)
    892                                      "items")
    893             if block.values.shape[1:] != mgr_shape[1:]:
--> 894                 raise AssertionError('Block shape incompatible with manager')
    895         tot_items = sum(len(x.items) for x in self.blocks)
    896         if len(self.items) != tot_items:

AssertionError: Block shape incompatible with manager

I was then going to assign the values returned from apply to two new columns using the method shown in this question. However, I can't even get to this point! This all works fine if I just return one value.

Using a loop:

I first created two new columns of the dataframe and set them to None:

st['a'] = None
st['b'] = None

Then I looped over all of the indices and tried to modify these None values, but the modifications didn't seem to take effect. That is, no error was raised, but the DataFrame didn't seem to be modified.

for i in st.index:
    # do calc here
    st.ix[i]['a'] = a
    st.ix[i]['b'] = b

I thought that both of these methods would work, but neither of them did. So, what am I doing wrong here? And what is the best, most 'pythonic' and 'pandaonic' way to do this?

Answer 1

Accepted answer by Garrett

To make the first approach work, try returning a Series instead of a tuple (apply is throwing an exception because it doesn't know how to glue the rows back together as the number of columns doesn't match the original frame).

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series(dict(col1=a, col2=b))
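
A minimal sketch of the Series-returning approach on a small made-up frame. The three-row `st` below is an assumption standing in for the real data, which would need `path` and `row` columns, and the names `with_series`, `calculate_tuple` and `expanded` are illustrative only. The second half shows an alternative that keeps the tuple return value from the question, assuming pandas 0.23 or later, where `apply` accepts `result_type="expand"`:

```python
import pandas as pd

# Hypothetical stand-in for the real `st`; only `path` and `row` matter here.
st = pd.DataFrame({"path": [1, 2, 3], "row": [10, 20, 30]})

def calculate(s):
    a = s["path"] + 2 * s["row"]  # simple calc for example
    b = s["path"] * 0.153
    return pd.Series({"col1": a, "col2": b})

# Returning a Series makes apply build a DataFrame with one column per key.
new_cols = st.apply(calculate, axis=1)

# Attach both new columns at once, aligned on the index.
with_series = pd.concat([st, new_cols], axis=1)

def calculate_tuple(s):
    # Same calculation, returning a plain tuple as in the question.
    return (s["path"] + 2 * s["row"], s["path"] * 0.153)

# result_type="expand" spreads each tuple across columns labelled 0 and 1.
expanded = st.apply(calculate_tuple, axis=1, result_type="expand")
expanded.columns = ["col1", "col2"]
```

In both variants the function is applied in a single `apply` pass, so the expensive calculation is not repeated for the second column.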

The second approach should work if you replace:

st.ix[i]['a'] = a

with:

st.ix[i, 'a'] = a
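
The reason the original loop had no effect is that `st.ix[i]['a'] = a` is chained indexing: `st.ix[i]` first produces a separate row object, and the assignment then modifies that temporary object instead of `st`. Note also that `.ix` was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions the same fix would be written with `.loc`. A rough sketch, assuming a small made-up frame in place of the real `st` and initialising the new columns with NaN rather than None to keep them numeric:

```python
import pandas as pd

# Hypothetical stand-in for the real `st`.
st = pd.DataFrame({"path": [1, 2, 3], "row": [10, 20, 30]})
st["a"] = float("nan")
st["b"] = float("nan")

for i in st.index:
    a = st.loc[i, "path"] + 2 * st.loc[i, "row"]
    b = st.loc[i, "path"] * 0.153
    # A single indexing call with (row label, column label) writes into st itself.
    st.loc[i, "a"] = a
    st.loc[i, "b"] = b
```

This loop is shown only to illustrate the indexing fix; for large frames it is likely to be much slower than `apply` or a vectorized expression.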

Answer 2

Answer by SebastianNeubauer

This was solved here: Apply pandas function to column to create multiple new columns?

Applied to your question this should work:

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series({'col1': a, 'col2': b})

df = df.merge(df.apply(calculate, axis=1), left_index=True, right_index=True)
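
As a runnable illustration of the merge above, here is a sketch on a small made-up frame (the data and the names `new_cols`, `merged` and `joined` are assumptions for the example). Because both frames share the same index, `DataFrame.join` should give the same columns with slightly less typing:

```python
import pandas as pd

# Hypothetical data; the real frame would need `path` and `row` columns.
df = pd.DataFrame({"path": [1, 2, 3], "row": [10, 20, 30]})

def calculate(s):
    a = s["path"] + 2 * s["row"]  # simple calc for example
    b = s["path"] * 0.153
    return pd.Series({"col1": a, "col2": b})

# One apply pass yields a frame holding only the two new columns.
new_cols = df.apply(calculate, axis=1)

# Index-on-index merge, as in the answer.
merged = df.merge(new_cols, left_index=True, right_index=True)

# join() aligns on the index by default.
joined = df.join(new_cols)
```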

Answer 3

Answer by Russell_A

I always use lambdas and the built-in map() function to create new columns by combining other columns:

# list() is needed on Python 3, where map() returns a lazy iterator
st['a'] = list(map(lambda path, row: path + 2 * row, st['path'], st['row']))

It might be slightly more complicated than necessary for linear combinations of numerical columns. On the other hand, I find it a good convention to adopt, as it also works for more complicated combinations of columns (e.g. working with strings) or for filling missing data in one column using functions of the other columns.

For example, let's say you have a table with the columns gender and title, and some of the titles are missing. You can fill them with a function as follows:

title_dict = {'male': 'mr.', 'female': 'ms.'}
# list() is needed on Python 3, where map() returns a lazy iterator
table['title'] = list(map(
    lambda title, gender: title if title is not None else title_dict[gender],
    table['title'], table['gender']))
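
The same fill can probably be expressed with pandas' own methods instead of the built-in `map()`. The sketch below assumes a small made-up `table` and uses `Series.map` with a dict plus `Series.fillna`, which also treats NaN as missing (the lambda above only checks for None); `defaults` is an illustrative name:

```python
import pandas as pd

title_dict = {"male": "mr.", "female": "ms."}

# Hypothetical table with one missing title.
table = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "title": ["dr.", None, "prof."],
})

# Look up a default title for every row from its gender.
defaults = table["gender"].map(title_dict)

# Use the default only where the existing title is missing.
table["title"] = table["title"].fillna(defaults)
```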

Answer 4

Answer by Dra?ko Koki?

Yet another solution based on Assigning New Columns in Method Chains:

st.assign(a = st['path'] + 2*st['row'], b = st['path'] * 0.153)

Be aware that assign always returns a copy of the data, leaving the original DataFrame untouched.
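
A short sketch of that copy behaviour, using a small made-up frame in place of the real `st` (`result` is an illustrative name). To keep the new columns, the return value has to be bound to a name, for example by reassigning it to `st`:

```python
import pandas as pd

# Hypothetical stand-in for the real `st`.
st = pd.DataFrame({"path": [1, 2, 3], "row": [10, 20, 30]})

# assign() builds a new frame with the extra columns...
result = st.assign(a=st["path"] + 2 * st["row"], b=st["path"] * 0.153)

# ...while `st` itself still has only its original columns.
original_columns = list(st.columns)
```

Note that this variant computes both columns with vectorized column arithmetic, so it only applies when the calculation can be written that way instead of as a per-row function.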
