Python: computing a new column in a Pandas DataFrame

Question

Asked by davo1979

While there are some similar questions, I can't find a straightforward answer to the following. Note that I am coming from R, and quite new to Pandas.


Say I have a Pandas dataframe, df, that contains two columns: "measure" (unicode with 3 levels) and "Airquality" (numpy.float64).


I want to create a third column named "color", that is based on values in "Airquality". Further, I want to do this separately for each level of "measure". I have succeeded by splitting the df on "measure" using df.loc. I then calculated "color" separately in each df using the following code:


# Initialize the "color" column in df for each "measure" level:
df['color'] = None

# Find the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()

# Loop through the rows, calculating and assigning the value for color,
# again in df for each "measure" level:
for i in range(len(df['Airquality'])):
    df.loc[i, 'color'] = int(100 * df['Airquality'][i] / maxi)

However, this runs quite slowly with the large dataset I'm working with, and I'm sure there must be a much better way...probably using some Pandas function and likely without splitting the df into three, one for each "measure" level. Posting this in the hopes of learning from one of the many Python geniuses.


Answer 1

Answered by wanaryytel

I'm hardly a genius, but I'd go with pandas apply. Usage looks like this:


df['newcol'] = df.apply(lambda row: row['firstcolval'] * row['secondcolval'], axis=1)

More info in the docs as usual: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
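
Applied to the column names from the question, a row-wise apply might look like the sketch below. The sample values are made up for illustration, and the maximum is taken over the whole frame here, not per "measure" level. Note that apply with axis=1 calls the function once per row, so on a large dataset it may not be much faster than an explicit loop.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "measure": ["a", "a", "b", "a", "c", "c"],
    "Airquality": [10.0, 20.0, 30.0, 20.0, 30.0, 50.0],
})

# Maximum over the whole frame (not per "measure" level).
maxi = df["Airquality"].max()

# axis=1 passes each row to the lambda as a Series.
df["color"] = df.apply(lambda row: int(100 * row["Airquality"] / maxi), axis=1)
print(df)
```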


Answer 2

Answered by DSM

I think you can use the groupby tools, in particular transform. Starting from a frame (BTW, it's customary to provide an example dataframe yourself):


In [21]: df = pd.DataFrame({"measure": ["a","a","b","a","c","c"],
    ...:                    "aq": [10,20,30,20,30,50]})

In [22]: df["colour"] = (100.0 * df["aq"] /
    ...:                 df.groupby("measure")["aq"].transform(max))

In [23]: df
Out[23]: 
   aq measure  colour
0  10       a    50.0
1  20       a   100.0
2  30       b   100.0
3  20       a   100.0
4  30       c    60.0
5  50       c   100.0

which works because we get the right denominator by grouping on the measure column, finding the maximum of the aq column for each different value of measure, and broadcasting it up to the whole frame, which is what this does:


In [24]: df.groupby("measure")["aq"].transform(max)
Out[24]: 
0    20
1    20
2    30
3    20
4    50
5    50
Name: aq, dtype: int64
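
Adapting this to the column names in the question ("measure", "Airquality", "color") could look like the following sketch. The sample values are invented, and "max" is passed as a string, which pandas accepts as an aggregation name. The final astype(int) truncates toward zero like the int() call in the original loop, and assumes there are no missing values.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "measure": ["a", "a", "b", "a", "c", "c"],
    "Airquality": [10.0, 20.0, 30.0, 20.0, 30.0, 50.0],
})

# Per-row maximum of "Airquality" within that row's "measure" group.
group_max = df.groupby("measure")["Airquality"].transform("max")

# Vectorized percentage of the group maximum, truncated to an integer
# (astype(int) would raise if the column contained NaN).
df["color"] = (100 * df["Airquality"] / group_max).astype(int)
print(df)
```

This avoids both the explicit loop and splitting the frame into one piece per "measure" level.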
