Calculating a new column in a Pandas dataframe
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow (not to this page) and keep it under the same CC BY-SA license.
Original source: http://stackoverflow.com/questions/41842179/
Asked by davo1979
While there are some similar questions, I can't find a straightforward answer to the following. Note that I am coming from R, and quite new to Pandas.
Say I have a Pandas dataframe, df, that contains two columns: "measure" (unicode with 3 levels) and "Airquality" (numpy.float64).
I want to create a third column named "color", that is based on values in "Airquality". Further, I want to do this separately for each level of "measure". I have succeeded by splitting the df on "measure" using df.loc. I then calculated "color" separately in each df using the following code:
#initialize the column for "color" in df for each "measure" level:
df['color'] = None
#find the maximum value of "Airquality" in df for each "measure" level:
maxi = df['Airquality'].max()
#loop through the rows calculating and assigning the value for color,
#again, in df for each "measure" level
for i in range(len(df['Airquality'])):
    df.loc[i, 'color'] = int(100*df['Airquality'][i]/maxi)
However, this runs quite slowly with the large dataset I'm working with, and I'm sure there must be a much better way...probably using some Pandas function and likely without splitting the df into three, one for each "measure" level. Posting this in the hopes of learning from one of the many Python geniuses.
Answered by wanaryytel
I'm hardly a genius, but I'd go with pandas apply. Usage is, for example, as follows:
df['newcol'] = df.apply(lambda row: row['firstcolval'] * row['secondcolval'], axis=1)
More info in the docs as usual: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
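Applied to the columns from the question, a row-wise apply could look like the sketch below. The sample data is invented for illustration, and precomputing the per-"measure" maximum with a groupby is an assumption about how to adapt the one-liner above, not something the answer itself spells out. Keep in mind that apply with axis=1 calls the lambda once per row, so on a large frame it may not be much faster than an explicit loop.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "measure": ["a", "a", "b", "a", "c", "c"],
    "Airquality": [10.0, 20.0, 30.0, 20.0, 30.0, 50.0],
})

# Maximum "Airquality" per "measure" level, as a Series indexed by level.
group_max = df.groupby("measure")["Airquality"].max()

# Row-wise apply: scale each value by the maximum of its own group.
df["color"] = df.apply(
    lambda row: int(100 * row["Airquality"] / group_max[row["measure"]]),
    axis=1,
)
print(df)
```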
Answered by DSM
I think you can use the groupby tools, in particular transform. Starting from a frame (BTW, it's considered customary to present an example dataframe yourself):
In [21]: df = pd.DataFrame({"measure": ["a","a","b","a","c","c"],
    ...:                    "aq": [10,20,30,20,30,50]})
In [22]: df["colour"] = (100.0 * df["aq"] /
    ...:                 df.groupby("measure")["aq"].transform(max))
In [23]: df
Out[23]:
   aq measure  colour
0  10       a    50.0
1  20       a   100.0
2  30       b   100.0
3  20       a   100.0
4  30       c    60.0
5  50       c   100.0
which works because we get the right denominator by grouping on the measure column, finding the maximum of the aq column for each different value of measure, and broadcasting it up to the whole frame, which is what this does:
In [24]: df.groupby("measure")["aq"].transform(max)
Out[24]:
0    20
1    20
2    30
3    20
4    50
5    50
Name: aq, dtype: int64
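For reference, the same idea can be written as a standalone script using the column names from the question. This is a minimal sketch on made-up sample data. It passes the string alias "max" rather than the builtin max (newer pandas versions may warn about the builtin), and it uses astype(int) to mirror the int(...) truncation in the question's loop.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "measure": ["a", "a", "b", "a", "c", "c"],
    "Airquality": [10.0, 20.0, 30.0, 20.0, 30.0, 50.0],
})

# Per-group maximum, broadcast back to one value per original row.
group_max = df.groupby("measure")["Airquality"].transform("max")

# Vectorized percentage of the group maximum, truncated to an integer.
df["color"] = (100 * df["Airquality"] / group_max).astype(int)
print(df)
```

No explicit loop and no splitting of the frame is needed, because transform returns a Series aligned with the original index. The astype(int) step assumes "Airquality" has no missing values.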