Understanding lambda functions in pandas
Source: http://stackoverflow.com/questions/49069624/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by thileepan
I'm trying to solve a problem for a Python course and found that someone has published solutions to the same problem on GitHub. I'm just trying to understand the solution given there.
I have a pandas dataframe called Top15 with 15 countries and one of the columns in the dataframe is 'HighRenew'. This column stores the % of renewable energy used in each country. My task is to convert the column values in 'HighRenew' column into boolean datatype.
If the value for a particular country is higher than the median renewable energy percentage across all 15 countries, I should encode it as 1; otherwise it should be 0. The 'HighRenew' column is sliced out of the dataframe as a Series, which is copied below.
Country
China True
United States False
Japan False
United Kingdom False
Russian Federation True
Canada True
Germany True
India False
France True
South Korea False
Italy True
Spain True
Iran False
Australia False
Brazil True
Name: HighRenew, dtype: bool
The GitHub solution is implemented in 3 steps, of which I understand the first 2 but not the last one, where a lambda function is used. Can someone explain how this lambda function works?
median_value = Top15['% Renewable'].median()
Top15['HighRenew'] = Top15['% Renewable']>=median_value
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda x:1 if x else 0)
Accepted answer by jezrael
I think apply is a loop under the hood; it is better to use the vectorized astype, which converts True to 1 and False to 0:
Top15['HighRenew'] = (Top15['% Renewable']>=median_value).astype(int)
lambda x:1 if x else 0 is an anonymous function (lambda function) with a condition: if x is True it returns 1, otherwise it returns 0.
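To make that reading concrete, here is a minimal plain-Python sketch. The names as_lambda, to_int and flags are hypothetical and used only for illustration; they are not part of the original solution.

```python
# The lambda from the question, bound to a name only so it can be compared.
as_lambda = lambda x: 1 if x else 0

# Hypothetical named equivalent of the same logic.
def to_int(x):
    if x:
        return 1
    else:
        return 0

flags = [True, False, True]
lambda_result = [as_lambda(x) for x in flags]
named_result = [to_int(x) for x in flags]

print(lambda_result)  # [1, 0, 1]
print(named_result)   # [1, 0, 1]
```

Both versions do the same thing; the lambda simply skips giving the function a name.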
For more information about lambda functions, check this answer.
Answer by jpp
lambda represents an anonymous (i.e. unnamed) function. If it is used with pd.Series.apply, each element of the series is fed into the lambda function. The result will be another pd.Series with each element run through the lambda.
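As a rough illustration of that element-by-element behaviour, the sketch below uses a small made-up boolean Series in place of Top15['HighRenew']; the data and the variable names are assumptions chosen for the example.

```python
import pandas as pd

# Made-up stand-in for Top15['HighRenew'] (three countries only).
high_renew = pd.Series(
    [True, False, True],
    index=["China", "Japan", "Brazil"],
    name="HighRenew",
)

# apply calls the lambda once per element and collects the results in a new Series.
via_apply = high_renew.apply(lambda x: 1 if x else 0)

# Vectorised alternative: cast the whole boolean Series to integers in one step.
via_astype = high_renew.astype(int)

print(via_apply.tolist())   # [1, 0, 1]
print(via_astype.tolist())  # [1, 0, 1]
```

The index is preserved in both cases, so either result can be assigned straight back to a dataframe column.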
apply + lambda is just a thinly veiled loop. You should prefer vectorised functionality where possible. @jezrael offers such a vectorised solution.
The equivalent in regular Python is below, given a list lst. Here each element of lst is passed through the lambda function and the results are aggregated in a list.
list(map(lambda x: 1 if x else 0, lst))
It is a Pythonic idiom to test for "truthy" values using if x rather than if x == True; see this answer for more information on what is considered True.
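The difference between the two tests shows up with values that are not literally True or False. The following is a small sketch with arbitrary sample values chosen for illustration:

```python
# Arbitrary sample values: the first five are truthy, the last five are falsy.
values = [True, 1, 2, "text", [0], False, 0, "", [], None]

# Truthiness test, as in the lambda from the question.
truthy = [1 if x else 0 for x in values]

# Equality test against True: only True and 1 compare equal to True.
equals_true = [1 if x == True else 0 for x in values]

print(truthy)       # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(equals_true)  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

For a column that already holds only booleans, as in the question, both tests give the same answer; if x is simply the shorter and more conventional spelling.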
Answer by Brandon Barney
Instead of using workarounds or lambdas, just use pandas' built-in functionality meant for this problem. The approach is called masking; in essence, we apply comparison operators to a Series (a column of a df) to get the boolean values:
import pandas as pd
import numpy as np
foo = [{
'Country': 'Germany',
'Percent Renew': 100
}, {
'Country': 'Australia',
'Percent Renew': 75
}, {
'Country': 'China',
'Percent Renew': 25
}, {
'Country': 'USA',
'Percent Renew': 5
}]
df = pd.DataFrame(foo, index=pd.RangeIndex(0, len(foo)))
df
#| Country | Percent Renew |
#| Germany | 100 |
#| Australia | 75 |
#| China | 25 |
#| USA | 5 |
np.mean(df['Percent Renew'])
# 51.25
df['Better Than Average'] = df['Percent Renew'] > np.mean(df['Percent Renew'])
#| Country | Percent Renew | Better Than Average |
#| Germany | 100 | True
#| Australia | 75 | True
#| China | 25 | False
#| USA | 5 | False
The specific reason I propose this over the other solutions is that masking can be used for a host of other purposes as well. I won't get into them here, but once you learn that pandas supports this kind of functionality, it becomes a lot easier to perform other data manipulations in pandas.
EDIT: I read "needing boolean datatype" as needing True/False and not the encoded version 1 and 0. If the encoded version is what is needed, the astype that was proposed will convert the booleans to integer values. For masking purposes, though, the True/False values are needed for slicing.
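As a sketch of both uses, the example below reuses the sample data from this answer. The variable names mask and above_average are assumptions introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Germany", "Australia", "China", "USA"],
    "Percent Renew": [100, 75, 25, 5],
})

# Boolean mask: True where the country is above the mean (51.25).
mask = df["Percent Renew"] > df["Percent Renew"].mean()

# Slicing needs the True/False values: keep only the rows where mask is True.
above_average = df[mask]

# If the 1/0 encoding is wanted instead, cast the mask to integers.
df["Better Than Average"] = mask.astype(int)

print(above_average["Country"].tolist())   # ['Germany', 'Australia']
print(df["Better Than Average"].tolist())  # [1, 1, 0, 0]
```

Keeping the mask as booleans until the last step means the same Series can serve for filtering and for the encoded column.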