Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/49069624/

Time: 2020-09-14 05:15:17  Source: igfitidea

understanding lambda functions in pandas

python pandas lambda

Asked by thileepan

I'm trying to solve a problem for a course in Python and found that someone has implemented solutions for the same problem on GitHub. I'm just trying to understand the solution given there.

I have a pandas dataframe called Top15 with 15 countries and one of the columns in the dataframe is 'HighRenew'. This column stores the % of renewable energy used in each country. My task is to convert the column values in 'HighRenew' column into boolean datatype.

If the value for a particular country is higher than the median renewable energy percentage across all 15 countries, I should encode it as 1; otherwise it should be 0. The 'HighRenew' column is sliced out as a Series from the dataframe and copied below.

Country
China                  True
United States         False
Japan                 False
United Kingdom        False
Russian Federation     True
Canada                 True
Germany                True
India                 False
France                 True
South Korea           False
Italy                  True
Spain                  True
Iran                  False
Australia             False
Brazil                 True
Name: HighRenew, dtype: bool

The GitHub solution is implemented in 3 steps, of which I understand the first 2 but not the last one, where a lambda function is used. Can someone explain how this lambda function works?

median_value = Top15['% Renewable'].median()
Top15['HighRenew'] = Top15['% Renewable']>=median_value
Top15['HighRenew'] = Top15['HighRenew'].apply(lambda x:1 if x else 0)

Accepted answer by jezrael

I think apply is a loop under the hood, so it is better to use the vectorized astype, which converts True to 1 and False to 0:

Top15['HighRenew'] = (Top15['% Renewable']>=median_value).astype(int)
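
As a quick check, the sketch below compares the apply version from the question with the astype version on a small made-up frame. The three-row Top15 and its values are assumptions for illustration, not the asker's data.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's Top15 frame.
Top15 = pd.DataFrame(
    {"% Renewable": [10.0, 20.0, 30.0]},
    index=["A", "B", "C"],
)

median_value = Top15["% Renewable"].median()  # 20.0 for this sample

# Element-by-element version from the question.
via_apply = (Top15["% Renewable"] >= median_value).apply(lambda x: 1 if x else 0)

# Vectorized version: cast the boolean Series to integers in one call.
via_astype = (Top15["% Renewable"] >= median_value).astype(int)

print(via_astype.tolist())                        # [0, 1, 1]
print(via_apply.tolist() == via_astype.tolist())  # True
```

Both produce the same 0/1 encoding; the difference is only in how the work is done.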


lambda x:1 if x else 0

means an anonymous function (a lambda function) with a condition: if x is True, return 1; otherwise return 0.
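
To make that concrete, here is a minimal sketch of the same logic written as an ordinary named function. The names to_int_flag and encode are hypothetical, chosen only for this illustration.

```python
# A named function that behaves like `lambda x: 1 if x else 0`.
def to_int_flag(x):  # hypothetical name, for illustration only
    if x:
        return 1
    else:
        return 0

# The lambda from the question, bound to a name so it can be called directly.
encode = lambda x: 1 if x else 0

print(encode(True), encode(False))            # 1 0
print(to_int_flag(True), to_int_flag(False))  # 1 0
```

The lambda form is simply a way to write such a small function inline, without giving it a name.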

For more information about lambda functions, see this answer.

Answer by jpp

lambda represents an anonymous (i.e. unnamed) function. If it is used with pd.Series.apply, each element of the series is fed into the lambda function. The result will be another pd.Series with each element run through the lambda.

apply + lambda is just a thinly veiled loop. You should prefer vectorised functionality where possible. @jezrael offers such a vectorised solution.
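
The "thinly veiled loop" can be sketched as follows. The explicit loop is only a rough model of what apply does for a Python function, not pandas' actual implementation, and the Series s is made-up data.

```python
import pandas as pd

s = pd.Series([True, False, True], index=["a", "b", "c"])  # made-up data

# apply calls the lambda once for every element of the Series.
via_apply = s.apply(lambda x: 1 if x else 0)

# Rough equivalent: an explicit Python loop over the elements.
via_loop = pd.Series([1 if x else 0 for x in s], index=s.index)

# Vectorised alternative: a single cast, with no per-element Python call.
via_astype = s.astype(int)

print(via_apply.tolist())  # [1, 0, 1]
print(via_loop.tolist() == via_apply.tolist() == via_astype.tolist())  # True
```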

The equivalent in regular Python is below, given a list lst. Here each element of lst is passed through the lambda function and the results are collected in a list.

list(map(lambda x: 1 if x else 0, lst))

It is a Pythonic idiom to test for "truthy" values using if x rather than if x == True; see this answer for more information on what is considered True.
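
The difference between the two tests shows up on values that are truthy but not equal to True. A small sketch, with a made-up list of values:

```python
values = [True, 1, 2, "text", [0], False, 0, "", [], None]  # made-up examples

# Truthiness test, as in the lambda above.
truthy = list(map(lambda x: 1 if x else 0, values))

# Equality test against True: only True itself and 1 pass.
equals_true = list(map(lambda x: 1 if x == True else 0, values))

print(truthy)       # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(equals_true)  # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

For a Series that already holds only True and False, as in the question, both tests give the same result.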

Answer by Brandon Barney

Instead of using workarounds or lambdas, just use pandas' built-in functionality meant for this problem. The approach is called masking, and in essence we use comparison operators against a Series (a column of a df) to get the boolean values:

import pandas as pd
import numpy as np

foo = [{
    'Country': 'Germany',
    'Percent Renew': 100
}, {
    'Country': 'Australia',
    'Percent Renew': 75
}, {
    'Country': 'China',
    'Percent Renew': 25
}, {
    'Country': 'USA',
    'Percent Renew': 5
}]

df = pd.DataFrame(foo, index=pd.RangeIndex(0, len(foo)))

df

#| Country   | Percent Renew |
#| Germany   | 100           |
#| Australia | 75            |
#| China     | 25            |
#| USA       | 5             |

np.mean(df['Percent Renew'])
# 51.25

df['Better Than Average'] = df['Percent Renew'] > np.mean(df['Percent Renew'])

#| Country   | Percent Renew | Better Than Average |
#| Germany   | 100           | True                |
#| Australia | 75            | True                |
#| China     | 25            | False               |
#| USA       | 5             | False               |

The specific reason I propose this over the other solutions is that masking can be used for a host of other purposes as well. I won't get into them here, but once you learn that pandas supports this kind of functionality, it becomes a lot easier to perform other data manipulations in pandas.
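
As one hedged illustration of those other purposes, a boolean mask can select rows, be inverted, or be counted. The frame below rebuilds the four-row example; the variable names mask, above and below are assumptions made for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Germany", "Australia", "China", "USA"],
    "Percent Renew": [100, 75, 25, 5],
})

# Boolean mask: True where the value is above the mean (51.25 here).
mask = df["Percent Renew"] > df["Percent Renew"].mean()

# Use the mask to keep only the rows where it is True.
above = df[mask]
print(above["Country"].tolist())  # ['Germany', 'Australia']

# ~ inverts the mask, selecting the remaining rows.
below = df[~mask]
print(below["Country"].tolist())  # ['China', 'USA']

# Summing a boolean Series counts the True values.
print(int(mask.sum()))  # 2
```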

EDIT: I read "needing boolean datatype" as needing True/False rather than the encoded version, 1 and 0. If the encoded version is needed, the astype that was proposed will convert the booleans to integer values. For masking purposes, though, the True/False values are needed for slicing.
