pandas/Python: How to find the average of columns using the DataFrame apply method

Question

Asked by Tarik Hodzic

This is a question from the Udacity Data Science Nanodegree, and I can't figure it out. The instructions are:

Using the dataframe's apply method, create a new Series called avg_medal_count that indicates the average number of gold, silver, and bronze medals earned amongst countries who earned at least one medal of any kind at the 2014 Sochi Olympics.

The code I have currently is:

import numpy
from pandas import DataFrame, Series

def avg_medal_count():

    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea', 
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

    olympic_medal_counts = {'country_name':countries,
                            'gold': Series(gold),
                            'silver': Series(silver),
                            'bronze': Series(bronze)}    
    df = DataFrame(olympic_medal_counts)


    # YOUR CODE HERE


    return avg_medal_count

I have tried a couple different things such as:

avg_medal_count = df.apply(numpy.mean), but I get an error saying it could not convert the first column to numeric, which makes sense because the first column is a list of countries. How can I use df.apply on only the gold, silver and bronze columns? I have tried other variations, but nothing worked. I am pretty sure that I need to use a combination of df.apply and numpy.mean, because that is what I just learned about. Any thoughts?

Thanks!

Answer 1

Answered by Andrew

I would first modify how you import the data to:

df = DataFrame(olympic_medal_counts).set_index('country_name')

I would then calculate a new column containing the row sums, which gives the total number of medals per country.

df['medal total'] = df.sum(axis=1)

Results:

                   bronze  gold  silver  medal total
country_name                                     
Russian Fed.         9    13      11           33
Norway              10    11       5           26
Canada               5    10      10           25
United States       12     9       7           28
Netherlands          9     8       7           24
Germany              5     8       6           19
Switzerland          2     6       3           11
Belarus              1     5       0            6
Austria              5     4       8           17
France               7     4       4           15
Poland               1     4       1            6
China                2     3       4            9
Korea                2     3       3            8
Sweden               6     2       7           15
Czech Republic       2     2       4            8
Slovenia             4     2       2            8
Japan                3     1       4            8
Finland              1     1       3            5
Great Britain        2     1       1            4
Ukraine              1     1       0            2
Slovakia             0     1       0            1
Italy                6     0       2            8
Latvia               2     0       2            4
Australia            1     0       2            3
Croatia              0     0       1            1
Kazakhstan           1     0       0            1

Finally, subset the DataFrame to the rows with a medal total greater than or equal to 1 and find the average of each column.

df[df['medal total'] >= 1].apply(numpy.mean)

Results:

bronze          3.807692
gold            3.807692
silver          3.730769
medal total    11.346154

This could also be accomplished in one line using:

df[df.sum(axis=1) >= 1].apply(numpy.mean)
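
The steps in this answer can be put together as a short runnable sketch. To keep it brief, it uses three rows from the question's data plus one hypothetical zero-medal row (the country 'Nowhere' is made up, purely so the filter has something to drop); it is an illustration of the idea, not the graded solution.

```python
import numpy
from pandas import DataFrame

# Three rows taken from the question's data, plus one hypothetical
# zero-medal row ('Nowhere') added only so the filter removes something.
df = DataFrame({
    'country_name': ['Norway', 'Canada', 'Kazakhstan', 'Nowhere'],
    'gold':   [11, 10, 0, 0],
    'silver': [5, 10, 0, 0],
    'bronze': [10, 5, 1, 0],
}).set_index('country_name')

# With country_name as the index, every remaining column is numeric,
# so the row sum is each country's medal total.
has_medal = df.sum(axis=1) >= 1

# apply calls numpy.mean once per column of the filtered frame.
avg_medal_count = df[has_medal].apply(numpy.mean)
print(avg_medal_count)
```

Moving country_name into the index is what keeps the string column out of the calculation, so apply never sees non-numeric data.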

Answer 2

Answered by DataPsycho

I borrowed the subsetting idiom from R and used it in pandas, and it works. Try this code under # YOUR CODE HERE:

sub_df = df[(df.gold >= 1) | (df.silver >= 1) | (df.bronze >= 1)] ### subsetting the data frame
avg_count = sub_df.mean(axis=0) ### axis 0 for column wise mean

return avg_count

In a Python 3 IDE (such as PyCharm), you should use

return print(avg_count)

then call the function at the top level, outside the indented function body, to see the answer:

avg_medal_count()

Answer 3

Answered by Irtiza Nazar

Neither solution above uses apply as stated in the problem. Use the following:

# YOUR CODE HERE

sub_series = {'gold': df.gold, 
              'silver': df.silver,
              'bronze': df.bronze
             }

sub_df = DataFrame(sub_series)

avg_medal_count = sub_df.apply(numpy.mean)

return avg_medal_count

Applying numpy.mean to the original df will always raise an error because of the text column 'country_name'.

Answer 4

Answered by A------2

avg_medal_count = df[['gold', 'silver', 'bronze']].apply(numpy.mean)

You have to select the columns first, because the mean can only be computed for the numeric columns, not for country_name, which holds strings...
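
For reference, here is one way the column selection could be dropped into the question's function, as a minimal sketch. The row filter on the medal total is an addition taken from the exercise wording ("at least one medal of any kind"); the local name medals is just an illustrative choice.

```python
import numpy
from pandas import DataFrame, Series

def avg_medal_count():
    countries = ['Russian Fed.', 'Norway', 'Canada', 'United States',
                 'Netherlands', 'Germany', 'Switzerland', 'Belarus',
                 'Austria', 'France', 'Poland', 'China', 'Korea',
                 'Sweden', 'Czech Republic', 'Slovenia', 'Japan',
                 'Finland', 'Great Britain', 'Ukraine', 'Slovakia',
                 'Italy', 'Latvia', 'Australia', 'Croatia', 'Kazakhstan']

    gold = [13, 11, 10, 9, 8, 8, 6, 5, 4, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    silver = [11, 5, 10, 7, 7, 6, 3, 0, 8, 4, 1, 4, 3, 7, 4, 2, 4, 3, 1, 0, 0, 2, 2, 2, 1, 0]
    bronze = [9, 10, 5, 12, 9, 5, 2, 1, 5, 7, 1, 2, 2, 6, 2, 4, 3, 1, 2, 1, 0, 6, 2, 1, 0, 1]

    olympic_medal_counts = {'country_name': countries,
                            'gold': Series(gold),
                            'silver': Series(silver),
                            'bronze': Series(bronze)}
    df = DataFrame(olympic_medal_counts)

    # Select only the numeric medal columns so numpy.mean never sees strings.
    medals = df[['gold', 'silver', 'bronze']]

    # Keep countries with at least one medal of any kind.
    medals = medals[medals.sum(axis=1) >= 1]

    # apply calls numpy.mean once per column and returns a Series.
    return medals.apply(numpy.mean)

result = avg_medal_count()
print(result)
```

Because the selection happens before apply, the country_name column can stay in df for other uses.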

Answer 5

Answered by miltonsiqueira

avg_medal_count = df.mean()

Every country in this set has at least 1 medal, so there is no need to filter it. In case you need the filter anyway:

avg_medal_count = df[(df.gold + df.silver + df.bronze) > 0].mean()

pandas 0.22.0 documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html

DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
...
numeric_only : boolean, default None. Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
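
The documentation quoted above is for pandas 0.22.0, where the default numeric_only=None quietly falls back to the numeric columns. As far as I know, more recent pandas releases no longer do that fallback by default, so a bare df.mean() on a frame with a string column may raise an error there. Passing numeric_only=True explicitly should behave the same either way. A small sketch, using three rows from the question's data:

```python
from pandas import DataFrame

# Three rows from the question's data; country_name is a string column.
df = DataFrame({
    'country_name': ['Norway', 'Canada', 'Kazakhstan'],
    'gold':   [11, 10, 0],
    'silver': [5, 10, 0],
    'bronze': [10, 5, 1],
})

# Keep countries with at least one medal, then average only numeric columns.
with_medals = df[(df.gold + df.silver + df.bronze) > 0]
avg_medal_count = with_medals.mean(numeric_only=True)
print(avg_medal_count)
```

Note that this does not use apply, so it suits general use rather than the exercise as worded.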
