pandas 寻找多个县之间最大的人口差异?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/41878666/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 02:52:19  来源:igfitidea点击:

Finding the largest difference in population among multiple counties?

pythonarrayspandas

提问by Dick Thompson

I'm learning pandas on python and can't seem to finish this problem. There are 6 population columns, POPESTIMATE2010 to POPESTIMATE 2016, and I need to find the county with the largest population change between these years. (e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.)

我正在 python 上学习Pandas,似乎无法完成这个问题。有 6 个人口列,POPESTIMATE2010 到 POPESTIMATE 2016,我需要找到这些年之间人口变化最大的县。(例如,如果 5 年期间的县人口为 100、120、80、105、100、130,则其在该期间的最大变化将是 |130-80| = 50。)

What I've done so far is manage to manipulate the data into an array and list, but I'm not sure which is better to solve this problem:

到目前为止我所做的是设法将数据操作到数组和列表中,但我不确定哪个更好地解决这个问题:

import numpy as np
def answer_seven():
    sumlev = census_df.SUMLEV.values == 50
    data = census_df[['POPESTIMATE2010', 'POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015', 'CTYNAME']].values[sumlev]
    s = pd.Series(data[:, 0], [data[:, 1], data[:, 2], data[:, 3], data[:, 4], data[:, 5], data[:, 6]], dtype=np.int64)
return data
answer_seven()

Output when I return the data:

返回数据时的输出:

array([[54660, 55253, 55175, ..., 55290, 55347, 'Autauga County'],
   [183193, 186659, 190396, ..., 199713, 203709, 'Baldwin County'],
   [27341, 27226, 27159, ..., 26815, 26489, 'Barbour County'],
   ..., 
   [21102, 20912, 20989, ..., 20903, 20822, 'Uinta County'],
   [8545, 8469, 8443, ..., 8316, 8328, 'Washakie County'],
   [7181, 7114, 7065, ..., 7185, 7234, 'Weston County']], dtype=object)

And I get a list when I return the s list:

当我返回 s 列表时,我会得到一个列表:

55253   55175   55038   55290   55347   Autauga County         54660
186659  190396  195126  199713  203709  Baldwin County        183193
27226   27159   26973   26815   26489   Barbour County         27341
22733   22642   22512   22549   22583   Bibb County            22861
57711   57776   57734   57658   57673   Blount County          57373
10629   10606   10628   10829   10696   Bullock County         10887
20673   20408   20261   20276   20154   Butler County          20944
117768  117286  116575  115993  115620  Calhoun County        118437
33993   34075   34153   34052   34123   Chambers County        34098
26080   26023   26084   25995   25859   Cherokee County        25976
43739   43697   43795   43921   43943   Chilton County         43665
13593   13543   13378   13289   13170   Choctaw County         13841
25570   25144   25116   24847   24675   Clarke County          25767
13670   13456   13467   13538   13555   Clay County            13880
14971   14921   15028   15072   15018   Cleburne County        14973
50448   51173   50755   50831   51211   Coffee County          50177
54443   54472   54471   54480   54354   Colbert County         54514
13121   12996   12875   12662   12672   Conecuh County         13208
11348   11195   11059   10807   10724   Coosa County           11758
38060   37818   37830   37888   37835   Covington County       37796
13896   13951   13932   13948   13963   Crenshaw County        13853
80469   80374   80756   81221   82005   Cullman County         80473
50109   50324   49833   49501   49565   Dale County            50358
43178   42777   42021   41662   41131   Dallas County          43803
71387   70942   70869   71012   71130   DeKalb County          71142
80012   80432   80883   81022   81468   Elmore County          79465
38213   38034   37857   37784   37789   Escambia County        38309
104236  104235  103852  103452  103057  Etowah County         104442
17062   16960   16857   16842   16759   Fayette County         17231
31729   31648   31507   31592   31696   Franklin County        31734
                                                               ...  

I've looked multiple forum posts, but I can't find anything that really relates to this. I know that the best way to do this is to create a 'HIGHEST' column and a 'LOWEST' column, then find the county with the largest difference, but I don't know how to find the max/min of values within an array. Really appreciate the help!

我查看了多个论坛帖子,但找不到与此真正相关的任何内容。我知道最好的方法是创建一个“最高”列和一个“最低”列,然后找到差异最大的县,但我不知道如何在一个范围内找到最大值/最小值大批。真的很感谢帮助!

回答by dotcs

Given the data you have mentioned (limited to just a few lines for demo purposes) let's first transform it into a proper DataFrame:

鉴于您提到的数据(仅限于演示目的的几行)让我们首先将其转换为适当的 DataFrame:

from io import StringIO

dataset = """\
55253   55175   55038   55290   55347   Autauga County         54660
186659  190396  195126  199713  203709  Baldwin County        183193
27226   27159   26973   26815   26489   Barbour County         27341
22733   22642   22512   22549   22583   Bibb County            22861
57711   57776   57734   57658   57673   Blount County          57373
"""

df = pd.DataFrame.from_csv(StringIO(dataset), sep='\s{2,}', header=None).reset_index()
df.columns = ['y1', 'y2', 'y3', 'y4', 'y5', 'name', 'y6']
df = df.set_index('name')
df.head()

                y1      y2      y3      y4      y5      y6
name                        
Autauga County  55253   55175   55038   55290   55347   54660
Baldwin County  186659  190396  195126  199713  203709  183193
Barbour County  27226   27159   26973   26815   26489   27341
Bibb County     22733   22642   22512   22549   22583   22861
Blount County   57711   57776   57734   57658   57673   57373

Then you can use numpy's minand maxmethods to calculate the minimum and maximum values within the dataset. Afterwards you can create a new DataFrame consisting of the largest diffs. No need for any loops within python which are slow compared to the optimised methods in pandas or numpy.

然后您可以使用 numpyminmax方法来计算数据集中的最小值和最大值。之后,您可以创建一个包含最大差异的新 DataFrame。与 Pandas 或 numpy 中的优化方法相比,python 中不需要任何缓慢的循环。

df2 = DataFrame((np.max(df.values, axis=1) - np.min(df.values, axis=1)), index=df.index, columns=['largest_diff'])
df2.head()

                largest_diff
name    
Autauga County  687
Baldwin County  20516
Barbour County  852
Bibb County     349
Blount County   403

回答by user3664441

I think this should solve your problem

我认为这应该可以解决您的问题

temp = census_df[census_df['SUMLEV'] == 50].set_index('CTYNAME')
yrs = ['POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
res = temp.loc[:,yrs].max(axis=1) - temp.loc[:,yrs].min(axis=1)
res.idxmax()

回答by Rich L

If you have your data in a pandas dataframe to begin with, then use the pandas min() and max() methods:

如果您的数据在 pandas 数据框中,请使用 pandas min() 和 max() 方法:

>>> df1
year:   2010    2011    2012    2013    2014
city                    
abilene 47000   2000    31000   72000   47000
boise   44000   55000   68000   17000   63000
calgary 39000   86000   6000    97000   1000
denver  57000   52000   46000   0       43000

>>> df1.T.max()-df1.T.min()
city
abilene    70000
boise      51000
calgary    96000
denver     57000
dtype: int32

回答by FamousJameous

Here is my naive implementation.

这是我天真的实现。

maxchange = (None,0)
for row in data:
    low = min(row[:-1])
    high = max(row[:-1])
    if high-low > maxchange[1]:
        maxchange = (row[-1], high-low)
print(maxchange)

This uses the dataarray created in answer_seven. This simply finds the min and the max for each county and finds the maximum difference between the counties.

这使用dataanswer_seven. 这只是找到每个县的最小值和最大值,并找到县之间的最大差异。

回答by epattaro

try this:

尝试这个:

def df_max_dif (x):

    max_dif = 0

    for ind in x.index:

        max_value = np.max(np.abs(x-x.loc[ind]))

        if max_value > max_dif:

            max_dif = max_value

    return max_dif

df['max_dif'] = np.nan

for indx in df.index:

    df.loc[indx,'max_dif'] = df_max_dif(df.loc[indx].drop('max_dif'))

hope it helps!

希望能帮助到你!

回答by Tolis

def my_idea():
    columns_to_keep = ['POPESTIMATE2015','POPESTIMATE2014','POPESTIMATE2013','POPESTIMATE2012','POPESTIMATE2011','POPESTIMATE2010']
    copy_df = census_df[columns_to_keep]

    # max_difference_per_country is a Series with sorted values from high to low
    max_difference_per_country = (copy_df.max(axis=1) - copy_df.min(axis=1)).sort_values(ascending=False)
    # get its index
    index_of_max_difference_per_country = max_difference_per_country.first_valid_index()
    return census_df['CTYNAME'].iloc[index_of_max_difference_per_country]

回答by MeysaM

The answer provided by @Tolis is does not exclude States and gives 'Texas' as the result. Correct code should be like this:

@Tolis 提供的答案不排除各州,结果为“德克萨斯州”。正确的代码应该是这样的:

def answer_seven():
     columns_to_keep = ['POPESTIMATE2015','POPESTIMATE2014','POPESTIMATE2013','POPESTIMATE2012','POPESTIMATE2011','POPESTIMATE2010']
     rows_to_keep = census_df[census_df['SUMLEV'] == 50]
     copy_df = rows_to_keep[columns_to_keep]

     # max_difference_per_country is a Series with sorted values from high to low
     max_difference_per_country = (copy_df.max(axis=1) - copy_df.min(axis=1)).sort_values(ascending=False)

     # get its index
     index_of_max_difference_per_country = max_difference_per_country.first_valid_index()

     return census_df['CTYNAME'].iloc[index_of_max_difference_per_country]

回答by yogs

def answer_seven():

    county = census_df[census_df['SUMLEV']==50]
    county= county.set_index('CTYNAME')
    req_col = ['POPESTIMATE2010',
                           'POPESTIMATE2011',
                           'POPESTIMATE2012',
                           'POPESTIMATE2013',
                           'POPESTIMATE2014',
                           'POPESTIMATE2015']
    countyP= county[req_col]

    res = (countyP[req_col].max(axis=1) - countyP[req_col].min(axis=1)).nlargest(1)


    return res.argmax()

answer_seven()

回答by u11490824

one line can make it

一根线就能搞定

def answer_seven():

    cols = [ 'POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015' ]

    new = census_df[ census_df['SUMLEV']==50 ].set_index('CTYNAME').apply( lambda x: np.max( x[cols]  - np.min( x[cols]) ), axis=1)

    return new.idxmax()