pandas 寻找多个县之间最大的人口差异?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/41878666/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Finding the largest difference in population among multiple counties?
提问by Dick Thompson
I'm learning pandas on python and can't seem to finish this problem. There are 6 population columns, POPESTIMATE2010 to POPESTIMATE 2016, and I need to find the county with the largest population change between these years. (e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.)
我正在 python 上学习Pandas,似乎无法完成这个问题。有 6 个人口列,POPESTIMATE2010 到 POPESTIMATE 2016,我需要找到这些年之间人口变化最大的县。(例如,如果 5 年期间的县人口为 100、120、80、105、100、130,则其在该期间的最大变化将是 |130-80| = 50。)
What I've done so far is manage to manipulate the data into an array and list, but I'm not sure which is better to solve this problem:
到目前为止我所做的是设法将数据操作到数组和列表中,但我不确定哪个更好地解决这个问题:
import numpy as np
def answer_seven():
sumlev = census_df.SUMLEV.values == 50
data = census_df[['POPESTIMATE2010', 'POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015', 'CTYNAME']].values[sumlev]
s = pd.Series(data[:, 0], [data[:, 1], data[:, 2], data[:, 3], data[:, 4], data[:, 5], data[:, 6]], dtype=np.int64)
return data
answer_seven()
Output when I return the data:
返回数据时的输出:
array([[54660, 55253, 55175, ..., 55290, 55347, 'Autauga County'],
[183193, 186659, 190396, ..., 199713, 203709, 'Baldwin County'],
[27341, 27226, 27159, ..., 26815, 26489, 'Barbour County'],
...,
[21102, 20912, 20989, ..., 20903, 20822, 'Uinta County'],
[8545, 8469, 8443, ..., 8316, 8328, 'Washakie County'],
[7181, 7114, 7065, ..., 7185, 7234, 'Weston County']], dtype=object)
And I get a list when I return the s list:
当我返回 s 列表时,我会得到一个列表:
55253 55175 55038 55290 55347 Autauga County 54660
186659 190396 195126 199713 203709 Baldwin County 183193
27226 27159 26973 26815 26489 Barbour County 27341
22733 22642 22512 22549 22583 Bibb County 22861
57711 57776 57734 57658 57673 Blount County 57373
10629 10606 10628 10829 10696 Bullock County 10887
20673 20408 20261 20276 20154 Butler County 20944
117768 117286 116575 115993 115620 Calhoun County 118437
33993 34075 34153 34052 34123 Chambers County 34098
26080 26023 26084 25995 25859 Cherokee County 25976
43739 43697 43795 43921 43943 Chilton County 43665
13593 13543 13378 13289 13170 Choctaw County 13841
25570 25144 25116 24847 24675 Clarke County 25767
13670 13456 13467 13538 13555 Clay County 13880
14971 14921 15028 15072 15018 Cleburne County 14973
50448 51173 50755 50831 51211 Coffee County 50177
54443 54472 54471 54480 54354 Colbert County 54514
13121 12996 12875 12662 12672 Conecuh County 13208
11348 11195 11059 10807 10724 Coosa County 11758
38060 37818 37830 37888 37835 Covington County 37796
13896 13951 13932 13948 13963 Crenshaw County 13853
80469 80374 80756 81221 82005 Cullman County 80473
50109 50324 49833 49501 49565 Dale County 50358
43178 42777 42021 41662 41131 Dallas County 43803
71387 70942 70869 71012 71130 DeKalb County 71142
80012 80432 80883 81022 81468 Elmore County 79465
38213 38034 37857 37784 37789 Escambia County 38309
104236 104235 103852 103452 103057 Etowah County 104442
17062 16960 16857 16842 16759 Fayette County 17231
31729 31648 31507 31592 31696 Franklin County 31734
...
I've looked multiple forum posts, but I can't find anything that really relates to this. I know that the best way to do this is to create a 'HIGHEST' column and a 'LOWEST' column, then find the county with the largest difference, but I don't know how to find the max/min of values within an array. Really appreciate the help!
我查看了多个论坛帖子,但找不到与此真正相关的任何内容。我知道最好的方法是创建一个“最高”列和一个“最低”列,然后找到差异最大的县,但我不知道如何在一个范围内找到最大值/最小值大批。真的很感谢帮助!
回答by dotcs
Given the data you have mentioned (limited to just a few lines for demo purposes) let's first transform it into a proper DataFrame:
鉴于您提到的数据(仅限于演示目的的几行)让我们首先将其转换为适当的 DataFrame:
from io import StringIO
dataset = """\
55253 55175 55038 55290 55347 Autauga County 54660
186659 190396 195126 199713 203709 Baldwin County 183193
27226 27159 26973 26815 26489 Barbour County 27341
22733 22642 22512 22549 22583 Bibb County 22861
57711 57776 57734 57658 57673 Blount County 57373
"""
df = pd.DataFrame.from_csv(StringIO(dataset), sep='\s{2,}', header=None).reset_index()
df.columns = ['y1', 'y2', 'y3', 'y4', 'y5', 'name', 'y6']
df = df.set_index('name')
df.head()
y1 y2 y3 y4 y5 y6
name
Autauga County 55253 55175 55038 55290 55347 54660
Baldwin County 186659 190396 195126 199713 203709 183193
Barbour County 27226 27159 26973 26815 26489 27341
Bibb County 22733 22642 22512 22549 22583 22861
Blount County 57711 57776 57734 57658 57673 57373
Then you can use numpy's min
and max
methods to calculate the minimum and maximum values within the dataset. Afterwards you can create a new DataFrame consisting of the largest diffs. No need for any loops within python which are slow compared to the optimised methods in pandas or numpy.
然后您可以使用 numpymin
和max
方法来计算数据集中的最小值和最大值。之后,您可以创建一个包含最大差异的新 DataFrame。与 Pandas 或 numpy 中的优化方法相比,python 中不需要任何缓慢的循环。
df2 = DataFrame((np.max(df.values, axis=1) - np.min(df.values, axis=1)), index=df.index, columns=['largest_diff'])
df2.head()
largest_diff
name
Autauga County 687
Baldwin County 20516
Barbour County 852
Bibb County 349
Blount County 403
回答by user3664441
I think this should solve your problem
我认为这应该可以解决您的问题
temp = census_df[census_df['SUMLEV'] == 50].set_index('CTYNAME')
yrs = ['POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
res = temp.loc[:,yrs].max(axis=1) - temp.loc[:,yrs].min(axis=1)
res.idxmax()
回答by Rich L
If you have your data in a pandas dataframe to begin with, then use the pandas min() and max() methods:
如果您的数据在 pandas 数据框中,请使用 pandas min() 和 max() 方法:
>>> df1
year: 2010 2011 2012 2013 2014
city
abilene 47000 2000 31000 72000 47000
boise 44000 55000 68000 17000 63000
calgary 39000 86000 6000 97000 1000
denver 57000 52000 46000 0 43000
>>> df1.T.max()-df1.T.min()
city
abilene 70000
boise 51000
calgary 96000
denver 57000
dtype: int32
回答by FamousJameous
Here is my naive implementation.
这是我天真的实现。
maxchange = (None,0)
for row in data:
low = min(row[:-1])
high = max(row[:-1])
if high-low > maxchange[1]:
maxchange = (row[-1], high-low)
print(maxchange)
This uses the data
array created in answer_seven
. This simply finds the min and the max for each county and finds the maximum difference between the counties.
这使用data
在answer_seven
. 这只是找到每个县的最小值和最大值,并找到县之间的最大差异。
回答by epattaro
try this:
尝试这个:
def df_max_dif (x):
max_dif = 0
for ind in x.index:
max_value = np.max(np.abs(x-x.loc[ind]))
if max_value > max_dif:
max_dif = max_value
return max_dif
df['max_dif'] = np.nan
for indx in df.index:
df.loc[indx,'max_dif'] = df_max_dif(df.loc[indx].drop('max_dif'))
hope it helps!
希望能帮助到你!
回答by Tolis
def my_idea():
columns_to_keep = ['POPESTIMATE2015','POPESTIMATE2014','POPESTIMATE2013','POPESTIMATE2012','POPESTIMATE2011','POPESTIMATE2010']
copy_df = census_df[columns_to_keep]
# max_difference_per_country is a Series with sorted values from high to low
max_difference_per_country = (copy_df.max(axis=1) - copy_df.min(axis=1)).sort_values(ascending=False)
# get its index
index_of_max_difference_per_country = max_difference_per_country.first_valid_index()
return census_df['CTYNAME'].iloc[index_of_max_difference_per_country]
回答by MeysaM
The answer provided by @Tolis is does not exclude States and gives 'Texas' as the result. Correct code should be like this:
@Tolis 提供的答案不排除各州,结果为“德克萨斯州”。正确的代码应该是这样的:
def answer_seven():
columns_to_keep = ['POPESTIMATE2015','POPESTIMATE2014','POPESTIMATE2013','POPESTIMATE2012','POPESTIMATE2011','POPESTIMATE2010']
rows_to_keep = census_df[census_df['SUMLEV'] == 50]
copy_df = rows_to_keep[columns_to_keep]
# max_difference_per_country is a Series with sorted values from high to low
max_difference_per_country = (copy_df.max(axis=1) - copy_df.min(axis=1)).sort_values(ascending=False)
# get its index
index_of_max_difference_per_country = max_difference_per_country.first_valid_index()
return census_df['CTYNAME'].iloc[index_of_max_difference_per_country]
回答by yogs
def answer_seven():
county = census_df[census_df['SUMLEV']==50]
county= county.set_index('CTYNAME')
req_col = ['POPESTIMATE2010',
'POPESTIMATE2011',
'POPESTIMATE2012',
'POPESTIMATE2013',
'POPESTIMATE2014',
'POPESTIMATE2015']
countyP= county[req_col]
res = (countyP[req_col].max(axis=1) - countyP[req_col].min(axis=1)).nlargest(1)
return res.argmax()
answer_seven()
回答by u11490824
one line can make it
一根线就能搞定
def answer_seven():
cols = [ 'POPESTIMATE2010','POPESTIMATE2011','POPESTIMATE2012','POPESTIMATE2013','POPESTIMATE2014','POPESTIMATE2015' ]
new = census_df[ census_df['SUMLEV']==50 ].set_index('CTYNAME').apply( lambda x: np.max( x[cols] - np.min( x[cols]) ), axis=1)
return new.idxmax()