Python: sorting by a custom list in pandas
Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/23482668/
sorting by a custom list in pandas
Asked by itjcms18
After reading through: http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.sort.html
I still can't seem to figure out how to sort a column by a custom list. Obviously, the default sort is alphabetical. I'll give an example. Here is my (very abridged) dataframe:
                Player  Year  Age   Tm   G
2967     Cedric Hunter  1991   27  CHH   6
5335     Maurice Baker  2004   25  VAN   7
13950      Ratko Varda  2001   22  TOT  60
6141        Ryan Bowen  2009   34  OKC  52
6169   Adrian Caldwell  1997   31  DAL  81
I want to be able to sort by Player, Year and then Tm. The default sort by Player and Year is fine for me, in normal order. However, I do not want Team sorted alphabetically b/c I want TOT always at the top.
Here is the list I created:
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
          'WAS', 'WSB']
After reading through the link above, I thought this would work but it didn't:
df.sort(['Player', 'Year', 'Tm'], ascending = [True, True, sorter])
It still has ATL at the top, meaning that it sorted alphabetically and not according to my custom list. Any help would really be greatly appreciated, I just can't figure this out.
Accepted answer by Guillaume Jacquenot
Below is an example that performs a lexicographic sort on a dataframe. The idea is to create a numerical index based on the custom order, then to perform a numerical sort on that index. A column is added to the dataframe to do so, and is removed afterwards.
import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda',
               'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81],
})

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
          'WAS', 'WSB']

# Create the dictionary that defines the order for sorting
sorterIndex = dict(zip(sorter, range(len(sorter))))

# Generate a rank column that will be used to sort the dataframe numerically
df['Tm_Rank'] = df['Tm'].map(sorterIndex)

# Here is the result asked for, with the lexicographic sort.
# The result may be hard to analyze, so a second sort is proposed next.
# NOTE: newer versions of pandas use 'sort_values' instead of 'sort'
df.sort_values(['Player', 'Year', 'Tm_Rank'],
               ascending=[True, True, True], inplace=True)
# Name the column by keyword: newer pandas versions no longer accept
# a positional axis argument in drop
df.drop(columns='Tm_Rank', inplace=True)
print(df)

# Here is an example where 'Tm' is sorted first, so that the first row
# of the DataFrame df contains TOT as 'Tm'
df['Tm_Rank'] = df['Tm'].map(sorterIndex)
df.sort_values(['Tm_Rank', 'Player', 'Year'],
               ascending=[True, True, True], inplace=True)
df.drop(columns='Tm_Rank', inplace=True)
print(df)
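The rank-column idea above can also be written without adding and dropping a temporary column. This is only a minimal sketch, assuming pandas 1.1 or later, where sort_values accepts a key callable. The shortened sorter list and the names rank and custom_key are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda',
               'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81],
})

# Shortened custom order for illustration; use the full sorter list in practice.
sorter = ['TOT', 'CHH', 'DAL', 'OKC', 'VAN']
rank = {team: position for position, team in enumerate(sorter)}


def custom_key(column):
    # sort_values calls the key once per sort column; only 'Tm' is remapped
    # to its position in the custom order, other columns are left unchanged.
    return column.map(rank) if column.name == 'Tm' else column


by_team = df.sort_values(['Tm', 'Player', 'Year'], key=custom_key)
by_player = df.sort_values(['Player', 'Year', 'Tm'], key=custom_key)
print(by_team)
```

Because the key only affects the comparison, the returned frames keep the original 'Tm' strings and no helper column needs to be cleaned up.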
Answer by dmeu
I just discovered that with pandas 0.15.1 it is possible to use categorical series (http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#categoricals)
As for your example, let's define the same data-frame and sorter:
import pandas as pd

data = {
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker',
               'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81]
}

# Create DataFrame
df = pd.DataFrame(data)

# Define the sorter
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
With the data-frame and the sorter, which defines the category order, we can do the following in pandas 0.15.1 and later:
# Convert the Tm column to a category and set the sorter as the category order.
# The result is assigned back because recent pandas versions no longer accept
# inplace=True in cat.set_categories().
df['Tm'] = df['Tm'].astype("category").cat.set_categories(sorter, ordered=True)
print(df.Tm)
Out[48]:
0 CHH
1 VAN
2 TOT
3 OKC
4 DAL
Name: Tm, dtype: category
Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]
df.sort_values(["Tm"]) ## 'sort' changed to 'sort_values'
Out[49]:
   Age   G           Player   Tm  Year     id
2   22  60      Ratko Varda  TOT  2001  13950
0   27   6    Cedric Hunter  CHH  1991   2967
4   31  81  Adrian Caldwell  DAL  1997   6169
3   34  52       Ryan Bowen  OKC  2009   6141
1   25   7    Maurice Baker  VAN  2004   5335
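The categorical approach also covers the multi-column sort from the question (Player, Year, then Tm). The sketch below is hedged: the rows are made up so that one player has several team entries in the same season, the sorter is shortened, and pd.Categorical is used here as one possible way to build the ordered column.

```python
import pandas as pd

# Made-up rows: one player with several team entries in the same season.
df = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Year': [2001, 2001, 2001, 2001],
    'Tm': ['BOS', 'TOT', 'ATL', 'CHI'],
})

sorter = ['TOT', 'ATL', 'BOS', 'CHI']  # shortened for illustration

# An ordered categorical sorts by the position of each value in `categories`.
df['Tm'] = pd.Categorical(df['Tm'], categories=sorter, ordered=True)

result = df.sort_values(['Player', 'Year', 'Tm'])
print(result)
```

Note that values absent from the category list become NaN when the column is converted, so the sorter should cover every team code that appears in the data.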
Answer by Mithril
My idea is to generate a sort number from the index of the sorter list, then merge that sort number into the original dataframe.
import pandas as pd

df = pd.DataFrame({
    'id': [2967, 5335, 13950, 6141, 6169],
    'Player': ['Cedric Hunter', 'Maurice Baker', 'Ratko Varda',
               'Ryan Bowen', 'Adrian Caldwell'],
    'Year': [1991, 2004, 2001, 2009, 1997],
    'Age': [27, 25, 22, 34, 31],
    'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
    'G': [6, 7, 60, 52, 81],
})

sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
          'WAS', 'WSB']

# Build a lookup table whose index (named 'number') is the sort position
x = pd.DataFrame({'Tm': sorter})
x.index = x.index.set_names('number')
x = x.reset_index()

# Attach the sort position to each row, sort on it, then remove it again
df = pd.merge(df, x, how='left', on='Tm')
df.sort_values(['Player', 'Year', 'number'],
               ascending=[True, True, True], inplace=True)
# Name the column by keyword: newer pandas versions no longer accept
# a positional axis argument in drop
df.drop(columns='number', inplace=True)
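One detail of the merge approach is how it treats team codes missing from the sorter list. With how='left' they receive NaN as their sort number. The following sketch uses made-up players and a made-up code 'XXX' (both assumptions) to show that sort_values then places those rows last by default, which na_position makes explicit.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['P1', 'P2', 'P3'],
    'Tm': ['VAN', 'XXX', 'TOT'],   # 'XXX' is a made-up code missing from sorter
})
sorter = ['TOT', 'VAN']  # shortened for illustration

# Lookup table with one row per team and its position in the custom order.
order = pd.DataFrame({'Tm': sorter, 'number': range(len(sorter))})

# Teams without a match in the lookup table get NaN in 'number'.
merged = df.merge(order, how='left', on='Tm')

result = (merged.sort_values('number', na_position='last')
                .drop(columns='number'))
print(result)
```

Passing na_position='first' instead would move the unknown codes to the top.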
Answer by ALollz
Setting the index and then using DataFrame.loc is useful when you need to order by a single custom list. Because loc will create NaN rows for values in sorter that aren't in the DataFrame (recent pandas versions raise a KeyError for missing labels instead), we'll first find the intersection. This prevents any unwanted upcasting. Any rows with values not in the list are removed.
true_sort = [s for s in sorter if s in df.Tm.unique()]
df = df.set_index('Tm').loc[true_sort].reset_index()
    Tm     id           Player  Year  Age   G
0  TOT  13950      Ratko Varda  2001   22  60
1  CHH   2967    Cedric Hunter  1991   27   6
2  DAL   6169  Adrian Caldwell  1997   31  81
3  OKC   6141       Ryan Bowen  2009   34  52
4  VAN   5335    Maurice Baker  2004   25   7
Starting Data:
print(df)
      id           Player  Year  Age   Tm   G
0   2967    Cedric Hunter  1991   27  CHH   6
1   5335    Maurice Baker  2004   25  VAN   7
2  13950      Ratko Varda  2001   22  TOT  60
3   6141       Ryan Bowen  2009   34  OKC  52
4   6169  Adrian Caldwell  1997   31  DAL  81
sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
          'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
          'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
          'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']
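To make the last point of this answer concrete, here is a small sketch of the set_index / loc approach on made-up data (the players and the code 'XXX' are assumptions). It shows that the row whose team is not in the sorter is dropped, and that repeated team codes are all kept.

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['P1', 'P2', 'P3', 'P4'],
    'Tm': ['VAN', 'XXX', 'TOT', 'VAN'],   # 'XXX' is a made-up code not in sorter
})
sorter = ['TOT', 'ATL', 'VAN']  # shortened for illustration

# Keep only the sorter entries that occur in the data, in sorter order.
present = set(df['Tm'])
true_sort = [s for s in sorter if s in present]

# loc returns the rows label by label, so the result follows true_sort.
result = df.set_index('Tm').loc[true_sort].reset_index()
print(result)
```

If rows with unknown codes must be kept, one of the rank-based approaches from the other answers is probably a better fit.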