Python: custom sorting in a pandas DataFrame

Question

Asked by Kathirmani Sukumar

I have a pandas DataFrame in which one column contains month names.

How can I do a custom sort using a dictionary, for example:

custom_dict = {'March':0, 'April':1, 'Dec':3}

Answer 1

Accepted answer by Andy Hayden

Pandas 0.15 introduced Categorical Series, which allows a much clearer way to do this:

First make the month column a categorical and specify the ordering to use.

In [21]: df['m'] = pd.Categorical(df['m'], ["March", "April", "Dec"])

In [22]: df  # looks the same!
Out[22]:
   a  b      m
0  1  2  March
1  5  6    Dec
2  3  4  April

Now, when you sort the month column it will sort with respect to that list:

In [23]: df.sort_values("m")
Out[23]:
   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

Note: if a value is not in the list it will be converted to NaN.
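
The NaN conversion mentioned in the note can be seen in a minimal sketch. The extra value "June" is a hypothetical entry added only for illustration; it is not part of the original data.

```python
import pandas as pd

# "June" is a hypothetical value that is not listed in the categories.
months = pd.Series(["March", "Dec", "April", "June"])
cat = pd.Categorical(months, categories=["March", "April", "Dec"])

# Values outside the category list are stored as missing.
is_missing = pd.Series(cat).isna().tolist()
print(is_missing)  # [False, False, False, True]

# Missing values are placed last by sort_values by default.
sorted_months = pd.Series(cat).sort_values()
print(sorted_months.index.tolist())  # [0, 2, 1, 3]
```

If silently losing values is a concern, checking `isna()` right after the conversion is one way to catch months that were misspelled or missing from the list.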

An older answer for those interested...

You could create an intermediary series, and set_index on that:

df = pd.DataFrame([[1, 2, 'March'],[5, 6, 'Dec'],[3, 4, 'April']], columns=['a','b','m'])
s = df['m'].apply(lambda x: {'March':0, 'April':1, 'Dec':3}[x])

In [4]: df.set_index(s).sort_index()
Out[4]:
   a  b      m
m
0  1  2  March
1  3  4  April
3  5  6    Dec

As commented, in newer pandas, Series has a replace method to do this more elegantly:

s = df['m'].replace({'March':0, 'April':1, 'Dec':3})

The slight difference is that this won't raise if there is a value outside of the dictionary (it'll just stay the same).
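
To make that difference concrete, here is a small sketch comparing the dictionary lookup, replace, and (as an extra point of comparison not mentioned in this answer) Series.map. The value "June" and the variable names are assumptions for illustration.

```python
import pandas as pd

order = {"March": 0, "April": 1, "Dec": 3}

# "June" is a hypothetical value with no entry in `order`.
# dtype=object lets the replaced result hold both ints and strings.
months = pd.Series(["March", "Dec", "June"], dtype=object)

replaced = months.replace(order)  # unknown values are left unchanged
mapped = months.map(order)        # unknown values become NaN

try:
    months.apply(lambda x: order[x])  # plain dict lookup
    raised = False
except KeyError:
    raised = True

print(replaced.tolist())       # [0, 3, 'June']
print(mapped.isna().tolist())  # [False, False, True]
print(raised)                  # True
```

Which behaviour is preferable depends on whether an unexpected month should stop the program, be kept as-is, or be treated as missing.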

Answer 2

Answered by eumiro

import pandas as pd
custom_dict = {'March':0,'April':1,'Dec':3}

df = pd.DataFrame(...) # with columns April, March, Dec (probably alphabetically)

df = pd.DataFrame(df, columns=sorted(custom_dict, key=custom_dict.get))

returns a DataFrame with columns March, April, Dec
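
Since the DataFrame above is elided, here is a self-contained sketch of the same column reordering. The single row of numbers is made-up sample data, and plain column selection is used in place of the constructor call; both are assumptions for illustration.

```python
import pandas as pd

custom_dict = {"March": 0, "April": 1, "Dec": 3}

# Hypothetical data whose columns are month names in alphabetical order.
df = pd.DataFrame([[10, 20, 30]], columns=["April", "Dec", "March"])

# Sort the dictionary keys by their assigned rank, then select in that order.
ordered_columns = sorted(custom_dict, key=custom_dict.get)
reordered = df[ordered_columns]

print(reordered.columns.tolist())  # ['March', 'April', 'Dec']
```

Note that this reorders columns, whereas the question asks about sorting rows by the values in a month column.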

Answer 3

Answered by Michael Delgado

Update: The accepted answer is now the right way to do this. Leaving my old answer below for posterity, but if you encounter this post, look above and prosper.

Original post

A bit late to the game, but here's a way to create a function that sorts pandas Series, DataFrame, and multiindex DataFrame objects using arbitrary functions.

I make use of the df.iloc[index] method, which references a row in a Series/DataFrame by position (compared to df.loc, which references by label). Using this, we just need a function that returns a list of positions:

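from functools import cmp_to_key

def sort_pd(key=None, reverse=False, cmp=None):
    # Python 3 dropped sorted(cmp=...); wrap a comparator with cmp_to_key instead.
    if cmp is not None:
        key = cmp_to_key(cmp)
    def sorter(series):
        values = list(series)
        keys = values if key is None else [key(v) for v in values]
        # Sort positions, not values, so duplicate values map to distinct rows.
        return sorted(range(len(values)), key=keys.__getitem__, reverse=reverse)
    return sorter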

You can use this to create custom sorting functions. This works on the dataframe used in Andy Hayden's answer:

df = pd.DataFrame([
    [1, 2, 'March'],
    [5, 6, 'Dec'],
    [3, 4, 'April']], 
  columns=['a','b','m'])

custom_dict = {'March':0, 'April':1, 'Dec':3}
sort_by_custom_dict = sort_pd(key=custom_dict.get)

In [6]: df.iloc[sort_by_custom_dict(df['m'])]
Out[6]:
   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

This also works on multiindex DataFrames and Series objects:

months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

df = pd.DataFrame([
    ['New York','Mar',12714],
    ['New York','Apr',89238],
    ['Atlanta','Jan',8161],
    ['Atlanta','Sep',5885],
  ],columns=['location','month','sales']).set_index(['location','month'])

sort_by_month = sort_pd(key=months.index)

In [10]: df.iloc[sort_by_month(df.index.get_level_values('month'))]
Out[10]:
                 sales
location  month  
Atlanta   Jan    8161
New York  Mar    12714
          Apr    89238
Atlanta   Sep    5885

sort_by_last_digit = sort_pd(key=lambda x: x%10)

In [12]: pd.Series(list(df['sales'])).iloc[sort_by_last_digit(df['sales'])]
Out[12]:
2    8161
0   12714
3    5885
1   89238

To me this feels clean, but it uses python operations heavily rather than relying on optimized pandas operations. I haven't done any stress testing but I'd imagine this could get slow on very large DataFrames. Not sure how the performance compares to adding, sorting, then deleting a column. Any tips on speeding up the code would be appreciated!
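
For comparison, the "add, sort, then delete a column" approach mentioned above could look roughly like the sketch below. The helper column name `_rank` is a hypothetical choice, and no performance claim is made here.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "March"], [5, 6, "Dec"], [3, 4, "April"]],
    columns=["a", "b", "m"],
)
custom_dict = {"March": 0, "April": 1, "Dec": 3}

# "_rank" is a hypothetical helper column; pick a name that cannot clash
# with an existing column.
result = (
    df.assign(_rank=df["m"].map(custom_dict))  # add the sort key
      .sort_values("_rank")                    # sort on it
      .drop(columns="_rank")                   # remove it again
)

print(result.index.tolist())  # [0, 2, 1]
```

This stays within vectorized pandas operations, so it may scale better than the pure-Python sorter, though that would need to be measured on real data.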

Answer 4

Answered by cs95

pandas >= 1.1

You can use sort_values with the key argument:

custom_dict = {'March': 0, 'April': 1, 'Dec': 3} 
df

   a  b      m
0  1  2  March
1  5  6    Dec
2  3  4  April

df.sort_values(by='m', key=lambda x: x.map(custom_dict))

   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec
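
When sorting by more than one column, the key function is applied to each column separately, so it has to leave the other columns alone. The sketch below assumes pandas >= 1.1; the fourth row and the tie-breaking on column a are additions for illustration only.

```python
import pandas as pd

custom_dict = {"March": 0, "April": 1, "Dec": 3}

# The last row is a hypothetical extra entry that creates a tie on "Dec".
df = pd.DataFrame(
    [[1, 2, "March"], [5, 6, "Dec"], [3, 4, "April"], [0, 9, "Dec"]],
    columns=["a", "b", "m"],
)

def month_key(col):
    # Map only the month column; return any other column unchanged.
    return col.map(custom_dict) if col.name == "m" else col

by_month = df.sort_values(by="m", key=month_key)
by_month_then_a = df.sort_values(by=["m", "a"], key=month_key)

print(by_month["m"].tolist())          # ['March', 'April', 'Dec', 'Dec']
print(by_month_then_a.index.tolist())  # [0, 2, 3, 1]
```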

pandas <= 1.0.X

One simple method is to use the output of Series.map and Series.argsort to index into df using DataFrame.iloc (since argsort produces sorted integer positions); since you have a dictionary, this becomes easy.

df.iloc[df['m'].map(custom_dict).argsort()]

   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

If you need to sort in descending order, negate the mapped values.

df.iloc[(-df['m'].map(custom_dict)).argsort()]

   a  b      m
1  5  6    Dec
2  3  4  April
0  1  2  March

Note that this only works on numeric items. Otherwise, you will need to work around this using sort_values and accessing the index:

df.loc[df['m'].map(custom_dict).sort_values(ascending=False).index]

   a  b      m
1  5  6    Dec
2  3  4  April
0  1  2  March
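
As an illustration of the non-numeric case, the sketch below maps months to string ranks, which cannot be negated, and sorts through the index instead. The `rank` dictionary and its values are hypothetical; it also assumes the index labels are unique, since df.loc is used for the lookup.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, "March"], [5, 6, "Dec"], [3, 4, "April"]],
    columns=["a", "b", "m"],
)

# Hypothetical non-numeric ranks; they compare alphabetically: low < mid < top.
rank = {"March": "low", "April": "mid", "Dec": "top"}
keys = df["m"].map(rank)

# Unary minus is not defined for strings, so sort the keys themselves
# and reuse their index labels to reorder the rows.
result = df.loc[keys.sort_values(ascending=False).index]

print(result.index.tolist())  # [1, 2, 0]
```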

More options are available with astype (this is deprecated now) or pd.Categorical, but you need to specify ordered=True for it to work correctly.

# Older version,
# df['m'].astype('category', 
#                categories=sorted(custom_dict, key=custom_dict.get), 
#                ordered=True)
df['m'] = pd.Categorical(df['m'], 
                         categories=sorted(custom_dict, key=custom_dict.get), 
                         ordered=True)

Now, a simple sort_values call will do the trick:

df.sort_values('m')

   a  b      m
0  1  2  March
2  3  4  April
1  5  6    Dec

The categorical ordering will also be honoured when groupby sorts the output.
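
A short sketch of the groupby behaviour, using a made-up sales column. Passing observed=True explicitly is a choice made here to avoid depending on the version-specific default.

```python
import pandas as pd

# Hypothetical sales figures; the rows are deliberately out of month order.
df = pd.DataFrame({
    "m": ["Dec", "March", "April", "March"],
    "sales": [5, 1, 3, 2],
})
df["m"] = pd.Categorical(
    df["m"], categories=["March", "April", "Dec"], ordered=True
)

# Group keys come back in category order rather than alphabetical order.
totals = df.groupby("m", observed=True)["sales"].sum()

print(totals.index.tolist())  # ['March', 'April', 'Dec']
print(totals.tolist())        # [3, 3, 5]
```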
