Naturally sorting Pandas DataFrame

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/29580978/

Date: 2020-09-13 23:11:13  Source: igfitidea

Tags: python, python-2.7, sorting, pandas, natsort

Asked by agf1997

I have a pandas DataFrame with indices I want to sort naturally. Natsort doesn't seem to work. Sorting the indices prior to building the DataFrame doesn't seem to help, because the manipulations I do to the DataFrame seem to mess up the sorting in the process. Any thoughts on how I can re-sort the indices naturally?

from natsort import natsorted
import pandas as pd

# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted 
c = natsorted(a)

# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
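# Sorted incorrectly (lexicographic order).
# The original post called df.sort(), which was removed from later pandas versions.
df2 = df.sort_index()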
# Natsort doesn't seem to work
df3 = natsorted(df)

print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)

Accepted answer by EdChum

If you want to sort the df, sort the index (or the data it was built from) and assign the result directly to the index of the df. Passing the df itself to natsorted as an arg does not work, as that yields an empty list:

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

Note that df.index = natsorted(df.index) also works.

If you pass the df as an arg it yields an empty list, in this case because the df is empty (has no columns). Otherwise it will return the column labels sorted, which is not what you want:

In [10]:

natsorted(df)
Out[10]:
[]

EDIT

If you want to sort the index so that the data is reordered along with the index, then use reindex:

In [13]:

import numpy as np
df = pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
       0
0hr    0
128hr  1
72hr   2
48hr   3
96hr   4
In [14]:

df = df*2
df
Out[14]:
       0
0hr    0
128hr  2
72hr   4
48hr   6
96hr   8
In [15]:

df.reindex(index=natsorted(df.index))
Out[15]:
       0
0hr    0
48hr   6
72hr   4
96hr   8
128hr  2

Note that you have to assign the result of reindex to either a new df or to itself; it does not accept the inplace param.

Answer by SethMMorton

The accepted answer answers the question being asked. I'd also like to add how to use natsort on columns in a DataFrame, since that will be the next question asked.

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted, index_natsorted, order_by_index

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df
Out[4]: 
         a   b
0hr     a5  b1
128hr   a1  b1
72hr   a10  b2
48hr    a2  b2
96hr   a12  b1

As the accepted answer shows, sorting by the index is fairly straightforward:

In [5]: df.reindex(index=natsorted(df.index))
Out[5]: 
         a   b
0hr     a5  b1
48hr    a2  b2
72hr   a10  b2
96hr   a12  b1
128hr   a1  b1

If you want to sort on a column in the same manner, you need to reorder the index in the same order that sorting the desired column produces. natsort provides the convenience functions index_natsorted and order_by_index to do just that.

In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then secondary, then tertiary, etc.

In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]: 
         a   b
0hr     a5  b1
96hr   a12  b1
128hr   a1  b1
48hr    a2  b2
72hr   a10  b2


Here is an alternate method using Categorical objects, which I have been told by the pandas devs is the "proper" way to do this. This requires (as far as I can see) pandas >= 0.16.0. Currently, it only works on columns, but apparently in pandas >= 0.17.0 they will add CategoricalIndex, which will allow this method to be used on an index.

In [1]: from pandas import DataFrame

In [2]: from natsort import natsorted

In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])

In [4]: df.a = df.a.astype('category')

In [5]: df.a = df.a.cat.reorder_categories(natsorted(df.a), ordered=True)

In [6]: df.b = df.b.astype('category')

In [7]: df.b = df.b.cat.reorder_categories(natsorted(set(df.b)), ordered=True)

In [9]: df.sort_values('a')  # df.sort('a') in the original post; DataFrame.sort was removed from later pandas versions
Out[9]: 
         a   b
128hr   a1  b1
48hr    a2  b2
0hr     a5  b1
72hr   a10  b2
96hr   a12  b1

In [10]: df.sort_values('b')
Out[10]: 
         a   b
0hr     a5  b1
128hr   a1  b1
96hr   a12  b1
72hr   a10  b2
48hr    a2  b2

In [11]: df.sort_values(['b', 'a'])
Out[11]: 
         a   b
128hr   a1  b1
0hr     a5  b1
96hr   a12  b1
48hr    a2  b2
72hr   a10  b2

The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".

I leave it to the user to decide if this is better than the reindex method or not, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).



Full disclosure: I am the natsort author.