Naturally sorting Pandas DataFrame
Original source: http://stackoverflow.com/questions/29580978/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by agf1997
I have a pandas DataFrame whose index I want to sort naturally. natsort doesn't seem to work. Sorting the index values before building the DataFrame doesn't help, because the manipulations I then apply to the DataFrame seem to scramble the order. Any thoughts on how I can re-sort the index naturally?
from natsort import natsorted
import pandas as pd
# An unsorted list of strings
a = ['0hr', '128hr', '72hr', '48hr', '96hr']
# Sorted incorrectly
b = sorted(a)
# Naturally Sorted
c = natsorted(a)
# Use a as the index for a DataFrame
df = pd.DataFrame(index=a)
# Sorted Incorrectly
df2 = df.sort_index()  # df.sort() was removed in pandas 0.20
# Natsort doesn't seem to work
df3 = natsorted(df)
print(a)
print(b)
print(c)
print(df.index)
print(df2.index)
print(df3.index)
Accepted answer by EdChum
If you want to sort the df, sort the index (or the data) and assign the result directly to the index of the df, rather than passing the df itself as an argument, which yields an empty list:
In [7]:
df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')
Note that df.index = natsorted(df.index) also works.
If you pass the df as an argument, it yields an empty list in this case because the df is empty (it has no columns); otherwise it returns the column labels sorted, which is not what you want:
In [10]:
natsorted(df)
Out[10]:
[]
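The empty list follows from how iteration over a DataFrame works: it yields column labels, never row labels. A minimal sketch of that behaviour, where the column names x10 and x2 are made up for illustration:

```python
import pandas as pd

labels = ['0hr', '128hr', '72hr', '48hr', '96hr']

# Same shape as the question's frame: an index but no columns.
empty = pd.DataFrame(index=labels)

# Hypothetical columns, added only to show what iteration returns.
with_cols = pd.DataFrame({'x10': 0, 'x2': 1}, index=labels)

empty_iter = list(empty)          # nothing to iterate over
with_cols_iter = list(with_cols)  # column labels, not row labels

print(empty_iter)
print(with_cols_iter)
```

Any function that simply iterates over its argument, natsorted included, therefore only ever sees the column labels.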
EDIT
Assigning to df.index only relabels the rows and leaves the data where it was. If you want the data reordered along with the index, use reindex:
In [13]:
import numpy as np
df = pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
0
0hr 0
128hr 1
72hr 2
48hr 3
96hr 4
In [14]:
df = df*2
df
Out[14]:
0
0hr 0
128hr 2
72hr 4
48hr 6
96hr 8
In [15]:
df.reindex(index=natsorted(df.index))
Out[15]:
0
0hr 0
48hr 6
72hr 4
96hr 8
128hr 2
Note that you have to assign the result of reindex to either a new df or to itself; it does not accept the inplace param.
Answer by SethMMorton
The accepted answer answers the question being asked. I'd like to also add how to use natsort on columns in a DataFrame, since that will be the next question asked.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted, index_natsorted, order_by_index
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df
Out[4]:
a b
0hr a5 b1
128hr a1 b1
72hr a10 b2
48hr a2 b2
96hr a12 b1
As the accepted answer shows, sorting by the index is fairly straightforward:
In [5]: df.reindex(index=natsorted(df.index))
Out[5]:
a b
0hr a5 b1
48hr a2 b2
72hr a10 b2
96hr a12 b1
128hr a1 b1
If you want to sort on a column in the same manner, you need to reorder the index using the order in which the desired column would be sorted. natsort provides the convenience functions index_natsorted and order_by_index to do just that.
In [6]: df.reindex(index=order_by_index(df.index, index_natsorted(df.a)))
Out[6]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [7]: df.reindex(index=order_by_index(df.index, index_natsorted(df.b)))
Out[7]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
If you want to reorder by an arbitrary number of columns (or a column and the index), you can use zip (or itertools.izip on Python 2) to specify sorting on multiple columns. The first column given will be the primary sorting column, then the secondary, then the tertiary, and so on.
In [8]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.a))))
Out[8]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
In [9]: df.reindex(index=order_by_index(df.index, index_natsorted(zip(df.b, df.index))))
Out[9]:
a b
0hr a5 b1
96hr a12 b1
128hr a1 b1
48hr a2 b2
72hr a10 b2
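The primary/secondary behaviour is ordinary tuple comparison. A rough standard-library sketch of the idea follows; natural_key is a hypothetical helper written for this illustration and is far simpler than natsort's actual algorithm.

```python
import re

def natural_key(text):
    # Hypothetical helper: split into digit and non-digit runs so that
    # numeric runs compare as integers ('a2' sorts before 'a10').
    return [int(part) if part.isdecimal() else part
            for part in re.split(r'(\d+)', text)]

index = ['0hr', '128hr', '72hr', '48hr', '96hr']
a = ['a5', 'a1', 'a10', 'a2', 'a12']
b = ['b1', 'b1', 'b2', 'b2', 'b1']

# Sort row positions by (b, a): b is the primary key, a breaks ties.
order = sorted(range(len(index)),
               key=lambda i: (natural_key(b[i]), natural_key(a[i])))
sorted_labels = [index[i] for i in order]
print(sorted_labels)  # ['128hr', '0hr', '96hr', '48hr', '72hr']
```

The row labels come out in the same order as Out[8] above.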
Here is an alternate method using Categorical objects, which the pandas devs have told me is the "proper" way to do this. It requires (as far as I can see) pandas >= 0.16.0. Currently it only works on columns, but apparently pandas >= 0.17.0 will add CategoricalIndex, which will allow this method to be used on an index.
In [1]: from pandas import DataFrame
In [2]: from natsort import natsorted
In [3]: df = DataFrame({'a': ['a5', 'a1', 'a10', 'a2', 'a12'], 'b': ['b1', 'b1', 'b2', 'b2', 'b1']}, index=['0hr', '128hr', '72hr', '48hr', '96hr'])
In [4]: df.a = df.a.astype('category')
In [5]: df.a = df.a.cat.reorder_categories(natsorted(df.a), ordered=True)  # inplace= was removed in pandas 2.0
In [6]: df.b = df.b.astype('category')
In [8]: df.b = df.b.cat.reorder_categories(natsorted(set(df.b)), ordered=True)
In [9]: df.sort_values('a')
Out[9]:
a b
128hr a1 b1
48hr a2 b2
0hr a5 b1
72hr a10 b2
96hr a12 b1
In [10]: df.sort_values('b')
Out[10]:
a b
0hr a5 b1
128hr a1 b1
96hr a12 b1
72hr a10 b2
48hr a2 b2
In [11]: df.sort_values(['b', 'a'])
Out[11]:
a b
128hr a1 b1
0hr a5 b1
96hr a12 b1
48hr a2 b2
72hr a10 b2
The Categorical object lets you define a sorting order for the DataFrame to use. The elements given when calling reorder_categories must be unique, hence the call to set for column "b".
I leave it to the user to decide whether this is better than the reindex method, since it requires you to sort the column data independently before sorting within the DataFrame (although I imagine that second sort is rather efficient).
Full disclosure: I am the natsort author.

