Python 是否有用于字符串自然排序的内置函数?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/4836710/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Is there a built in function for string natural sort?
提问by snakile
Using Python 3.x, I have a list of strings for which I would like to perform a natural alphabetical sort.
使用 Python 3.x,我有一个字符串列表,我想对其执行自然的字母排序。
Natural sort:The order by which files in Windows are sorted.
自然排序:Windows 中文件的排序顺序。
For instance, the following list is naturally sorted (what I want):
例如,下面的列表是自然排序的(我想要的):
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
And here's the "sorted" version of the above list (what I have):
这是上述列表的“排序”版本(我拥有的):
['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
I'm looking for a sort function which behaves like the first one.
我正在寻找一个排序函数,它的行为类似于第一个。
采纳答案by SethMMorton
There is a third party library for this on PyPI called natsort(full disclosure, I am the package's author). For your case, you can do either of the following:
PyPI 上有一个名为natsort 的第三方库(完全公开,我是该包的作者)。对于您的情况,您可以执行以下任一操作:
>>> from natsort import natsorted, ns
>>> x = ['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
>>> natsorted(x, key=lambda y: y.lower())
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsorted(x, alg=ns.IGNORECASE) # or alg=ns.IC
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
You should note that natsortuses a general algorithm so it should work for just about any input that you throw at it. If you want more details on why you might choose a library to do this rather than rolling your own function, check out the natsortdocumentation's How It Workspage, in particular the Special Cases Everywhere!section.
您应该注意它natsort使用通用算法,因此它应该适用于您投入的任何输入。如果您想了解更多关于为什么选择一个库来执行此操作而不是滚动您自己的函数的详细信息,请查看natsort文档的“如何工作”页面,特别是随处可见的特殊情况!部分。
If you need a sorting key instead of a sorting function, use either of the below formulas.
如果您需要排序键而不是排序函数,请使用以下任一公式。
>>> from natsort import natsort_keygen, ns
>>> l1 = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> l2 = l1[:]
>>> natsort_key1 = natsort_keygen(key=lambda y: y.lower())
>>> l1.sort(key=natsort_key1)
>>> l1
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsort_key2 = natsort_keygen(alg=ns.IGNORECASE)
>>> l2.sort(key=natsort_key2)
>>> l2
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
回答by Mark Byers
Try this:
尝试这个:
import re
def natural_sort(l):
convert = lambda text: int(text) if text.isdigit() else text.lower()
alphanum_key = lambda key: [ convert(c) for c in re.split('([0-9]+)', key) ]
return sorted(l, key = alphanum_key)
Output:
输出:
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
Code adapted from here: Sorting for Humans : Natural Sort Order.
代码改编自此处:人类排序:自然排序顺序。
回答by SilentGhost
>>> import re
>>> sorted(lst, key=lambda x: int(re.findall(r'\d+$', x)[0]))
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
回答by robert king
One option is to turn the string into a tuple and replace digits using expanded form http://wiki.answers.com/Q/What_does_expanded_form_mean
一种选择是将字符串转换为元组并使用扩展形式替换数字http://wiki.answers.com/Q/What_does_expanded_form_mean
that way a90 would become ("a",90,0) and a1 would become ("a",1)
这样 a90 会变成 ("a",90,0) 并且 a1 会变成 ("a",1)
below is some sample code (which isn't very efficient due to the way It removes leading 0's from numbers)
下面是一些示例代码(由于它从数字中删除前导 0 的方式,效率不是很高)
alist=["something1",
"something12",
"something17",
"something2",
"something25and_then_33",
"something25and_then_34",
"something29",
"beta1.1",
"beta2.3.0",
"beta2.33.1",
"a001",
"a2",
"z002",
"z1"]
def key(k):
nums=set(list("0123456789"))
chars=set(list(k))
chars=chars-nums
for i in range(len(k)):
for c in chars:
k=k.replace(c+"0",c)
l=list(k)
base=10
j=0
for i in range(len(l)-1,-1,-1):
try:
l[i]=int(l[i])*base**j
j+=1
except:
j=0
l=tuple(l)
print l
return l
print sorted(alist,key=key)
output:
输出:
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 1)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 10, 7)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 2)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 3)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 5, 'a', 'n', 'd', '_', 't', 'h', 'e', 'n', '_', 30, 4)
('s', 'o', 'm', 'e', 't', 'h', 'i', 'n', 'g', 20, 9)
('b', 'e', 't', 'a', 1, '.', 1)
('b', 'e', 't', 'a', 2, '.', 3, '.')
('b', 'e', 't', 'a', 2, '.', 30, 3, '.', 1)
('a', 1)
('a', 2)
('z', 2)
('z', 1)
['a001', 'a2', 'beta1.1', 'beta2.3.0', 'beta2.33.1', 'something1', 'something2', 'something12', 'something17', 'something25and_then_33', 'something25and_then_34', 'something29', 'z1', 'z002']
回答by beauburrier
I wrote a function based on http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.htmlwhich adds the ability to still pass in your own 'key' parameter. I need this in order to perform a natural sort of lists that contain more complex objects (not just strings).
我写了一个基于http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html的函数,它增加了仍然传递你自己的“key”参数的能力。我需要这个来执行包含更复杂对象(不仅仅是字符串)的自然列表。
import re
def natural_sort(list, key=lambda s:s):
"""
Sort the list into natural alphanumeric order.
"""
def get_alphanum_key_func(key):
convert = lambda text: int(text) if text.isdigit() else text
return lambda s: [convert(c) for c in re.split('([0-9]+)', key(s))]
sort_key = get_alphanum_key_func(key)
list.sort(key=sort_key)
For example:
例如:
my_list = [{'name':'b'}, {'name':'10'}, {'name':'a'}, {'name':'1'}, {'name':'9'}]
natural_sort(my_list, key=lambda x: x['name'])
print my_list
[{'name': '1'}, {'name': '9'}, {'name': '10'}, {'name': 'a'}, {'name': 'b'}]
回答by Claudiu
Here's a much more pythonic version of Mark Byer's answer:
这是 Mark Byer 答案的更pythonic 版本:
import re
def natural_sort_key(s, _nsre=re.compile('([0-9]+)')):
return [int(text) if text.isdigit() else text.lower()
for text in _nsre.split(s)]
Now this function can be used as a key in any function that uses it, like list.sort, sorted, max, etc.
现在,这个功能可以作为在使用它的任何功能,就象是一把钥匙list.sort,sorted,max,等。
As a lambda:
作为一个 lambda:
lambda s: [int(t) if t.isdigit() else t.lower() for t in re.split('(\d+)', s)]
回答by Scott Lawton
The above answers are good for the specific examplethat was shown, but miss several useful cases for the more general question of natural sort. I just got bit by one of those cases, so created a more thorough solution:
上面的答案适用于显示的特定示例,但对于更一般的自然排序问题错过了几个有用的案例。我只是对其中一个案例有所了解,因此创建了一个更彻底的解决方案:
def natural_sort_key(string_or_number):
"""
by Scott S. Lawton <[email protected]> 2014-12-11; public domain and/or CC0 license
handles cases where simple 'int' approach fails, e.g.
['0.501', '0.55'] floating point with different number of significant digits
[0.01, 0.1, 1] already numeric so regex and other string functions won't work (and aren't required)
['elm1', 'Elm2'] ASCII vs. letters (not case sensitive)
"""
def try_float(astring):
try:
return float(astring)
except:
return astring
if isinstance(string_or_number, basestring):
string_or_number = string_or_number.lower()
if len(re.findall('[.]\d', string_or_number)) <= 1:
# assume a floating point value, e.g. to correctly sort ['0.501', '0.55']
# '.' for decimal is locale-specific, e.g. correct for the Anglosphere and Asia but not continental Europe
return [try_float(s) for s in re.split(r'([\d.]+)', string_or_number)]
else:
# assume distinct fields, e.g. IP address, phone number with '.', etc.
# caveat: might want to first split by whitespace
# TBD: for unicode, replace isdigit with isdecimal
return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_or_number)]
else:
# consider: add code to recurse for lists/tuples and perhaps other iterables
return string_or_number
Test code and several links (on and off of StackOverflow) are here: http://productarchitect.com/code/better-natural-sort.py
测试代码和几个链接(StackOverflow 上的和关闭的)在这里:http: //productarchitect.com/code/better-natural-sort.py
Feedback welcome. That's not meant to be a definitive solution; just a step forward.
欢迎反馈。这并不意味着是一个最终的解决方案。只是向前迈出了一步。
回答by SergO
data = ['elm13', 'elm9', 'elm0', 'elm1', 'Elm11', 'Elm2', 'elm10']
Let's analyse the data. The digit capacity of all elements is 2. And there are 3 letters in common literal part 'elm'.
我们来分析一下数据。所有元素的位数为2。公共文字部分有3个字母'elm'。
So, the maximal length of element is 5. We can increase this value to make sure (for example, to 8).
因此,元素的最大长度为 5。我们可以增加此值以确保(例如,增加到 8)。
Bearing that in mind, we've got a one-line solution:
考虑到这一点,我们有一个单行解决方案:
data.sort(key=lambda x: '{0:0>8}'.format(x).lower())
without regular expressions and external libraries!
没有正则表达式和外部库!
print(data)
>>> ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'elm13']
Explanation:
解释:
for elm in data:
print('{0:0>8}'.format(elm).lower())
>>>
0000elm0
0000elm1
0000elm2
0000elm9
000elm10
000elm11
000elm13
回答by JerodG
There are many implementations out there, and while some have come close, none quite captured the elegance modern python affords.
有许多实现,虽然有些已经接近,但没有一个能完全捕捉到现代 python 提供的优雅。
- Tested using python(3.5.1)
- Included an additional list to demonstrate that it works when the numbers are mid string
- Didn't test, however, I am assuming that if your list was sizable it would be more efficient to compile the regex beforehand
- I'm sure someone will correct me if this is an erroneous assumption
- 使用 python(3.5.1) 测试
- 包括一个额外的列表来证明它在数字是中间字符串时有效
- 没有测试,但是,我假设如果您的列表很大,那么事先编译正则表达式会更有效
- 如果这是一个错误的假设,我相信有人会纠正我
from re import compile, split
dre = compile(r'(\d+)')
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
Full-Code完整代码#!/usr/bin/python3
# coding=utf-8
"""
Natural-Sort Test
"""
from re import compile, split
dre = compile(r'(\d+)')
mylist = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13', 'elm']
mylist2 = ['e0lm', 'e1lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm', 'e01lm']
mylist.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
mylist2.sort(key=lambda l: [int(s) if s.isdigit() else s.lower() for s in split(dre, l)])
print(mylist)
# ['elm', 'elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
print(mylist2)
# ['e0lm', 'e1lm', 'e01lm', 'E2lm', 'e9lm', 'e10lm', 'E12lm', 'e13lm', 'elm']
Cautionwhen using
使用时注意
from os.path import split- you will need to differentiate the imports
from os.path import split- 您需要区分进口
Inspirationfrom
灵感来自
- Python Documentation- Sorting HOW TO
- Sorting for Humans : Natural Sort Order
- Human Sorting
- Contributors/Commentators to this and referenced posts
- Python 文档 - 排序方式
- 人类排序:自然排序
- 人类分拣
- 对此和引用的帖子的贡献者/评论者
回答by user19087
Most likely functools.cmp_to_key()is closely tied to the underlying implementation of python's sort. Besides, the cmpparameter is legacy. The modern way is to transform the input items into objects that support the desired rich comparison operations.
很可能functools.cmp_to_key()与 python 排序的底层实现密切相关。此外,cmp参数是遗留的。现代方法是将输入项转换为支持所需丰富比较操作的对象。
Under CPython 2.x, objects of disparate types can be ordered even if the respective rich comparison operators haven't been implemented. Under CPython 3.x, objects of different types must explicitly support the comparison. See How does Python compare string and int?which links to the official documentation. Most of the answers depend on this implicit ordering. Switching to Python 3.x will require a new type to implement and unify comparisons between numbers and strings.
在 CPython 2.x 下,即使尚未实现相应的丰富比较运算符,也可以对不同类型的对象进行排序。在 CPython 3.x 下,不同类型的对象必须明确支持比较。请参见Python 如何比较 string 和 int?其中链接到官方文档。大多数答案取决于这种隐式排序。切换到 Python 3.x 将需要一种新类型来实现和统一数字和字符串之间的比较。
Python 2.7.12 (default, Sep 29 2016, 13:30:34)
>>> (0,"foo") < ("foo",0)
True
Python 3.5.2 (default, Oct 14 2016, 12:54:53)
>>> (0,"foo") < ("foo",0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() < str()
There are three different approaches. The first uses nested classes to take advantage of Python's Iterablecomparison algorithm. The second unrolls this nesting into a single class. The third foregoes subclassing strto focus on performance. All are timed; the second is twice as fast while the third almost six times faster. Subclassing strisn't required, and was probably a bad idea in the first place, but it does come with certain conveniences.
存在三种不同的方法。第一个使用嵌套类来利用 Python 的Iterable比较算法。第二个将这个嵌套展开为一个类。第三个放弃子类化str以关注性能。都是定时的;第二个快两倍,而第三个快六倍。子类化str不是必需的,一开始可能是个坏主意,但它确实带来了某些便利。
The sort characters are duplicated to force ordering by case, and case-swapped to force lower case letter to sort first; this is the typical definition of "natural sort". I couldn't decide on the type of grouping; some might prefer the following, which also brings significant performance benefits:
排序字符被复制以强制按大小写排序,大小写交换以强制小写字母先排序;这是“自然排序”的典型定义。我无法决定分组的类型;有些人可能更喜欢以下内容,这也带来了显着的性能优势:
d = lambda s: s.lower()+s.swapcase()
Where utilized, the comparison operators are set to that of objectso they won't be ignored by functools.total_ordering.
在使用时,比较运算符设置为 的,object因此它们不会被functools.total_ordering.
import functools
import itertools
@functools.total_ordering
class NaturalStringA(str):
def __repr__(self):
return "{}({})".format\
( type(self).__name__
, super().__repr__()
)
d = lambda c, s: [ c.NaturalStringPart("".join(v))
for k,v in
itertools.groupby(s, c.isdigit)
]
d = classmethod(d)
@functools.total_ordering
class NaturalStringPart(str):
d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
d = staticmethod(d)
def __lt__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
try:
return int(self) < int(other)
except ValueError:
if self.isdigit():
return True
elif other.isdigit():
return False
else:
return self.d(self) < self.d(other)
def __eq__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
try:
return int(self) == int(other)
except ValueError:
if self.isdigit() or other.isdigit():
return False
else:
return self.d(self) == self.d(other)
__le__ = object.__le__
__ne__ = object.__ne__
__gt__ = object.__gt__
__ge__ = object.__ge__
def __lt__(self, other):
return self.d(self) < self.d(other)
def __eq__(self, other):
return self.d(self) == self.d(other)
__le__ = object.__le__
__ne__ = object.__ne__
__gt__ = object.__gt__
__ge__ = object.__ge__
import functools
import itertools
@functools.total_ordering
class NaturalStringB(str):
def __repr__(self):
return "{}({})".format\
( type(self).__name__
, super().__repr__()
)
d = lambda s: "".join(c.lower()+c.swapcase() for c in s)
d = staticmethod(d)
def __lt__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
zipped = itertools.zip_longest(*groups)
for s,o in zipped:
if s is None:
return True
if o is None:
return False
s_k, s_v = s[0], "".join(s[1])
o_k, o_v = o[0], "".join(o[1])
if s_k and o_k:
s_v, o_v = int(s_v), int(o_v)
if s_v == o_v:
continue
return s_v < o_v
elif s_k:
return True
elif o_k:
return False
else:
s_v, o_v = self.d(s_v), self.d(o_v)
if s_v == o_v:
continue
return s_v < o_v
return False
def __eq__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
groups = map(lambda i: itertools.groupby(i, type(self).isdigit), (self, other))
zipped = itertools.zip_longest(*groups)
for s,o in zipped:
if s is None or o is None:
return False
s_k, s_v = s[0], "".join(s[1])
o_k, o_v = o[0], "".join(o[1])
if s_k and o_k:
s_v, o_v = int(s_v), int(o_v)
if s_v == o_v:
continue
return False
elif s_k or o_k:
return False
else:
s_v, o_v = self.d(s_v), self.d(o_v)
if s_v == o_v:
continue
return False
return True
__le__ = object.__le__
__ne__ = object.__ne__
__gt__ = object.__gt__
__ge__ = object.__ge__
import functools
import itertools
import enum
class OrderingType(enum.Enum):
PerWordSwapCase = lambda s: s.lower()+s.swapcase()
PerCharacterSwapCase = lambda s: "".join(c.lower()+c.swapcase() for c in s)
class NaturalOrdering:
@classmethod
def by(cls, ordering):
def wrapper(string):
return cls(string, ordering)
return wrapper
def __init__(self, string, ordering=OrderingType.PerCharacterSwapCase):
self.string = string
self.groups = [ (k,int("".join(v)))
if k else
(k,ordering("".join(v)))
for k,v in
itertools.groupby(string, str.isdigit)
]
def __repr__(self):
return "{}({})".format\
( type(self).__name__
, self.string
)
def __lesser(self, other, default):
if not isinstance(self, type(other)):
return NotImplemented
for s,o in itertools.zip_longest(self.groups, other.groups):
if s is None:
return True
if o is None:
return False
s_k, s_v = s
o_k, o_v = o
if s_k and o_k:
if s_v == o_v:
continue
return s_v < o_v
elif s_k:
return True
elif o_k:
return False
else:
if s_v == o_v:
continue
return s_v < o_v
return default
def __lt__(self, other):
return self.__lesser(other, default=False)
def __le__(self, other):
return self.__lesser(other, default=True)
def __eq__(self, other):
if not isinstance(self, type(other)):
return NotImplemented
for s,o in itertools.zip_longest(self.groups, other.groups):
if s is None or o is None:
return False
s_k, s_v = s
o_k, o_v = o
if s_k and o_k:
if s_v == o_v:
continue
return False
elif s_k or o_k:
return False
else:
if s_v == o_v:
continue
return False
return True
# functools.total_ordering doesn't create single-call wrappers if both
# __le__ and __lt__ exist, so do it manually.
def __gt__(self, other):
op_result = self.__le__(other)
if op_result is NotImplemented:
return op_result
return not op_result
def __ge__(self, other):
op_result = self.__lt__(other)
if op_result is NotImplemented:
return op_result
return not op_result
# __ne__ is the only implied ordering relationship, it automatically
# delegates to __eq__
>>> import natsort
>>> import timeit
>>> l1 = ['Apple', 'corn', 'apPlE', 'arbour', 'Corn', 'Banana', 'apple', 'banana']
>>> l2 = list(map(str, range(30)))
>>> l3 = ["{} {}".format(x,y) for x in l1 for y in l2]
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringA)', number=10000, globals=globals()))
362.4729259099986
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalStringB)', number=10000, globals=globals()))
189.7340817489967
>>> print(timeit.timeit('sorted(l3+["0"], key=NaturalOrdering.by(OrderingType.PerCharacterSwapCase))', number=10000, globals=globals()))
69.34636392899847
>>> print(timeit.timeit('natsort.natsorted(l3+["0"], alg=natsort.ns.GROUPLETTERS | natsort.ns.LOWERCASEFIRST)', number=10000, globals=globals()))
98.2531585780016
Natural sorting is both pretty complicated and vaguely defined as a problem. Don't forget to run unicodedata.normalize(...)beforehand, and consider use str.casefold()rather than str.lower(). There are probably subtle encoding issues I haven't considered. So I tentatively recommend the natsortlibrary. I took a quick glance at the github repository; the code maintenance has been stellar.
自然排序既非常复杂,又被模糊地定义为一个问题。不要忘记unicodedata.normalize(...)预先运行,并考虑使用str.casefold()而不是str.lower(). 我可能没有考虑到一些微妙的编码问题。所以我暂时推荐了natsort库。我快速浏览了 github 存储库;代码维护非常出色。
All the algorithms I've seen depend on tricks such as duplicating and lowering characters, and swapping case. While this doubles the running time, an alternative would require a total natural ordering on the input character set. I don't think this is part of the unicode specification, and since there are many more unicode digits than [0-9], creating such a sorting would be equally daunting. If you want locale-aware comparisons, prepare your strings with locale.strxfrmper Python's Sorting HOW TO.
我见过的所有算法都依赖于技巧,例如复制和降低字符以及交换大小写。虽然这会使运行时间加倍,但另一种方法是需要对输入字符集进行完全自然的排序。我不认为这是 unicode 规范的一部分,并且由于 unicode 数字比 多[0-9],因此创建这样的排序同样令人生畏。如果您想要了解语言环境的比较,请locale.strxfrm按照 Python 的Sorting HOW TO准备您的字符串。

