Original question: http://stackoverflow.com/questions/4806911/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
String comparison technique used by Python
Asked by davelupt
I'm wondering how Python does string comparison, more specifically how it determines the outcome when a less than (<) or greater than (>) operator is used.
For instance, if I run print('abc' < 'bac') I get True. I understand that it compares corresponding characters in the strings, but it's unclear why more "weight", for lack of a better term, is placed on the fact that a is less than b in the first position than on the fact that a is less than b in the second position of the second string.
Accepted answer by user225312
From the docs:
The comparison uses lexicographical ordering: first the first two items are compared, and if they differ this determines the outcome of the comparison; if they are equal, the next two items are compared, and so on, until either sequence is exhausted.
Also:
Lexicographical ordering for strings uses the Unicode code point number to order individual characters.
or on Python 2:
Lexicographical ordering for strings uses the ASCII ordering for individual characters.
As an example:
>>> 'abc' > 'bac'
False
>>> ord('a'), ord('b')
(97, 98)
The result False is returned as soon as 'a' is found to be less than 'b'. The further items are not compared (as you can see for the second items: 'b' > 'a' is True).
Be aware of lower and uppercase:
>>> import string
>>> abc = string.ascii_lowercase
>>> [(x, ord(x)) for x in abc]
[('a', 97), ('b', 98), ('c', 99), ('d', 100), ('e', 101), ('f', 102), ('g', 103), ('h', 104), ('i', 105), ('j', 106), ('k', 107), ('l', 108), ('m', 109), ('n', 110), ('o', 111), ('p', 112), ('q', 113), ('r', 114), ('s', 115), ('t', 116), ('u', 117), ('v', 118), ('w', 119), ('x', 120), ('y', 121), ('z', 122)]
>>> [(x, ord(x)) for x in abc.upper()]
[('A', 65), ('B', 66), ('C', 67), ('D', 68), ('E', 69), ('F', 70), ('G', 71), ('H', 72), ('I', 73), ('J', 74), ('K', 75), ('L', 76), ('M', 77), ('N', 78), ('O', 79), ('P', 80), ('Q', 81), ('R', 82), ('S', 83), ('T', 84), ('U', 85), ('V', 86), ('W', 87), ('X', 88), ('Y', 89), ('Z', 90)]
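Because every uppercase ASCII letter has a smaller code point than every lowercase one, mixed-case strings can sort in a surprising order. The snippet below is a small sketch of that effect; the word list is made up for illustration, and `str.casefold` is only one possible way to get a case-insensitive sort key.

```python
# Hypothetical sample data, chosen only to show the effect of case.
words = ["banana", "Cherry", "apple"]

# Plain sorting compares code points, so 'C' (67) sorts before 'a' (97).
by_code_point = sorted(words)
print(by_code_point)      # ['Cherry', 'apple', 'banana']

# Folding case in the sort key gives a case-insensitive order.
case_insensitive = sorted(words, key=str.casefold)
print(case_insensitive)   # ['apple', 'banana', 'Cherry']

# The same rule explains this comparison: 'Z' (90) < 'a' (97).
print("Zebra" < "apple")  # True
```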
Answer by wkl
Python string comparison is lexicographic:
From Python Docs: http://docs.python.org/reference/expressions.html
Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.
Hence in your example, 'abc' < 'bac', 'a' comes before (less-than) 'b' numerically (in ASCII and Unicode representations), so the comparison ends right there.
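One way to convince yourself of this is to compare the lists of ord() values directly, since lists are also compared lexicographically. The helper name `as_code_points` and the sample pairs are assumptions made for this sketch; they do not come from the Python docs.

```python
def as_code_points(text):
    """Return the ord() value of each character in text."""
    return [ord(ch) for ch in text]

# Hypothetical sample pairs covering different, prefix and equal strings.
pairs = [("abc", "bac"), ("foo", "food"), ("Zebra", "apple"), ("same", "same")]

for left, right in pairs:
    # Comparing the strings and comparing their code point lists should agree.
    assert (left < right) == (as_code_points(left) < as_code_points(right))

print(as_code_points("abc"))  # [97, 98, 99]
print(as_code_points("bac"))  # [98, 97, 99]
```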
Answer by Michael J. Barber
This is a lexicographical ordering. It just puts things in dictionary order.
Answer by Senthil Kumaran
Strings are compared lexicographically using the numeric equivalents (the result of the built-in function ord()) of their characters. Unicode and 8-bit strings are fully interoperable in this behavior.
Answer by John Machin
Python and just about every other computer language use the same principles as (I hope) you would use when finding a word in a printed dictionary:
(1) Depending on the human language involved, you have a notion of character ordering: 'a' < 'b' < 'c' etc.
(2) The first character has more weight than the second character: 'az' < 'za' (whether the language is written left-to-right or right-to-left or boustrophedon is quite irrelevant).
(3) If you run out of characters to test, the shorter string is less than the longer string: 'foo' < 'food'.
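The three rules can be checked directly in Python. This is just a quick sketch; the dictionary name `checks` and the extra 'ab' versus 'b' case are added here for illustration.

```python
# Each entry maps a description of a rule to the comparison that shows it.
checks = {
    "rule 1, character ordering": "a" < "b" < "c",
    "rule 2, first character decides": "az" < "za",
    "rule 2, later characters are ignored": "ab" < "b",
    "rule 3, shorter prefix is smaller": "foo" < "food",
}

for description, result in checks.items():
    print(description, "->", result)  # every line ends with True
```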
Typically, in a computer language the "notion of character ordering" is rather primitive: each character has a human-language-independent number, ord(character), and characters are compared and sorted using that number. Often that ordering is not appropriate to the human language of the user, and then you need to get into "collating", a fun topic.
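As a rough illustration of why code point order can disagree with a human alphabet, the sketch below sorts a few words by code point and then defines a helper that tries locale-aware collation through the standard `locale` module. The word list and the helper name `locale_sorted` are assumptions for this example, and what the helper returns depends on which locales are installed on your system, so no output is shown for it.

```python
import locale

# Hypothetical sample words; 'é' is code point 233, well above 'z' (122).
words = ["zebra", "école", "eclair"]

# The default sort uses code points, so 'école' lands after 'zebra'.
code_point_order = sorted(words)
print(code_point_order)  # ['eclair', 'zebra', 'école']

def locale_sorted(items, locale_name=""):
    """Sort items with the collation rules of the given locale.

    An empty locale_name asks for the user's default locale.
    Returns None if that locale cannot be set on this system.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    return sorted(items, key=locale.strxfrm)
```

Calling locale_sorted(words) on a system with a suitable locale is one way to see a collated order, although the exact result varies by platform.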
Answer by yannis
Take a look also at "How do I sort unicode strings alphabetically in Python?", where the discussion is about sorting rules given by the Unicode Collation Algorithm (http://www.unicode.org/reports/tr10/).
To reply to the comment
What? How else can ordering be defined other than left-to-right?
by S.Lott, there is a famous counter-example when sorting French. It involves accents: indeed, one could say that, in French, letters are sorted left-to-right and accents right-to-left. Here is the counter-example: we have e < é and o < ô, so you would expect the words cote, coté, côte, côté to be sorted as cote < coté < côte < côté. Well, this is not what happens; in fact you have cote < côte < coté < côté, i.e., if we remove "c" and "t", we get oe < ôe < oé < ôé, which is exactly right-to-left ordering.
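The French example above can be imitated with a small sort key. This is only a toy sketch of the "letters forward, accents backward" idea, not the Unicode Collation Algorithm; the function name `french_like_key` is made up here, and the key assumes each accented letter decomposes into a base letter plus combining marks.

```python
import unicodedata

def french_like_key(word):
    """Toy sort key: base letters compared forward, accents compared backward."""
    word = unicodedata.normalize("NFC", word)
    base_letters = []
    accents = []
    for ch in word:
        decomposed = unicodedata.normalize("NFD", ch)
        base_letters.append(decomposed[0])  # the unaccented letter
        accents.append(decomposed[1:])      # combining marks, '' if none
    # Reversing the accent list makes the LAST accent the most significant.
    return ("".join(base_letters), accents[::-1])

words = ["côté", "coté", "cote", "côte"]

code_point_order = sorted(words)
print(code_point_order)   # ['cote', 'coté', 'côte', 'côté']

backward_accent_order = sorted(words, key=french_like_key)
print(backward_accent_order)  # ['cote', 'côte', 'coté', 'côté']
```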
And a last remark: you shouldn't be talking about left-to-right and right-to-left sorting but rather about forward and backward sorting.
Indeed there are languages written from right to left, and if you think Arabic and Hebrew are sorted right-to-left you may be right from a graphical point of view, but you are wrong on the logical level!
Indeed, Unicode considers character strings encoded in logical order, and writing direction is a phenomenon occurring on the glyph level. In other words, even if in the word שלום the letter shin appears on the right of the lamed, logically it occurs before it. To sort this word one will first consider the shin, then the lamed, then the vav, then the mem, and this is forward ordering (although Hebrew is written right-to-left), while French accents are sorted backwards (although French is written left-to-right).
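The logical order can be inspected from Python by listing the Unicode names of the characters in index order. The word is written with escape sequences here so that the display direction of your editor does not matter; the variable names are chosen for this sketch.

```python
import unicodedata

# The Hebrew word from the text: shin, lamed, vav, (final) mem.
word = "\u05e9\u05dc\u05d5\u05dd"

# Index 0 is the logically first letter, even though it is drawn rightmost.
names = [unicodedata.name(ch) for ch in word]
for index, name in enumerate(names):
    print(index, name)
# 0 HEBREW LETTER SHIN
# 1 HEBREW LETTER LAMED
# 2 HEBREW LETTER VAV
# 3 HEBREW LETTER FINAL MEM
```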
Answer by MSeifert
A pure Python equivalent for string comparisons would be:
def less(string1, string2):
    # Compare character by character
    for idx in range(min(len(string1), len(string2))):
        # Get the "value" of the character
        ordinal1, ordinal2 = ord(string1[idx]), ord(string2[idx])
        # If the "value" is identical check the next characters
        if ordinal1 == ordinal2:
            continue
        # If it's smaller we're finished and can return True
        elif ordinal1 < ordinal2:
            return True
        # If it's bigger we're finished and return False
        else:
            return False
    # We're out of characters and all were equal, so the result depends on the length
    # of the strings.
    return len(string1) < len(string2)
This function does the equivalent of the real method (Python 3.6 and Python 2.7), just a lot slower. Also note that the implementation isn't exactly "pythonic" and only works for < comparisons. It's just to illustrate how it works. I haven't checked if it works like Python's comparison for combined Unicode characters.
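On the question of combined Unicode characters, a small experiment suggests that Python's built-in comparison works on code points as stored and does not normalize first. The sketch below compares a precomposed 'é' with its decomposed form; the variable names and the helper `nfc` are introduced here for illustration.

```python
import unicodedata

precomposed = "\u00e9"  # 'é' as a single code point (233)
decomposed = "e\u0301"  # 'e' (101) followed by COMBINING ACUTE ACCENT

# The two strings usually render the same but are not equal as code points.
print(precomposed == decomposed)  # False
print(decomposed < precomposed)   # True, because 101 < 233

def nfc(text):
    """Normalize text to NFC so equivalent forms compare equal."""
    return unicodedata.normalize("NFC", text)

print(nfc(precomposed) == nfc(decomposed))  # True
```

Normalizing both sides before comparing is one option when such strings should be treated as the same text.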
A more general variant would be:
from operator import lt, gt

def compare(string1, string2, less=True):
    # Pick the operator: < when less is True, > otherwise
    op = lt if less else gt
    for char1, char2 in zip(string1, string2):
        ordinal1, ordinal2 = ord(char1), ord(char2)
        if ordinal1 == ordinal2:
            continue
        elif op(ordinal1, ordinal2):
            return True
        else:
            return False
    # All paired characters were equal, so the lengths decide
    return op(len(string1), len(string2))

