Python sum, why not strings?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3525359/
Asked by Muhammad Alkarouri
Python has a built-in function sum, which is effectively equivalent to:
import operator
from functools import reduce  # reduce is a builtin only in Python 2
def sum2(iterable, start=0):
    return reduce(operator.add, iterable, start)
for all types of parameters except strings. It works for numbers and lists, for example:
sum([1,2,3], 0) = sum2([1,2,3],0) = 6 #Note: 0 is the default value for start, but I include it for clarity
sum({888:1}, 0) = sum2({888:1},0) = 888
Why were strings specially left out?
sum( ['foo','bar'], '') # TypeError: sum() can't sum strings [use ''.join(seq) instead]
sum2(['foo','bar'], '') = 'foobar'
I seem to remember discussions in the Python list for the reason, so an explanation or a link to a thread explaining it would be fine.
Edit: I am aware that the standard way is to do "".join. My question is why the option of using sum for strings was banned, and no banning was there for, say, lists.
Edit 2: Although I believe this is not needed given all the good answers I got, the question is: Why does sum work on an iterable containing numbers or an iterable containing lists but not an iterable containing strings?
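To make the asymmetry in the question concrete, here is a minimal sketch (Python 3, so reduce comes from functools). The variable names are only illustrative; sum2 is the helper defined in the question.

```python
import operator
from functools import reduce

def sum2(iterable, start=0):
    # Left fold with operator.add, beginning at `start`.
    return reduce(operator.add, iterable, start)

numbers = sum([1, 2, 3], 0)                     # numbers are accepted
flattened = sum([[1], [2, 3]], [])              # lists are accepted too
joined_by_sum2 = sum2(["foo", "bar"], "")       # the plain fold handles strings

try:
    sum(["foo", "bar"], "")                     # the builtin refuses a str start
    rejected = False
except TypeError:
    rejected = True
```

So the builtin and the plain fold agree on numbers and lists, and only the builtin refuses strings.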
Accepted answer by rbp
Python tries to discourage you from "summing" strings. You're supposed to join them:
"".join(list_of_strings)
It's a lot faster, and uses much less memory.
A quick benchmark:
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = reduce(operator.add, strings)'
100 loops, best of 3: 8.46 msec per loop
$ python -m timeit -s 'import operator; strings = ["a"]*10000' 'r = "".join(strings)'
1000 loops, best of 3: 296 usec per loop
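The commands above rely on reduce being a builtin, which is only true on Python 2. A rough Python 3 version of the same comparison could look like the sketch below; the function names are my own, and the timings will differ by machine and interpreter version, so no numbers are claimed here.

```python
import operator
import timeit
from functools import reduce

strings = ["a"] * 10000

def concat_with_reduce():
    # Builds a new, longer string at every step.
    return reduce(operator.add, strings)

def concat_with_join():
    # Lets str.join build the result in one go.
    return "".join(strings)

if __name__ == "__main__":
    for fn in (concat_with_reduce, concat_with_join):
        seconds = timeit.timeit(fn, number=100)
        print(f"{fn.__name__}: {seconds / 100 * 1e6:.1f} usec per call")
```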
Edit (to answer OP's edit): As to why strings were apparently "singled out", I believe it's simply a matter of optimizing for a common case, as well as of enforcing best practice: you can join strings much faster with ''.join, so explicitly forbidding strings on sum will point this out to newbies.
BTW, this restriction has been in place "forever", i.e., since sum was added as a built-in function (rev. 32347)
Answer by unutbu
Answer by Debilski
Edit: Moved the parts about immutability to history.
Basically, it's a question of preallocation. When you use a statement such as
sum(["a", "b", "c", ..., ])
and expect it to work similarly to a reduce statement, the code generated looks something like
v1 = "" + "a" # must allocate v1 and set its size to len("") + len("a")
v2 = v1 + "b" # must allocate v2 and set its size to len(v1) + len("b")
...
res = v10000 + "$" # must allocate res and set its size to len(v10000) + len("$")
In each of these steps a new string is created, which for one thing adds copying overhead as the strings get longer and longer. But that may not be the main point here. More important is that every new string on each line must be allocated to its specific size. I don't know whether Python must allocate in every iteration of the reduce statement; there might be some obvious heuristics to use, and Python might allocate a bit more here and there for reuse. But at several points the new string will be too large for this to help, and Python must allocate again, which is rather expensive.
A dedicated method like join, however, has the job of figuring out the real size of the string before it starts. In theory it would therefore allocate only once, at the beginning, and then just fill that new string, which is much cheaper than the other solution.
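The "measure first, allocate once, then fill" idea can be sketched in pure Python. This is only an illustration of the strategy described above, not how str.join is actually implemented; join_preallocated is a hypothetical helper, and it works on UTF-8 bytes because Python strings cannot be filled in place.

```python
def join_preallocated(parts):
    """Concatenate strings using a single up-front buffer allocation."""
    encoded = [p.encode("utf-8") for p in parts]
    total = sum(len(b) for b in encoded)   # pass 1: work out the final size
    buf = bytearray(total)                 # one allocation of exactly that size
    pos = 0
    for b in encoded:                      # pass 2: copy each piece into place
        buf[pos:pos + len(b)] = b
        pos += len(b)
    return buf.decode("utf-8")
```

Each input piece is copied into the buffer once, instead of being recopied every time a longer intermediate string is built.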
Answer by u0b34a0f6ae
You can in fact use sum(..) to concatenate strings, if you use the appropriate starting object! Of course, if you go this far you have already understood enough to use "".join(..) anyway.
>>> class ZeroObject(object):
...     def __add__(self, other):
...         return other
...
>>> sum(["hi", "there"], ZeroObject())
'hithere'
Answer by dan04
Short answer: Efficiency.
Long answer: The sum function has to create an object for each partial sum.
Assume that the amount of time required to create an object is directly proportional to the size of its data. Let N denote the number of elements in the sequence to sum.
doubles are always the same size, which makes sum's running time O(1)×N = O(N).
int (formerly known as long) is arbitrary-length. Let M denote the absolute value of the largest sequence element. Then sum's worst-case running time is lg(M) + lg(2M) + lg(3M) + ... + lg(NM) = N×lg(M) + lg(N!) = O(N log N).
For str (where M = the length of the longest string), the worst-case running time is M + 2M + 3M + ... + NM = M×(1 + 2 + ... + N) = O(N²).
Thus, summing strings would be much slower than summing numbers.
str.join does not allocate any intermediate objects. It preallocates a buffer large enough to hold the joined strings, and copies the string data. It runs in O(N) time, much faster than sum.
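The arithmetic above can be checked by counting how many characters each strategy has to write. This is a simplified cost model under the same assumption as the answer (creating an object costs time proportional to its size); the two function names are made up for the illustration.

```python
def chars_copied_by_repeated_add(strings):
    """Characters written if every partial sum is a brand-new string."""
    copied = 0
    length = 0
    for s in strings:
        length += len(s)   # size of the new partial result
        copied += length   # the whole partial result is written out again
    return copied

def chars_copied_by_join(strings):
    """Characters written if each piece is copied once into one buffer."""
    return sum(len(s) for s in strings)

strings = ["a"] * 1000
print(chars_copied_by_repeated_add(strings))  # 1 + 2 + ... + 1000 = 500500
print(chars_copied_by_join(strings))          # 1000
```

With N = 1000 one-character strings, the repeated-add model writes N(N+1)/2 characters and the join model writes N.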
Answer by HS.
Here's the source: http://svn.python.org/view/python/trunk/Python/bltinmodule.c?revision=81029&view=markup
In the builtin_sum function we have this bit of code:
        /* reject string values for 'start' parameter */
        if (PyObject_TypeCheck(result, &PyBaseString_Type)) {
            PyErr_SetString(PyExc_TypeError,
                "sum() can't sum strings [use ''.join(seq) instead]");
            Py_DECREF(iter);
            return NULL;
        }
        Py_INCREF(result);
    }
So.. that's your answer.
It's explicitly checked in the code and rejected.
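Note that the check in that excerpt looks at the start value, not at the items. A small sketch of what that implies, as I understand CPython's behaviour (the helper name sum_rejects is hypothetical): even an empty list is rejected when start is a string.

```python
def sum_rejects(iterable, start):
    """Return the TypeError message if sum() refuses the call, else None."""
    try:
        sum(iterable, start)
    except TypeError as exc:
        return str(exc)
    return None

# Rejected up front because `start` is a str, although nothing is added.
empty_message = sum_rejects([], "")

# Also rejected, with the same hint to use ''.join(seq).
strings_message = sum_rejects(["foo", "bar"], "")

# A list `start` passes the check, so lists are summed normally.
list_message = sum_rejects([[1], [2]], [])
```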
Answer by Ethan Furman
The Reason Why
@dan04 has an excellent explanation for the costs of using sum on large lists of strings.
The missing piece as to why str is not allowed for sum is that many, many people were trying to use sum for strings, and not many use sum for lists and tuples and other O(n**2) data structures. The trap is that sum works just fine for short lists of strings, but then gets put in production where the lists can be huge, and the performance slows to a crawl. This was such a common trap that the decision was made to ignore duck-typing in this instance, and not allow strings to be used with sum.
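Lists are one of those O(n**2) cases that sum still accepts. As a hedged illustration (the function names are mine, and itertools.chain.from_iterable is just one common linear-time alternative, not something this answer prescribes):

```python
import itertools

def flatten_with_sum(lists):
    # Allowed by sum, but every partial sum is a brand-new list.
    return sum(lists, [])

def flatten_with_chain(lists):
    # Walks each inner list once and builds the result a single time.
    return list(itertools.chain.from_iterable(lists))

nested = [[1, 2], [3], [], [4, 5]]
print(flatten_with_sum(nested) == flatten_with_chain(nested))
```

Both give the same result; the difference only shows up in running time once the outer list gets long.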
Answer by Dinesh Panchananam
I don't know why, but this works!
import operator
from functools import reduce  # needed on Python 3
def sum_of_strings(list_of_strings):
    return reduce(operator.add, list_of_strings)

