Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/16645083/

Date: 2020-08-18 23:15:35  Source: igfitidea

When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

Tags: python, string, algorithm, parsing, split

Asked by godice

I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns ['']. Is there any specific reason for such a difference?

And is there any more convenient way to count lines in a string?

Accepted answer by Raymond Hettinger

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns empty list [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) will produce the first empty field. Consider: if you had written '\n'.split('\n'), you would get two fields (one split gives you two halves).
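
The two modes can be put side by side on empty and newline-only input. This is only a small sketch; the variable names are illustrative and not part of the original answer.

```python
# No argument: runs of whitespace act as separators and empty results are dropped.
no_arg_empty = ''.split()          # nothing left after whitespace is consumed
no_arg_newline = '\n'.split()      # the newline is whitespace, so still nothing

# Explicit separator: every occurrence is a cut, so empty fields are kept.
sep_empty = ''.split('\n')         # zero cuts, one (empty) field
sep_newline = '\n'.split('\n')     # one cut, two (empty) fields

print(no_arg_empty, no_arg_newline)   # [] []
print(sep_empty, sep_newline)         # [''] ['', '']
```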

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print(line.split(','))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut gives two pieces. Making two cuts gives three pieces. And so it is with Python's str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']
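
The rope analogy can be checked mechanically. The sketch below assumes a non-empty separator and no maxsplit argument; check_split_invariants is a hypothetical helper name, not something from the standard library.

```python
def check_split_invariants(s, sep):
    """Split s on sep and verify the 'cuts + 1 = pieces' rule."""
    fields = s.split(sep)
    # One more field than there are (non-overlapping) separators.
    assert len(fields) == s.count(sep) + 1
    # Joining the fields with the separator restores the original string.
    assert sep.join(fields) == s
    return fields

for sample in ['', ',', ',,', 'a,b', 'Tim,,,USA']:
    print(repr(sample), '->', check_split_invariants(sample, ','))
```

Neither property holds for the no-argument mode, because that mode discards the whitespace it splits on.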

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both give the same answer unless the final line is missing its \n. If the final newline is missing, the str.splitlines approach gives the accurate answer. A faster technique that is also accurate uses the count method and then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4    
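
One edge case worth keeping in mind: for an empty string the corrected count expression yields 1, while len(''.splitlines()) yields 0. Also, str.splitlines() treats other boundaries such as \r as line breaks, so the two approaches can disagree on text that is not purely \n-delimited. A possible wrapper, assuming an empty string should count as zero lines (count_lines is a hypothetical name):

```python
def count_lines(text):
    """Count '\\n'-terminated lines; a final line without '\\n' still counts."""
    if not text:
        return 0  # assumption: empty input has no lines
    # bool is an int subclass, so adding True adds 1.
    return text.count('\n') + (not text.endswith('\n'))

for sample in ['', 'a', 'a\n', 'a\nb', 'a\nb\n']:
    print(repr(sample), count_lines(sample), len(sample.splitlines()))
```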

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = "USER               PID  %CPU %MEM      VSZ"
patient_header = "name,age,height,weight"

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".

Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.

Answer by unwind

It seems to simply be the way it's supposed to work, according to the documentation:

Splitting an empty string with a specified separator returns [''].

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
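
The documented behaviour can be seen on whitespace-only and padded strings. A brief sketch with made-up sample values:

```python
blank = '   '              # three spaces
print(blank.split())       # [] : whitespace-only input yields no fields
print(blank.split(None))   # [] : passing None explicitly selects the same algorithm
print(blank.split(' '))    # ['', '', '', ''] : three cuts, four empty fields

padded = '  a  b  '
print(padded.split())      # ['a', 'b'] : runs collapse, edges are trimmed
print(padded.split(' '))   # ['', '', 'a', '', 'b', '', ''] : every space is a cut
```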

So, to make it clearer, the split() function implements two different splitting algorithms, and uses the presence of an argument to decide which one to run. This might be because it allows optimizing the one for no arguments more than the one with arguments; I don't know.

Answer by Jakub M.

To count lines, you can count the number of line breaks:

n_lines = sum(1 for s in the_string if s == "\n") + 1 # add 1 for last line

Edit:

The other answer, with the built-in count, is actually more suitable.

Answer by Gareth Webber

Use count():

s = "Line 1\nLine2\nLine3"
n_lines = s.count('\n') + 1

Answer by Lennart Regebro

.split() without parameters tries to be clever. It splits on any whitespace (tabs, spaces, line feeds, etc.), and as a result it also skips all empty strings.

>>> "  fii    fbar \n bopp ".split()
['fii', 'fbar', 'bopp']

Essentially, .split() without parameters is used to extract words from a string, as opposed to .split() with parameters, which just takes a string and splits it.

That's the reason for the difference.

And yeah, counting lines by splitting is not an efficient way. Count the number of line feeds, and add one if the string doesn't end with a line feed.

Answer by Bakuriu

>>> print str.split.__doc__
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string.  If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.

Note the last sentence.
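
The docstring also mentions maxsplit, which works with both algorithms. A short sketch with illustrative sample strings; note that in the whitespace mode the unsplit remainder keeps its trailing whitespace:

```python
record = 'Guido,BDFL,,Amsterdam'
# At most two cuts: the rest of the string stays in the last field.
print(record.split(',', 2))    # ['Guido', 'BDFL', ',Amsterdam']

spaced = '   1   2   3   '
# sep=None with maxsplit=1: leading whitespace is dropped, the remainder is left as-is.
print(spaced.split(None, 1))   # ['1', '2   3   ']
```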

To count lines, you can simply count how many \n there are:

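# Parentheses are required because + binds tighter than !=.
# Assumes some_string is non-empty (indexing [-1] fails on '').
line_count = some_string.count('\n') + (some_string[-1] != '\n')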

The last part accounts for a final line that does not end with \n. This means that Hello, World! and Hello, World!\n have the same line count (which seems reasonable to me); otherwise you can simply add 1 to the count of \n.
