Original question: http://stackoverflow.com/questions/1820336/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Tokenize a string keeping delimiters in Python
Asked by fortran
Is there an equivalent to str.split in Python that also returns the delimiters?
I need to preserve the whitespace layout for my output after processing some of the tokens.
Example:
>>> s = "\tthis is an example"
>>> print(s.split())
['this', 'is', 'an', 'example']
>>> print(what_I_want(s))
['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
Thanks!
Answered by Jonathan Feinberg
How about:
import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
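To see what this returns, here is a minimal sketch that runs the pattern on the example string from the question. The variable name tokens is just for illustration. Because the pattern matches either a run of whitespace or a run of non-whitespace, every character lands in exactly one token, so joining the tokens gives back the original string.

```python
import re

# Match either a run of whitespace or a run of non-whitespace characters
splitter = re.compile(r'(\s+|\S+)')

s = "\tthis is an example"
tokens = splitter.findall(s)
print(tokens)  # ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']

# The tokens cover the whole input, so the layout can be rebuilt exactly
print("".join(tokens) == s)  # True
```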
Answered by Denis Otkidach
>>> re.compile(r'(\s+)').split("\tthis is an example")
['', '\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
Answered by Tim Pietzcker
The re module provides this functionality:
>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
(quoted from the Python documentation).
For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').
The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.
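A small side-by-side sketch of that point, using the same string as the documentation example (the variable names are only illustrative): without the parentheses the separators are discarded, with them they are returned as list items.

```python
import re

text = 'Words, words, words.'

# No capturing group: the separators are thrown away
without_group = re.split(r'\W+', text)
print(without_group)  # ['Words', 'words', 'words', '']

# Capturing group: each separator is returned between the pieces
with_group = re.split(r'(\W+)', text)
print(with_group)  # ['Words', ', ', 'words', ', ', 'words', '.', '']
```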
Edit: As pointed out, any leading/trailing delimiters will of course also be added to the list (along with an empty string at that end of the list). To avoid that, you can call the .strip() method on your input string first.
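Note that .strip() would also throw away the leading tab that the question wants to keep. A possible alternative, sketched here with a hypothetical helper name (split_keep_whitespace is not part of re), is to keep the input as it is and filter out the empty strings that re.split produces at the edges:

```python
import re

def split_keep_whitespace(text):
    """Split text on whitespace runs, keeping the runs as tokens.

    Hypothetical helper for illustration only.
    """
    parts = re.split(r'(\s+)', text)
    # re.split returns '' at an end where the text starts or ends
    # with whitespace; drop those empty strings
    return [part for part in parts if part]

tokens = split_keep_whitespace("\tthis is an example")
print(tokens)  # ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
```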
Answered by jcdyer
Have you looked at pyparsing? Example borrowed from the pyparsing wiki:
>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
...
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
Answered by fortran
Thanks, guys, for pointing out the re module. I'm still trying to decide between that and using my own generator function, which yields the tokens one by one:
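def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:
        return  # nothing to yield for an empty string
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        # A run ends where membership in delims flips
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # the final run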
If I had time I'd benchmark them xD
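For anyone who does have the time, a rough benchmark could look like the sketch below. It uses timeit from the standard library. The sample input, the repeat count and the wrapper names (with_findall, with_split, with_generator) are all assumptions made for illustration, and the timings will vary by machine and Python version. Keep in mind that \s matches more characters than the "\t\n\r " default used by the generator, so the three approaches only agree on inputs like this one.

```python
import re
import timeit

# Hypothetical test input: the example string repeated many times
SAMPLE = "\tthis is an example " * 100

_FINDALL = re.compile(r'(\s+|\S+)')
_SPLIT = re.compile(r'(\s+)')

def with_findall(s):
    return _FINDALL.findall(s)

def with_split(s):
    # Drop the empty strings re.split emits at the edges
    return [token for token in _SPLIT.split(s) if token]

def with_generator(s, delims="\t\n\r "):
    # Same logic as split_keep_delimiters above, collected into a list
    tokens = []
    if not s:
        return tokens
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            tokens.append(s[start:index])
            start = index
    tokens.append(s[start:])
    return tokens

# Check that all three agree on this input before timing them
expected = with_findall(SAMPLE)
all_agree = with_split(SAMPLE) == expected and with_generator(SAMPLE) == expected
print("all approaches agree:", all_agree)

for func in (with_findall, with_split, with_generator):
    seconds = timeit.timeit(lambda: func(SAMPLE), number=200)
    print("%s: %.4f s for 200 runs" % (func.__name__, seconds))
```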