Original question: http://stackoverflow.com/questions/1820336/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
Stack Overflow
tokenize a string keeping delimiters in Python
Asked by fortran
Is there any equivalent to str.split in Python that also returns the delimiters?
I need to preserve the whitespace layout for my output after processing some of the tokens.
Example:
>>> s = "\tthis is an example"
>>> print(s.split())
['this', 'is', 'an', 'example']
>>> print(what_I_want(s))
['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
Thanks!
Answered by Jonathan Feinberg
How about:
import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
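As a quick check, here is a minimal sketch of that pattern applied to the string from the question. The variable name tokens is only illustrative. Because every character is either whitespace or non-whitespace, findall should return alternating runs that cover the whole input.

```python
import re

# Match either a run of whitespace or a run of non-whitespace.
splitter = re.compile(r'(\s+|\S+)')

s = "\tthis is an example"  # the string from the question
tokens = splitter.findall(s)
print(tokens)  # ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']

# No character is dropped, so joining the tokens rebuilds the input.
assert "".join(tokens) == s
```

Unlike the re.split approaches, this one produces no empty strings at the edges.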
Answered by Denis Otkidach
>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an example")
['', '\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
Answered by Tim Pietzcker
The re module provides this functionality:
>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
(quoted from the Python documentation).
For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').
The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.
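To make the effect of the capturing parentheses concrete, this small sketch runs the same split with and without the group on the question's string. The names without_group and with_group are just labels for the comparison.

```python
import re

s = "\tthis is an example"

without_group = re.split(r'\s+', s)   # delimiters are discarded
with_group = re.split(r'(\s+)', s)    # delimiters are kept as list items

print(without_group)  # ['', 'this', 'is', 'an', 'example']
print(with_group)     # ['', '\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example']
```

Both results start with an empty string because the input begins with a delimiter.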
Edit: As pointed out, any leading or trailing delimiters will of course also be added to the list. To avoid that, you can call the .strip() method on your input string first.
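Note that .strip() also throws away the leading tab that the question wants to keep. One possible alternative, sketched here as an assumption rather than as part of the original answer, is to leave the input alone and filter out the empty strings that re.split puts at the edges. The trailing newline in the sample string is added only to show both ends.

```python
import re

s = "\tthis is an example\n"

# Option from the answer: strip first, losing the outer whitespace.
stripped = re.split(r'(\s+)', s.strip())

# Alternative: split the original and drop only the empty strings.
filtered = [tok for tok in re.split(r'(\s+)', s) if tok]

print(stripped)  # ['this', ' ', 'is', ' ', 'an', ' ', 'example']
print(filtered)  # ['\t', 'this', ' ', 'is', ' ', 'an', ' ', 'example', '\n']
```

The filtered list can be joined back into the original string, which is what preserving the whitespace layout requires.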
Answered by jcdyer
Have you looked at pyparsing? Example borrowed from the pyparsing wiki:
>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
...
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
Answered by fortran
Thanks, guys, for pointing out the re module. I'm still trying to decide between that and using my own function that returns a sequence...
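def split_keep_delimiters(s, delims="\t\n\r "):
    # Yield alternating runs of delimiter and non-delimiter characters.
    if not s:
        return  # nothing to yield; also avoids an IndexError on s[0]
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The run type changed: emit the finished run and start a new one.
            delim_group ^= True
            yield s[start:index]
            start = index
    yield s[start:]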
If I had time I'd benchmark them xD
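For anyone who does have the time, a rough benchmark could look like the sketch below. It uses the standard timeit module. The helper split_with_re, the sample string, and the run counts are assumptions made up for illustration, and the generator is repeated here (with a guard for empty input) only so the snippet runs on its own. Timings will vary by machine and input, so no numbers are claimed.

```python
import re
import timeit

def split_keep_delimiters(s, delims="\t\n\r "):
    """Hand-written generator, with a guard for empty input."""
    if not s:
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]

WHITESPACE = re.compile(r'(\s+)')

def split_with_re(s):
    """re.split with a capturing group, dropping the empty edge strings."""
    return [tok for tok in WHITESPACE.split(s) if tok]

# Hypothetical test input; real data may behave differently.
sample = "\tthis is an example\n" * 200

# Both approaches should agree before their timings are compared.
assert list(split_keep_delimiters(sample)) == split_with_re(sample)

for name, func in [("generator", lambda: list(split_keep_delimiters(sample))),
                   ("re.split", lambda: split_with_re(sample))]:
    seconds = timeit.timeit(func, number=50)
    print("%-10s %.4f s for 50 runs" % (name, seconds))
```

One caveat on the comparison: \s matches more characters than the default delims string (form feeds and Unicode spaces, for example), so the two functions only agree on inputs limited to tab, newline, carriage return, and space.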

