Tokenize a string keeping delimiters in Python

Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not this site). Original question: http://stackoverflow.com/questions/1820336/

Tags: python, string, split, tokenize

Asked by fortran

Is there any equivalent to str.split in Python that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.


Example:


>>> s="\tthis is an  example"
>>> print s.split()
['this', 'is', 'an', 'example']

>>> print what_I_want(s)
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!


Answer by Jonathan Feinberg

How about


import re
# Match either a run of whitespace or a run of non-whitespace,
# so findall() returns tokens and the gaps between them, in order.
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
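
Applied to the string from the question, that gives (a quick interpreter check):

>>> import re
>>> s = "\tthis is an  example"
>>> re.compile(r'(\s+|\S+)').findall(s)
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']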

Answer by Denis Otkidach

>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Answer by Tim Pietzcker

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).


For your example (split on whitespace), use re.split(r'(\s+)', '\tThis is an example').

The key is to enclose the regex on which to split in capturing parentheses. That way, the delimiters are added to the list of results.

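To see what the capturing group changes, compare the two calls below on the question's string (a minimal sketch):

>>> import re
>>> re.split(r'\s+', "\tthis is an  example")
['', 'this', 'is', 'an', 'example']
>>> re.split(r'(\s+)', "\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']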

Edit: As pointed out, any preceding/trailing delimiters will of course also be added to the list. To avoid that, you can use the .strip() method on your input string first.
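
For instance, stripping first drops the empty leading element along with the leading tab (a quick sketch; only appropriate if that leading whitespace isn't needed):

>>> import re
>>> re.split(r'(\s+)', "\tthis is an  example".strip())
['this', ' ', 'is', ' ', 'an', '  ', 'example']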

Answer by jcdyer

Have you looked at pyparsing? Example borrowed from the pyparsing wiki:


>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})
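
If a plain Python list is preferred over the ParseResults shown above, the result can be converted with asList() (a short sketch using the same grammar):

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> greet.parseString('Hello, World!').asList()
['Hello', ',', 'World', '!']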

Answer by fortran

Thanks guys for pointing out the re module; I'm still trying to decide between that and using my own function that returns a sequence...

def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:  # an empty string would otherwise raise an IndexError below
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The character class changed, so emit the group we just finished.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # emit the final group
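
For what it's worth, a quick interpreter check of the generator above against the question's example string (using the version as written here) reproduces the desired layout-preserving split:

>>> s = "\tthis is an  example"
>>> list(split_keep_delimiters(s))
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']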

If I had time I'd benchmark them xD
