Is there a generator version of `string.split()` in Python?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA terms and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/3862010/
Is there a generator version of `string.split()` in Python?
Asked by Manoj Govindan
string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?
Accepted answer by ninjagecko
It is highly probable that re.finditer uses fairly minimal memory overhead.
import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))
Demo:
>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']
Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a string of very large size (1GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in noticeable memory growth (that is, if memory did grow, it was far, far less than the 1GB string).
Answered by Ignacio Vazquez-Abrams
No, but it should be easy enough to write one using itertools.takewhile().
EDIT:
Very simple, half-broken implementation:
import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
            r.append(c)
        if r:
            yield ''.join(r)
        else:
            # half-broken: a run of consecutive whitespace ends iteration early
            return  # `raise StopIteration()` is an error inside generators since PEP 479
Answered by Dave Webb
I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over so you're not going to save any memory by having a generator.
If you wanted to write one it would be fairly easy though:
import string

def gsplit(s, sep=string.whitespace):
    word = []
    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)
    if word:
        yield "".join(word)
Answered by Bernd Petersohn
This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.
import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()
sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["
assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')
EDIT: Corrected handling of surrounding whitespace if no separator chars are given.
Answered by Eli Collins
The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and avoids the overhead of a regexp when it isn't needed.
[edit 2016-8-2: updated this to optionally support regex separators]
import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)
    :param sep:
        separator to split on.
    :param regex:
        if True, will treat sep as regular expression.
    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = r"\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize
This can be used like you want...
>>> list(isplit("abcb", "b"))
['a', 'c', '']
While there is a little bit of cost to seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.
Answered by Oleh Prypin
Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.
I'll just copy the docstring of the main str_split function:
str_split(s, *delims, empty=None)

Split the string s by the rest of the arguments, possibly omitting
empty parts (the empty keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
empty is then True by default.

    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by the longest
possible sequences of those delimiters by default, or, if empty is set to
True, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.

    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect
is the same as str.split(), except this function is a generator.

    str_split('aaa\t bb c \n')
        -> 'aaa', 'bb', 'c'
import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i + 1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]

def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i + dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start != i:
                yield s[start:i]
            start = i + dlen
    except ValueError:
        pass
    if start < len(s):
        yield s[start:]

def str_split(s, *delims, empty=None):
    """\
    Split the string `s` by the rest of the arguments, possibly omitting
    empty parts (`empty` keyword argument is responsible for that).
    This is a generator function.

    When only one delimiter is supplied, the string is simply split by it.
    `empty` is then `True` by default.
        str_split('[]aaa[][]bb[c', '[]')
            -> '', 'aaa', '', 'bb[c'
        str_split('[]aaa[][]bb[c', '[]', empty=False)
            -> 'aaa', 'bb[c'

    When multiple delimiters are supplied, the string is split by longest
    possible sequences of those delimiters by default, or, if `empty` is set to
    `True`, empty strings between the delimiters are also included. Note that
    the delimiters in this case may only be single characters.
        str_split('aaa, bb : c;', ' ', ',', ':', ';')
            -> 'aaa', 'bb', 'c'
        str_split('aaa, bb : c;', *' ,:;', empty=True)
            -> 'aaa', '', 'bb', '', '', 'c', ''

    When no delimiters are supplied, `string.whitespace` is used, so the effect
    is the same as `str.split()`, except this function is a generator.
        str_split('aaa\t bb c \n')
            -> 'aaa', 'bb', 'c'
    """
    if len(delims) == 1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims) == 0:
        delims = string.whitespace
    delims = set(delims) if len(delims) >= 4 else ''.join(delims)
    if any(len(d) > 1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)
This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both versions 2 and 3. The first lines of the function should be changed to:
def str_split(s, *delims, **kwargs):
    """...docstring..."""
    empty = kwargs.get('empty')
Answered by travelingbones
def split_generator(f, s):
    """
    f is a string, s is the single-character separator we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif f[i] != s:
            i = i + 1
        else:
            yield f[j:i]  # yield the piece itself, not a one-element list
            j = i + 1
            i = i + 1
Answered by dshepherd
I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default and you can specify a delimiter).
import re

def isplit(string, delimiter=None):
    """Like string.split but returns an iterator (lazy).

    Multiple character delimiters are not handled.
    """
    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"
    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                         delimiter)
    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)
    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))
Here are the tests I used (in both python 3 and python 2):
# Wrapper to make it a list
def helper(*args, **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2 3 ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass
Python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.
Also available as a gist.
Answered by reubano
If you would also like to be able to read an iterator (as well as return one) try this:
import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)
Usage
>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
Answered by c z
Did some performance testing on the various methods proposed (I won't repeat them here). Some results:
- str.split (default) = 0.3461570239996945
- manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
- re.finditer (ninjagecko's answer) = 0.698872097000276
- str.find (one of Eli Collins's answers) = 0.7230395330007013
- itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
- str.split(..., maxsplit=1) recursion = N/A†
† The recursion answers (string.split with maxsplit=1) fail to complete in a reasonable time; given string.split's speed they may work better on shorter strings, but then I can't see the use-case for short strings where memory isn't an issue anyway.
Tested using timeit on:
the_text = "100 " * 9999 + "100"

def test_function(method):
    def fn():
        total = 0
        for x in method(the_text):
            total += int(x)
        return total
    return fn
This raises another question as to why string.split is so much faster despite its memory usage.

