Python: a more Pythonic way of skipping header lines

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Stack Overflow original: http://stackoverflow.com/questions/1730649/

Time: 2020-11-03 22:58:40  Source: igfitidea

More pythonic way of skipping header lines

python

Asked by pufferfish

Is there a shorter (perhaps more pythonic) way of opening a text file and reading past the lines that start with a comment character?

In other words, a neater way of doing this:

fin = open("data.txt")
line = fin.readline()
while line.startswith("#"):
    line = fin.readline()

Answered by Robert Rossney

At this stage in my arc of learning Python, I find this most Pythonic:

from itertools import dropwhile

def iscomment(s):
    return s.startswith('#')

with open(filename, 'r') as f:
    for line in dropwhile(iscomment, f):
        pass  # do something with line

This skips all of the lines at the top of the file that start with #. To skip every line starting with #, wherever it appears:

from itertools import ifilterfalse  # Python 2 name; Python 3 calls it filterfalse
with open(filename, 'r') as f:
    for line in ifilterfalse(iscomment, f):
        pass  # do something with line

That's almost all about readability for me; functionally there's almost no difference between:

for line in ifilterfalse(iscomment, f):

and

for line in (x for x in f if not x.startswith('#'))

Breaking out the test into its own function makes the intent of the code a little clearer; it also means that if your definition of a comment changes you have one place to change it.

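To make the difference between the two itertools helpers concrete, here is a small self-contained sketch. It assumes Python 3 (where ifilterfalse is named filterfalse) and uses io.StringIO with made-up sample text in place of a real file; the names sample, header_skipped and all_skipped are illustrations only, not part of the answer.

```python
import io
from itertools import dropwhile, filterfalse

def iscomment(s):
    return s.startswith('#')

# Hypothetical sample data standing in for a real file.
sample = "# header 1\n# header 2\ndata 1\n# inline comment\ndata 2\n"

# dropwhile stops testing after the first non-comment line,
# so the later comment line is kept.
header_skipped = list(dropwhile(iscomment, io.StringIO(sample)))

# filterfalse tests every line, so all comment lines are removed.
all_skipped = list(filterfalse(iscomment, io.StringIO(sample)))

print(header_skipped)
print(all_skipped)
```

Only the dropwhile version leaves the comment in the middle of the data alone, which is the behaviour the question asks for.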

Answered by SilentGhost

for line in open('data.txt'):
    if line.startswith('#'):
        continue
    # work with line

Of course, if your commented lines are only at the beginning of the file, you might use some optimisations.

Answered by ephemient

from itertools import dropwhile
for line in dropwhile(lambda line: line.startswith('#'), open('data.txt')):
    pass

Answered by Wim

If you want to filter out all comment lines (not just those at the start of the file):

for line in open("data.txt"):
    if not line.startswith("#"):
        pass  # process line

If you only want to skip those at the start, then see ephemient's answer using itertools.dropwhile.

Answered by Wernsey

You could use a generator function:

def readlines(filename):
    fin = open(filename)
    for line in fin:
        if not line.startswith("#"):
            yield line

and use it like this:

for line in readlines("data.txt"):
    # do things
    pass

Depending on exactly where the files come from, you may also want to strip() the lines before the startswith() check. I once had to debug a script like that months after it was written because someone put in a couple of space characters before the '#'.

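A minimal sketch of that whitespace-tolerant check, assuming that lstrip() (leading whitespace only) is enough for the test. The helper name noncomment_lines and the sample text are hypothetical, and io.StringIO stands in for a real file.

```python
import io

def noncomment_lines(lines):
    """Yield lines whose first non-whitespace character is not '#'."""
    for line in lines:
        # lstrip() is applied only for the test; the yielded line is unchanged.
        if not line.lstrip().startswith("#"):
            yield line

# Hypothetical input: one plain comment, one indented comment, one data line.
sample = io.StringIO("# plain comment\n   # indented comment\nvalue = 1\n")
kept = list(noncomment_lines(sample))
print(kept)
```

A plain startswith("#") test would have let the indented comment through.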

Answered by Jim Dennis

As a practical matter, if I knew I was dealing with reasonably sized text files (anything which will comfortably fit in memory), then I'd probably go with something like:

f = open("data.txt")
lines = [ x for x in f.readlines() if x[0] != "#" ]

... to snarf in the whole file and filter out all lines that begin with the octothorpe.

As others have pointed out, one might want to ignore leading whitespace occurring before the octothorpe, like so:

lines = [ x for x in f.readlines() if not x.lstrip().startswith("#") ]

I like this for its brevity.

This assumes that we want to strip out all of the comment lines.

We can also "chop" the last character (almost always a newline) off the end of each line using:

lines = [ x[:-1] for x in ... ]

... assuming that we're not worried about the infamously obscure issue of a missing final newline on the last line of the file. (The only time a line from .readlines() or the related file-like object methods might NOT end in a newline is at EOF.)

In reasonably recent versions of Python one can "chomp" (only newlines) off the ends of the lines using a conditional expression like so:

lines = [ x[:-1] if x[-1]=='\n' else x for x in ... ]

... which is about as complicated as I'll go with a list comprehension for legibility's sake.

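The difference between "chop" and "chomp" only shows up on a final line without a newline. A small sketch with made-up data follows; the rstrip('\n') variant is an alternative not mentioned in the answer, and note that it removes every trailing newline rather than just one.

```python
# Hypothetical lines as readlines() might return them;
# the last line deliberately lacks a trailing newline.
lines = ["first\n", "# comment\n", "last"]

# "chop": always drops the final character, even when it is not a newline.
chopped = [x[:-1] for x in lines]

# "chomp": drops the final character only when it is a newline.
chomped = [x[:-1] if x[-1] == '\n' else x for x in lines]

# Alternative: str.rstrip('\n') also copes with empty strings,
# where x[-1] would raise IndexError.
stripped = [x.rstrip('\n') for x in lines]

print(chopped)
print(chomped)
print(stripped)
```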

If we were worried about the possibility of an overly large file (or low memory constraints) impacting our performance or stability, and we're using a version of Python that's recent enough to support generator expressions (which are more recent additions to the language than the list comprehensions I've been using here), then we could use:

# iterate over f directly: f.readlines() would read the whole file into memory
for line in (x[:-1] if x[-1]=='\n' else x for x in
  f if not x.lstrip().startswith('#')):
    # do stuff with each line
    pass

... which is at the limits of what I'd expect anyone else to parse in one line a year after the code's been checked in.

If the intent is only to skip "header" lines then I think the best approach would be:

f = open('data.txt')
for line in f:
    if line.lstrip().startswith('#'):
        continue

... and be done with it.

Answered by ???u

You could make a generator that loops over the file and skips those lines:

fin = open("data.txt")
fileiter = (l for l in fin if not l.startswith('#'))

for line in fileiter:
   ...

Answered by Corey Porter

You could do something like:

def drop(n, seq):
    for i, x in enumerate(seq):
        if i >= n:
            yield x

And then say:

for line in drop(1, open(filename)):
    pass  # whatever
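
For a fixed number of header lines, the standard library's itertools.islice appears to do the same job as the hand-written drop(). The sketch below compares the two on a made-up list of rows; rows, via_drop and via_islice are illustrative names only.

```python
from itertools import islice

def drop(n, seq):
    # Same helper as above: skip the first n items of seq.
    for i, x in enumerate(seq):
        if i >= n:
            yield x

# Hypothetical data: one header line followed by two rows.
rows = ["header\n", "row 1\n", "row 2\n"]

via_drop = list(drop(1, rows))
# islice(iterable, start, stop): start=1 skips one item, stop=None means "to the end".
via_islice = list(islice(rows, 1, None))

print(via_drop)
print(via_islice)
```

Either way, this only helps when the number of lines to skip is known in advance.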

Answered by steveha

I like @iWerner's generator function idea. One small change to his code and it does what the question asked for.

def readlines(filename):
    f = open(filename)
    # discard first lines that start with '#'
    for line in f:
        if not line.lstrip().startswith("#"):
            # first non-comment line: yield it, then stop skipping
            yield line
            break

    for line in f:
        yield line

and use it like this:

for line in readlines("data.txt"):
    # do things
    pass

But here is a different approach. It is very nearly simple. The idea is that we open the file and get a file object, which we can use as an iterator. Then we pull the lines we don't want out of the iterator, and just return the iterator. This would be ideal if we always knew how many lines to skip. The problem here is that we don't know how many lines we need to skip; we just need to pull lines and look at them. And there is no way to put a line back into the iterator once we have pulled it.

So: open the iterator, pull lines and count how many have the leading '#' character; then use the .seek() method to rewind the file, pull the correct number again, and return the iterator.

One thing I like about this: you get the actual file object back, with all its methods; you can just use this instead of open() and it will work in all cases. I renamed the function to open_my_text() to reflect this.

def open_my_text(filename):
    f = open(filename, "rt")
    # count number of lines that start with '#'
    count = 0
    for line in f:
        if not line.lstrip().startswith("#"):
            break
        count += 1

    # rewind file, and discard lines counted above
    f.seek(0)
    for _ in range(count):
        f.readline()

    # return file object with comment lines pre-skipped
    return f

Instead of f.readline() I could have used f.next() (for Python 2.x) or next(f) (for Python 3.x), but I wanted to write it so it was portable to any Python.

EDIT: Okay, I know nobody cares and I'm not getting any upvotes for this, but I have rewritten my answer one last time to make it more elegant.

You can't put a line back into an iterator. But, you can open a file twice, and get two iterators; given the way file caching works, the second iterator is almost free. If we imagine a file with a megabyte of '#' lines at the top, this version would greatly outperform the previous version that calls f.seek(0).

def open_my_text(filename):
    # open the same file twice to get two file objects
    # (We are opening the file read-only so this is safe.)
    ftemp = open(filename, "rt")
    f = open(filename, "rt")

    # use ftemp to look at lines, then discard from f
    for line in ftemp:
        if not line.lstrip().startswith("#"):
            break
        f.readline()

    # the look-ahead handle is no longer needed
    ftemp.close()

    # return file object with comment lines pre-skipped
    return f

This version is much better than the previous version, and it still returns a full file object with all its methods.

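On the point that a pulled line cannot be put back into an iterator: one possible workaround, not taken from this answer, is to re-attach the line with itertools.chain. The sketch below assumes you only need to iterate over the remaining lines; unlike open_my_text() it returns a plain iterator rather than the file object, so the file methods are lost. The name skip_header and the sample text are hypothetical.

```python
import io
from itertools import chain

def skip_header(lines):
    """Return an iterator over lines with the leading '#' lines removed."""
    it = iter(lines)
    for line in it:
        if not line.lstrip().startswith("#"):
            # Re-attach the line we had to read in order to inspect it.
            return chain([line], it)
    # Every line was a comment, or the input was empty.
    return iter(())

# Hypothetical input standing in for an open file.
sample = io.StringIO("# a\n# b\nx\n# c\ny\n")
result = list(skip_header(sample))
print(result)
```

This reads the header only once and needs neither seek() nor a second file handle.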