Python: What is the most efficient way to get the first and last line of a text file?

Question

Asked by pasbino

I have a text file which contains a time stamp on each line. My goal is to find the time range. All the times are in order so the first line will be the earliest time and the last line will be the latest time. I only need the very first and very last line. What would be the most efficient way to get these lines in python?

Note: These files are relatively long, about 1-2 million lines each, and I have to do this for several hundred files.

Answer 1

Accepted answer by SilentGhost

See the docs for the io module.

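with open(fname, 'rb') as fh:
    first = next(fh).decode()           # First line of the file.

    # Seek to 1024 bytes before the end (whence=2 means "from the end").
    # Raises OSError if the file is shorter than 1024 bytes.
    fh.seek(-1024, 2)
    last = fh.readlines()[-1].decode()  # Last line inside that window.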

The value 1024 here is a variable: it stands in for the average line length. I chose 1024 only as an example. If you have an estimate of the average line length, you could use twice that value.
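
As a rough illustration of the "twice the average line length" idea, the sketch below wraps the seek in a function. The name last_line and the parameter avg_len are hypothetical, and the window is clamped to the file size so that short files do not raise an error. If the last line is longer than the window, the result is only the tail of that line.

```python
import os

def last_line(fname, avg_len=512):
    """Return the last line of fname, reading only a window near the end.

    avg_len is an assumed estimate of the average line length in bytes.
    """
    window = 2 * avg_len
    with open(fname, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)      # seek() returns the new position.
        fh.seek(max(size - window, 0))      # Never seek before the start.
        lines = fh.readlines()
    return lines[-1].decode() if lines else ""
```

An empty file yields an empty string here rather than an exception, which is a design choice of this sketch and not part of the answer above.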

Since you have no idea whatsoever about the possible upper bound for the line length, the obvious solution would be to loop over the file:

for line in fh:
    pass
last = line

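For the loop you don't need to bother with the binary flag; you could just use open(fname). The seek from the end shown earlier does need binary mode in Python 3, where text-mode files only accept end-relative seeks with an offset of zero.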

ETA: Since you have many files to work on, you could draw a sample of a couple of dozen files using random.sample and run this code on them, with an a priori large value for the position shift (say 1 MB), to determine the length of the last line. This will help you estimate the value for the full run.
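
The sampling idea could be sketched as follows. The function names, the default sample size of 24, and the 1 MB shift are all assumptions chosen for illustration; paths is expected to be a list of file names.

```python
import os
import random

def last_line_length(path, shift=1024 * 1024):
    """Length in bytes of the last line, found with a deliberately large shift."""
    with open(path, "rb") as fh:
        size = fh.seek(0, os.SEEK_END)
        fh.seek(max(size - shift, 0))       # Clamp for files smaller than shift.
        lines = fh.readlines()
    return len(lines[-1]) if lines else 0

def estimate_shift(paths, k=24, seed=None):
    """Sample up to k files and return the longest last line seen among them."""
    rng = random.Random(seed)
    sample = rng.sample(paths, min(k, len(paths)))
    return max((last_line_length(p) for p in sample), default=0)
```

The returned value is only an estimate from the sample, so it is probably wise to add a generous margin before using it as the shift for the full run.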

Answer 2

Answer by beitar

Can you use Unix commands? I think head -1 and tail -n 1 are probably the most efficient methods. Alternatively, you could use a simple fid.readline() to get the first line and fid.readlines()[-1] to get the last, but that may take too much memory.
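
If the Unix tools are available, they can be called from Python with the subprocess module. This is a minimal sketch that assumes head and tail are on the PATH (so it will not work on a stock Windows install); the function name first_and_last is made up for the example.

```python
import subprocess

def first_and_last(path):
    """Return (first, last) lines as bytes using the head and tail commands."""
    first = subprocess.run(["head", "-n", "1", path],
                           check=True, capture_output=True).stdout
    last = subprocess.run(["tail", "-n", "1", path],
                          check=True, capture_output=True).stdout
    return first, last
```

Each call starts a new process, so for several hundred files the process start-up cost may matter more than the file access itself.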

Answer 3

Answer by msw

Getting the first line is trivially easy. For the last line, presuming you know an approximate upper bound on the line length, os.lseek some amount back from SEEK_END, find the second-to-last line ending, and then readline() the last line.
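
One way this could look with the low-level os functions is sketched below. The name last_line_lseek and the default bound of 4096 bytes are assumptions, and instead of calling readline() the sketch slices the bytes it has already read.

```python
import os

def last_line_lseek(path, max_line_len=4096):
    """Return the last line as bytes, assuming no line exceeds max_line_len."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        back = min(size, max_line_len + 1)   # Do not seek before the start.
        os.lseek(fd, -back, os.SEEK_END)
        tail = os.read(fd, back)
    finally:
        os.close(fd)
    # Ignore one trailing newline, then cut after the previous line ending.
    end = len(tail) - 1 if tail.endswith(b"\n") else len(tail)
    cut = tail.rfind(b"\n", 0, end)
    return tail[cut + 1:]
```

If no earlier line ending is found inside the window, the whole window is returned, which is correct for a single-line file but truncated for a last line longer than the assumed bound.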

Answer 4

Answer by mik01aj

Here's a modified version of SilentGhost's answer that will do what you want.

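import os

with open(fname, 'rb') as fh:
    first = next(fh)
    size = fh.seek(0, os.SEEK_END)   # File size in bytes.
    offs = 100
    while True:
        # Clamp the offset so we never seek before the start of the file.
        fh.seek(-min(offs, size), os.SEEK_END)
        lines = fh.readlines()
        # Stop once a full last line is visible, or the whole file was read.
        if len(lines) > 1 or offs >= size:
            last = lines[-1]
            break
        offs *= 2                    # Otherwise double the window and retry.
    print(first)
    print(last)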

No need for an upper bound for line length here.

Answer 5

Answer by Trasp

To read both the first and final line of a file you could...

open the file for reading, ...
read the first line using the builtin readline(), ...
seek to the end of the file, ...
step backwards until you find the EOL that ends the second-to-last line and ...
read the last line from there.

def readlastline(f):
    f.seek(-2, 2)              # Jump to the second last byte.
    while f.read(1) != b"\n":  # Until EOL is found...
        f.seek(-2, 1)          # ...jump back, over the read byte plus one more.
    return f.read()            # Read everything from here to the end.

with open(file, "rb") as f:
    first = f.readline()
    last = readlastline(f)

Jump directly to the second-to-last byte so that a trailing newline character does not prevent the last line from being read.

The current offset moves forward by one every time a byte is read, so stepping backwards is done two bytes at a time: past the byte just read and the byte to read next.

The whence parameter passed to f.seek(offset, whence=0) indicates that seek should move to a position offset bytes relative to...

0 or os.SEEK_SET = The beginning of the file.
1 or os.SEEK_CUR = The current position.
2 or os.SEEK_END = The end of the file.
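
The three whence values can be tried out on an in-memory buffer. This small demonstration uses io.BytesIO with made-up content in place of a real file; the buffer name buf is arbitrary.

```python
import io
import os

buf = io.BytesIO(b"0123456789")

buf.seek(3, os.SEEK_SET)      # 3 bytes from the beginning.
assert buf.read(1) == b"3"    # Reading advances the position to 4.

buf.seek(2, os.SEEK_CUR)      # 2 bytes forward from the current position (4).
assert buf.read(1) == b"6"

buf.seek(-2, os.SEEK_END)     # 2 bytes before the end.
assert buf.read(1) == b"8"
```

Note that each read(1) moves the position forward by one, which is why the backwards loop above has to step by two.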

Efficiency

1-2 million lines each and I have to do this for several hundred files.

I timed this method and compared it against the top answer.

10k iterations processing a file of 6k lines totalling 200kB: 1.62s vs 6.92s.
100 iterations processing a file of 6k lines totalling 1.3GB: 8.93s vs 86.95s.

Millions of lines would increase the difference a lot more.

Exact code used for timing:

with open(file, "rb") as f:
    first = f.readline()     # Read and store the first line.
    for last in f: pass      # Read all lines, keep final value.
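
A comparison of this kind could be reproduced along the following lines. This is only a sketch: the helper names, the generated sample file, and the iteration count are assumptions, and the figures will differ from those quoted above depending on hardware and file contents. The seek-based function assumes the file has at least two lines.

```python
import os
import tempfile
import timeit

def last_by_seek(path):
    """Step backwards from the end until the previous newline is found."""
    with open(path, "rb") as f:
        f.seek(-2, os.SEEK_END)
        while f.read(1) != b"\n":
            f.seek(-2, os.SEEK_CUR)
        return f.read()

def last_by_loop(path):
    """Read every line and keep the final one."""
    with open(path, "rb") as f:
        last = f.readline()
        for last in f:
            pass
        return last

def make_sample(n_lines=6000):
    """Write a hypothetical sample file and return its path."""
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        for i in range(n_lines):
            tmp.write(b"2010-01-01 00:00:00 line %d\n" % i)
        return tmp.name

def compare(path, number=100):
    """Return (seek_seconds, loop_seconds) for `number` runs of each method."""
    seek_t = timeit.timeit(lambda: last_by_seek(path), number=number)
    loop_t = timeit.timeit(lambda: last_by_loop(path), number=number)
    return seek_t, loop_t
```

Calling compare(make_sample()) returns the two timings as a tuple; remember to delete the sample file afterwards.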

Amendment

A more complex, and harder to read, variation that addresses comments and issues raised since.

Return an empty string when parsing an empty file, raised by comment.
Return all content when no delimiter is found, raised by comment.
Avoid relative offsets to support text mode, raised by comment.
UTF16/UTF32 hack, raised by comment.

+ Support for multibyte delimiters, readlast(b'X<br>Y', b'<br>', fixed=False).

#!/bin/python3

from os import SEEK_END

def readlast(f, sep, fixed=True):
    """Read the last segment from a file-like object.

    :param f: File to read last line from.
    :type  f: file-like object
    :param sep: Segment separator (delimiter).
    :type  sep: bytes, str
    :param fixed: Treat data in ``f`` as a chain of fixed size blocks.
    :type  fixed: bool
    :returns: Last line of file.
    :rtype  : bytes, str
    """
    bs   = len(sep)
    step = bs if fixed else 1
    if not bs:
        raise ValueError("Zero-length separator.")
    try:
        o = f.seek(0, SEEK_END)
        o = f.seek(o-bs-step)    # - Ignore trailing delimiter 'sep'.
        while f.read(bs) != sep: # - Until reaching 'sep': Read data, seek past
            o = f.seek(o-step)   #  read data *and* the data to read next.
    except (OSError,ValueError): # - Beginning of file reached.
        f.seek(0)
    return f.read()

def test_readlast():
    from io import BytesIO, StringIO

    # Text mode.
    f = StringIO("first\nlast\n")
    assert readlast(f, "\n") == "last\n"

    # Bytes.
    f = BytesIO(b'first|last')
    assert readlast(f, b'|') == b'last'

    # Bytes, UTF-8.
    f = BytesIO("X\nY\n".encode("utf-8"))
    assert readlast(f, b'\n').decode() == "Y\n"

    # Bytes, UTF-16.
    f = BytesIO("X\nY\n".encode("utf-16"))
    assert readlast(f, b'\n\x00').decode('utf-16') == "Y\n"

    # Bytes, UTF-32.
    f = BytesIO("X\nY\n".encode("utf-32"))
    assert readlast(f, b'\n\x00\x00\x00').decode('utf-32') == "Y\n"

    # Multichar delimiter.
    f = StringIO("X<br>Y")
    assert readlast(f, "<br>", fixed=False) == "Y"

    # Make sure you use the correct delimiters.
    seps = { 'utf8': b'\n', 'utf16': b'\n\x00', 'utf32': b'\n\x00\x00\x00' }
    assert "\n".encode('utf8' )     == seps['utf8']
    assert "\n".encode('utf16')[2:] == seps['utf16']
    assert "\n".encode('utf32')[4:] == seps['utf32']

    # Edge cases.
    edges = (
        # Text , Match
        (""    , ""  ), # Empty file, empty string.
        ("X"   , "X" ), # No delimiter, full content.
        ("\n"  , "\n"),
        ("\n\n", "\n"),
        # UTF16/32 encoded U+270A (b"\n\x00\n'\n\x00"/utf16)
        (b'\n\xe2\x9c\x8a\n'.decode(), b'\xe2\x9c\x8a\n'.decode()),
    )
    for txt, match in edges:
        for enc,sep in seps.items():
            assert readlast(BytesIO(txt.encode(enc)), sep).decode(enc) == match

if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        with open(path) as f:
            print(f.readline()    , end="")
            print(readlast(f,"\n"), end="")

Answer 6

Answer by Srinivasreddy Jakkireddy

First open the file in read mode. Then use the readlines() method to read all the lines into a list. Now you can index the list to get the first and last lines of the file.

with open('file.txt', 'rb') as a:
    lines = a.readlines()      # Reads the whole file into memory.
if lines:
    first_line = lines[0]      # Index, not slice, to get the line itself.
    last_line = lines[-1]

Answer 7

Answer by VipeR

w = open('file.txt', 'r')
print('first line is : ', w.readline())
x = ''                         # Stays empty if the file has only one line.
for line in w:
    x = line
print('last line is : ', x)
w.close()

The for loop runs through the lines, and x holds the last line after the final iteration.

Answer 8

Answer by Riccardo Volpe

with open("myfile.txt") as f:
    lines = f.readlines()      # Loads every line into memory.
    first_row = lines[0]
    print(first_row)
    last_row = lines[-1]
    print(last_row)

Answer 9

Answer by Marco Sulla

This is my solution, which is also compatible with Python 3. It also handles edge cases, but it lacks UTF-16 support:

import os

def tail(filepath):
    """
    @author Marco Sulla ([email protected])
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

            if start_pos == 0:
                f.seek(start_pos)
            else:
                char = ""

                for pos in range(start_pos, -1, -1):
                    f.seek(pos)

                    char = f.read(1)

                    if char == b"\n":
                        break
                else:
                    f.seek(0)  # No newline found: the file is a single line.

        return f.readline()

It's inspired by Trasp's answer and AnotherParker's comment.

Answer 10

Answer by tony_tiger

Here is an extension of @Trasp's answer that has additional logic for handling the corner case of a file that has only one line. Handling this case may be useful if you repeatedly want to read the last line of a file that is continuously being updated. Without it, if you try to grab the last line of a file that has just been created and has only one line, IOError: [Errno 22] Invalid argument will be raised.

def tail(filepath):
    with open(filepath, "rb") as f:
        first = f.readline()      # Read the first line.
        f.seek(-2, 2)             # Jump to the second last byte.
        while f.read(1) != b"\n": # Until EOL is found...
            try:
                f.seek(-2, 1)     # ...jump back the read byte plus one more.
            except IOError:
                f.seek(-1, 1)
                if f.tell() == 0:
                    break
        last = f.readline()       # Read last line.
    return last
