How to efficiently parse fixed-width files in Python?

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/4914008/


How to efficiently parse fixed width files?

Tags: python, parsing

Asked by hyperboreean

I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21 to 30 another one, and so on.

Assuming that each line holds 100 characters, what would be an efficient way to parse a line into several components?

I could use string slicing per line, but it's a little bit ugly if the line is big. Are there any other fast methods?

Accepted answer by martineau

Using the Python standard library's struct module would be fairly easy as well as extremely fast since it's written in C.

Here's how it could be used to do what you want. It also allows columns of characters to be skipped by specifying negative values for the number of characters in the field.

import struct

fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                        for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from
print('fmtstring: {!r}, recsize: {} chars'.format(fmtstring, fieldstruct.size))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', recsize: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

The following modifications would adapt it to work in Python 2 or 3 (and handle Unicode input):

import sys

fieldstruct = struct.Struct(fmtstring)
if sys.version_info[0] < 3:
    parse = fieldstruct.unpack_from
else:
    # converts unicode input to byte string and results back to unicode string
    unpack = fieldstruct.unpack_from
    parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))
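
With the wrapper above in place, the same call works on a regular string in Python 3; a minimal usage sketch (my addition, not from the original answer):

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)  # the lambda encodes the line to bytes and decodes the results
print(fields)         # ('AB', 'MNOPQRSTUVWXYZ0123456789')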

Here's a way to do it with string slices, as you were considering but were concerned might get too ugly. The nice thing about it, besides not being all that ugly, is that it works unchanged in both Python 2 and 3, as well as being able to handle Unicode strings. I haven't benchmarked it, but suspect it might be competitive with the struct module version speed-wise. It could be sped up slightly by removing the ability to have padding fields.

try:
    from itertools import izip_longest  # added in Py 2.6
except ImportError:
    from itertools import zip_longest as izip_longest  # name change in Py 3.x

try:
    from itertools import accumulate  # added in Py 3.2
except ImportError:
    def accumulate(iterable):
        'Return running totals (simplified version).'
        total = next(iterable)
        yield total
        for value in iterable:
            total += value
            yield total

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields
    flds = tuple(izip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # optional informational function attributes
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Answer by Reiner Gerecke

I'm not really sure if this is efficient, but it should be readable (as opposed to doing the slicing manually). I defined a function slices that gets a string and column lengths, and returns the substrings. I made it a generator, so for really long lines, it doesn't build a temporary list of substrings.

def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

Example

In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']

In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']

In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')

But I think the advantage of a generator is lost if you need all columns at once. Where one could benefit from it is when you want to process columns one by one, say in a loop.
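
A minimal sketch of that lazy, column-by-column use (my addition; line is any fixed-width record, and handle_column is a hypothetical per-column handler):

for column in slices(line, 20, 10, 70):
    handle_column(column)  # each substring is produced only when it is requested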

Answer by John Machin

The code below gives a sketch of what you might want to do if you have some serious fixed-column-width file handling to do.

"Serious" = multiple record types in each of multiple file types, records up to 1000 bytes, the layout-definer and "opposing" producer/consumer is a government department with attitude, layout changes result in unused columns, up to a million records in a file, ...

Features: Precompiles the struct formats. Ignores unwanted columns. Converts input strings to required data types (sketch omits error handling). Converts records to object instances (or dicts, or named tuples if you prefer).

Code:

import struct, datetime, cStringIO, pprint  # Python 2 code (print statements, cStringIO)

# functions for converting input fields to usable data
cnv_text = str.rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y") # ddmmyyyy
# etc

# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
    ]

fieldspecs.sort(key=lambda x: x[1]) # just in case

# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print unpack_len, unpack_fmt
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass
    # or use named tuples

raw_data = """\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""

f = cStringIO.StringIO(raw_data)
headings = f.next()
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
    pprint.pprint(r.__dict__)
    print "Customer name:", r.given_names, r.surname

Output:

78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
 'given_names': 'Algernon Marmaduke',
 'start_date': datetime.datetime(2005, 1, 1, 0, 0),
 'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh
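
A hedged sketch (my addition, not in the original answer) of the "or use named tuples" option mentioned above, replacing the Record class and the setattr loop:

from collections import namedtuple

Record = namedtuple('Record', [fs[0] for fs in fieldspecs])
for line in f:
    raw_fields = unpacker(line)
    # fs[3] is each field's converter function, applied to its raw string
    r = Record._make(fs[3](raw) for fs, raw in zip(fieldspecs, raw_fields))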

Answer by sten

> s = '1234567890'  # renamed from str to avoid shadowing the builtin
> w = [0, 2, 5, 7, 10]
> [s[w[i-1]:w[i]] for i in range(1, len(w))]
['12', '345', '67', '890']

Answer by Tom M

Two more options that are easier and prettier than the solutions already mentioned:

The first is using pandas:

import pandas as pd

path = 'filename.txt'

# Using Pandas with a column specification
col_specification = [(0, 20), (21, 30), (31, 50), (51, 100)]
data = pd.read_fwf(path, colspecs=col_specification)

And the second option, using numpy.loadtxt:

import numpy as np

# Using NumPy and letting it figure it out automagically
data_also = np.loadtxt(path)

It really depends on the way you want to use your data.
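
Note that np.loadtxt splits on whitespace rather than on fixed column positions; for true fixed-width parsing, numpy's genfromtxt accepts a sequence of field widths as its delimiter. A hedged sketch (my addition, not part of the original answer):

import numpy as np

# a sequence of ints as delimiter is interpreted as fixed field widths
data = np.genfromtxt(path, delimiter=[20, 10, 20, 50], dtype=None, encoding='utf-8')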

Answer by Brian Burns

Here's a simple module for Python 3, based on John Machin's answer; adapt as needed :)

"""
fixedwidth

Parse and iterate through a fixedwidth text file, returning record objects.

Adapted from https://stackoverflow.com/a/4916375/243392


USAGE

    import fixedwidth, pprint

    # define the fixed width fields we want
    # fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define the fieldtype conversion functions
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': int,
    }

    # iterate over record objects in the file
    with open(path, 'rb') as f:
        for record in fixedwidth.reader(f, fieldspecs, fieldtype_fns):
            pprint.pprint(record.__dict__)

    # output:
    {'FILEID': 'SF1ST', 'LOGRECNO': 2, 'POP100': 1, 'STUSAB': 'TX', 'SUMLEV': '040'}
    {'FILEID': 'SF1ST', 'LOGRECNO': 3, 'POP100': 2, 'STUSAB': 'TX', 'SUMLEV': '040'}    
    ...

"""

import struct, io


# fieldspec columns
iName, iDescription, iStart, iWidth, iType = range(5)


def get_struct_unpacker(fieldspecs):
    """
    Build the format string for struct.unpack to use, based on the fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    Returns a string like "6s2s3s7x7s4x9s".
    """
    unpack_len = 0
    unpack_fmt = ""
    for fieldspec in fieldspecs:
        start = fieldspec[iStart] - 1
        end = start + fieldspec[iWidth]
        if start > unpack_len:
            unpack_fmt += str(start - unpack_len) + "x"
        unpack_fmt += str(end - start) + "s"
        unpack_len = end
    struct_unpacker = struct.Struct(unpack_fmt).unpack_from
    return struct_unpacker


class Record(object):
    pass
    # or use named tuples


def reader(f, fieldspecs, fieldtype_fns):
    """
    Wrap a fixedwidth file and return records according to the given fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldtype_fns is a dictionary of functions used to transform the raw string values, 
    one for each type.
    """

    # make sure fieldspecs are sorted properly
    fieldspecs.sort(key=lambda fieldspec: fieldspec[iStart])

    struct_unpacker = get_struct_unpacker(fieldspecs)

    field_indices = range(len(fieldspecs))

    for line in f:
        raw_fields = struct_unpacker(line) # split line into field values
        record = Record()
        for i in field_indices:
            fieldspec = fieldspecs[i]
            fieldname = fieldspec[iName]
            s = raw_fields[i].decode() # convert raw bytes to a string
            fn = fieldtype_fns[fieldspec[iType]] # get conversion function
            value = fn(s) # convert string to value (eg to an int)
            setattr(record, fieldname, value)
        yield record


if __name__=='__main__':

    # test module

    import pprint, io

    # define the fields we want
    # fieldspecs are [name, description, start, width, type]
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define a conversion function for integers
    def to_int(s):
        """
        Convert a numeric string to an integer.
        Allows a leading ! as an indicator of missing or uncertain data.
        Returns None if no data.
        """
        try:
            return int(s)
        except:
            try:
                return int(s[1:]) # ignore a leading !
            except:
                return None # assume has a leading ! and no value

    # define the conversion fns
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': to_int,
        # 'N': int,
        # 'D': lambda s: datetime.datetime.strptime(s, "%d%m%Y"), # ddmmyyyy
        # etc
    }

    # define a fixedwidth sample
    sample = """\
SF1ST TX04089000  00000023748        1 
SF1ST TX04090000  00000033748!       2
SF1ST TX04091000  00000043748!        
"""
    sample_data = sample.encode() # convert string to bytes
    file_like = io.BytesIO(sample_data) # create a file-like wrapper around bytes

    # iterate over record objects in the file
    for record in reader(file_like, fieldspecs, fieldtype_fns):
        # print(record)
        pprint.pprint(record.__dict__)

Answer by YeO

Here is what NumPy uses under the hood (much, much simplified, but still: this code is found in the LineSplitter class within the _iotools module):

import numpy as np

DELIMITER = (20, 10, 10, 20, 10, 10, 20)

idx = np.cumsum([0] + list(DELIMITER))
slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]

def parse(line):
    return [line[s] for s in slices]

It does not handle negative delimiters for ignoring columns, so it is not as versatile as struct, but it is faster.
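
A quick usage sketch (my addition) against the widths defined above; slicing past the end of a short line simply yields empty strings:

line = 'A' * 20 + 'B' * 10 + 'C' * 10
print(parse(line))
# ['AAAAAAAAAAAAAAAAAAAA', 'BBBBBBBBBB', 'CCCCCCCCCC', '', '', '', '']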

Answer by MatrixManAtYrService

String slicing doesn't have to be ugly as long as you keep it organized. Consider storing your field widths in a dictionary and then using the associated names to create an object:

from collections import OrderedDict

class Entry:
    def __init__(self, line):

        name2width = OrderedDict()
        name2width['foo'] = 2
        name2width['bar'] = 3
        name2width['baz'] = 2

        pos = 0
        for name, width in name2width.items():

            val = line[pos : pos + width]
            if len(val) != width:
                raise ValueError("not enough characters: \'{}\'".format(line))

            setattr(self, name, val)
            pos += width

data = "ab789yz\ncd987wx\nef555uv"  # renamed from file/entry for clarity

entries = []

for line in data.split('\n'):
    entries.append(Entry(line))

print(entries[1].bar) # output: 987

Answer by notback

Because my old job often handled a million lines of fixed-width data, I researched this issue when I started using Python.

There are two types of fixed width:

  1. ASCII FixedWidth (ASCII character length = 1, double-byte encoded character length = 2)
  2. Unicode FixedWidth (ASCII character and double-byte encoded character length = 1)

If the source string is composed entirely of ASCII characters, then ASCII FixedWidth = Unicode FixedWidth.

Fortunately, string and bytes are different in Python 3, which reduces a lot of confusion when dealing with double-byte encoded characters (e.g. gbk, big5, euc-jp, shift-jis, etc.).
For the processing of "ASCII FixedWidth", the string is usually converted to bytes and then split.
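
A minimal sketch (my addition) of why byte-level slicing matters for ASCII fixed-width layouts under a double-byte encoding such as GBK:

s = '中文AB'
b = s.encode('gbk')
print(len(s), len(b))  # 4 characters, but 6 bytes: each Chinese character is 2 bytes in GBK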

Without importing third-party modules, with totalLineCount = 1 million, lineLength = 800 bytes, and FixedWidthArgs = (10, 25, 4, ...), I split the line in about 5 ways and reached the following conclusions:

  1. struct is the fastest (1x)
  2. Looping only, without pre-processing FixedWidthArgs, is the slowest (5x+)
  3. slice(bytes) is faster than slice(string)
  4. With a bytes source string, the test results are: struct (1x), operator.itemgetter (1.7x), precompiled slice objects & list comprehensions (2.8x), re.pattern object (2.9x)

When dealing with large files, we often use with open(file, "rb") as f:.
Traversing one of the above files this way takes about 2.4 seconds.
So I think an appropriate handler that processes 1 million rows of data and splits each row into 20 fields should take less than 2.4 seconds.
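
A sketch of that traversal pattern (my addition; path and parse are placeholders for your file and whichever parser you choose):

with open(path, "rb") as f:
    for lineb in f:            # iterate the file lazily, line by line
        fields = parse(lineb)  # split one fixed-width record into fields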

I found that only struct and itemgetter meet that requirement.

P.S.: for normal display, I converted the unicode str to bytes. If you are in a double-byte environment, you don't need to do this.

from itertools import accumulate
from operator import itemgetter

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    # Negative parameter field index
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    # Get slice args and Ignore fields of negative length
    ig_Args = tuple(item for i, item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    # Generate `operator.itemgetter` object
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

lineb = b'abcdefghijklmnopqrstuvwxyz\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4\xb6\xee\xb7\xa2\xb8\xf6\xba\xcd0123456789'
line = lineb.decode("GBK")

# Unicode Fixed Width
fieldwidthsU = (13, -13, 4, -4, 5,-5) # Negative width fields is ignored
# ASCII Fixed Width
fieldwidths = (13, -13, 8, -8, 5,-5) # Negative width fields is ignored
# Unicode FixedWidth processing
parse = oprt_parser(fieldwidthsU)
fields = parse(line)
print('Unicode FixedWidth','fields: {}'.format(tuple(map(lambda s: s.encode("GBK"), fields))))
# ASCII FixedWidth processing
parse = oprt_parser(fieldwidths)
fields = parse(lineb)
print('ASCII FixedWidth','fields: {}'.format(fields))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)
parse = oprt_parser(fieldwidths)
fields = parse(line)
print(f"fields: {fields}")

Output:

Unicode FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
ASCII FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

oprt_parser is 4x faster than make_parser (list comprehensions + slice).



During the research, it was found that the faster the CPU, the more the efficiency of the re method seems to improve.
Since I don't have more and better computers to test with, I'm providing my test code; if anyone is interested, you can test it with a faster computer.

Run Environment:

  • OS: win10
  • Python: 3.7.2
  • CPU: AMD Athlon X3 450
  • HD: Seagate 1T

import timeit
import time
import re
from itertools import accumulate
from operator import itemgetter

def eff2(stmt,onlyNum= False,showResult=False):
    '''test function'''
    if onlyNum:
        rl = timeit.repeat(stmt=stmt,repeat=roundI,number=timesI,globals=globals())
        avg = sum(rl) / len(rl)
        return f"{avg * (10 ** 6)/timesI:0.4f}"
    else:
        rl = timeit.repeat(stmt=stmt,repeat=10,number=1000,globals=globals())
        avg = sum(rl) / len(rl)
        print(f"【{stmt}】")
        print(f"\tquick avg = {avg * (10 ** 6)/1000:0.4f} s/million")
        if showResult:
            print(f"\t  Result = {eval(stmt)}\n\t  timelist = {rl}\n")
        else:
            print("")

def upDouble(argList,argRate):
    return [c*argRate for c in argList]

tbStr = "000000001111000002222真2233333333000000004444444QAZ55555555000000006666666ABC这些事中文字abcdefghijk"
tbBytes = tbStr.encode("GBK")
a20 = (4,4,2,2,2,3,2,2, 2 ,2,8,8,7,3,8,8,7,3, 12 ,11)
a20U = (4,4,2,2,2,3,2,2, 1 ,2,8,8,7,3,8,8,7,3, 6 ,11)
Slng = 800
rateS = Slng // 100

tStr = "".join(upDouble(tbStr , rateS))
tBytes = tStr.encode("GBK")
spltArgs = upDouble( a20 , rateS)
spltArgsU = upDouble( a20U , rateS)

testList = []
timesI = 100000
roundI = 5
print(f"test round = {roundI} timesI = {timesI} sourceLng = {len(tStr)} argFieldCount = {len(spltArgs)}")


print(f"pure str \n{''.ljust(60,'-')}")
# ==========================================
def str_parser(sArgs):
    def prsr(oStr):
        r = []
        r_ap = r.append
        stt=0
        for lng in sArgs:
            end = stt + lng 
            r_ap(oStr[stt:end])
            stt = end 
        return tuple(r)
    return prsr

Str_P = str_parser(spltArgsU)
# eff2("Str_P(tStr)")
testList.append("Str_P(tStr)")

print(f"pure bytes \n{''.ljust(60,'-')}")
# ==========================================
def byte_parser(sArgs):
    def prsr(oBytes):
        r, stt = [], 0
        r_ap = r.append
        for lng in sArgs:
            end = stt + lng
            r_ap(oBytes[stt:end])
            stt = end
        return r
    return prsr
Byte_P = byte_parser(spltArgs)
# eff2("Byte_P(tBytes)")
testList.append("Byte_P(tBytes)")

# re,bytes
print(f"re compile object \n{''.ljust(60,'-')}")
# ==========================================


def rebc_parser(sArgs,otype="b"):
    re_Args = "".join([f"(.{{{n}}})" for n in sArgs])
    if otype == "b":
        rebc_Args = re.compile(re_Args.encode("GBK"))
    else:
        rebc_Args = re.compile(re_Args)
    def prsr(oBS):
        return rebc_Args.match(oBS).groups()
    return prsr
Rebc_P = rebc_parser(spltArgs)
# eff2("Rebc_P(tBytes)")
testList.append("Rebc_P(tBytes)")

Rebc_Ps = rebc_parser(spltArgsU,"s")
# eff2("Rebc_Ps(tStr)")
testList.append("Rebc_Ps(tStr)")


print(f"struct \n{''.ljust(60,'-')}")
# ==========================================

import struct
def struct_parser(sArgs):
    struct_Args = " ".join(map(lambda x: str(x) + "s", sArgs))
    def prsr(oBytes):
        return struct.unpack(struct_Args, oBytes)
    return prsr
Struct_P = struct_parser(spltArgs)
# eff2("Struct_P(tBytes)")
testList.append("Struct_P(tBytes)")

print(f"List Comprehensions + slice \n{''.ljust(60,'-')}")
# ==========================================
import itertools
def slice_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    slice_Args = tuple(zip((0,)+tl,tl))
    def prsr(oBytes):
        return [oBytes[s:e] for s, e in slice_Args]
    return prsr
Slice_P = slice_parser(spltArgs)
# eff2("Slice_P(tBytes)")
testList.append("Slice_P(tBytes)")

def sliceObj_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    tl2 = tuple(zip((0,)+tl,tl))
    sliceObj_Args = tuple(slice(s,e) for s,e in tl2)
    def prsr(oBytes):
        return [oBytes[so] for so in sliceObj_Args]
    return prsr
SliceObj_P = sliceObj_parser(spltArgs)
# eff2("SliceObj_P(tBytes)")
testList.append("SliceObj_P(tBytes)")

SliceObj_Ps = sliceObj_parser(spltArgsU)
# eff2("SliceObj_Ps(tStr)")
testList.append("SliceObj_Ps(tStr)")


print(f"operator.itemgetter + slice object \n{''.ljust(60,'-')}")
# ==========================================

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    ig_Args = tuple(item for i,item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

Oprt_P = oprt_parser(spltArgs)
# eff2("Oprt_P(tBytes)")
testList.append("Oprt_P(tBytes)")

Oprt_Ps = oprt_parser(spltArgsU)
# eff2("Oprt_Ps(tStr)")
testList.append("Oprt_Ps(tStr)")

print("|".join([s.split("(")[0].center(11," ") for s in testList]))
print("|".join(["".center(11,"-") for s in testList]))
print("|".join([eff2(s,True).rjust(11," ") for s in testList]))

Output:

test round = 5 timesI = 100000 sourceLng = 744 argFieldCount = 20
...
...
   Str_P   |  Byte_P   |  Rebc_P   |  Rebc_Ps  | Struct_P  |  Slice_P  |SliceObj_P |SliceObj_Ps|  Oprt_P   |  Oprt_Ps  
-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------
     9.6315|     7.5952|     4.4187|     5.6867|     1.5123|     5.2915|     4.2673|     5.7121|     2.4713|     3.9051

Answer by vlyalcin

This is how I solved it, with a dictionary that contains where fields start and end. Giving start and end points also helped me manage changes in column length.

# fixed length
#      '---------- ------- ----------- -----------'
line = '20.06.2019 myname  active      mydevice   '
SLICES = {'date_start': 0,
          'date_end': 10,
          'name_start': 11,
          'name_end': 18,
          'status_start': 19,
          'status_end': 30,
          'device_start': 31,
          'device_end': 42}

def get_values_as_dict(line, SLICES):
    values = {}
    key_list = {key.split("_")[0] for key in SLICES.keys()}
    for key in key_list:
        values[key] = line[SLICES[key+"_start"]:SLICES[key+"_end"]].strip()
    return values

>>> print (get_values_as_dict(line,SLICES))
{'status': 'active', 'name': 'myname', 'date': '20.06.2019', 'device': 'mydevice'}
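
A hedged helper (my addition, not from the original answer) that builds the same start/end dictionary from an ordered list of (name, width) pairs instead of hand-coding the offsets; it assumes exactly one separator space between fields, as in the sample line above:

def make_slices(widths):
    slices, pos = {}, 0
    for name, width in widths:
        slices[name + '_start'] = pos
        slices[name + '_end'] = pos + width
        pos += width + 1  # skip the single separator space (assumption)
    return slices

SLICES = make_slices([('date', 10), ('name', 7), ('status', 11), ('device', 11)])
# produces the same offsets as the hand-written dictionary above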