Parsing data from a text file in Python

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/17105456/

Time: 2020-08-19 00:28:24

Parsing data from text file

Tags: python, file, parsing

Asked by Roman Rdgz

I have a text file that has content like this:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

Then comes an empty line, followed by more similar blocks, all of them with the same data fields.

I am porting some C++ code to Python, and one part of it reads the file line by line, detects the title line, and then detects each field label to extract the data. This doesn't look like smart code at all, and I think Python must have some library to parse data like this easily. After all, it almost looks like a CSV!

Any idea for this?

Accepted answer by Martijn Pieters

It is very far from CSV, actually.

You can use the file as an iterator; the following generator function yields complete sections:

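def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # Skip lines until the next '****' header is found
            while not line.startswith('****'):
                line = next(infile, None)  # None signals the end of the file
                if line is None:
                    # Return explicitly: since Python 3.7 (PEP 479) a StopIteration
                    # escaping a generator body becomes a RuntimeError
                    return

            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break  # a blank line ends the section

                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value

            yield entry
            line = ''  # forget the old header so the search starts afresh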

This treats the file as an iterator, meaning that any loop over it (and any call to next()) advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work: the while loop skips lines until a **** header line is found (the header itself is then discarded), and the for loop collects all the non-empty lines that follow into one section.
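
The shared-position behaviour described above can be seen in isolation with a small sketch. It assumes an in-memory io.StringIO object as a stand-in for an open text file (both iterate line by line); the sample text and variable names are made up for illustration.

```python
import io

# io.StringIO stands in for an open text file; both are line iterators.
infile = io.StringIO("header\nalpha\nbeta\n\ngamma\n")

first = next(infile)      # consumes the first line, "header\n"

collected = []
for line in infile:       # continues from "alpha", not from the top
    line = line.strip()
    if not line:
        break             # the blank line has been consumed; stop here
    collected.append(line)

rest = list(infile)       # only what follows the blank line is left

print(first.strip(), collected, rest)
```

Because next() and the for loop draw from the same iterator, no line is ever seen twice, which is what lets the generator alternate between "skip to header" and "read fields" without tracking a position itself.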

Use the function in a loop:

for section in load_sections(filename):
    print(section)

Repeating your sample data in a text file results in:

>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
... 
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}

You can add some data converters to that if you want to; a mapping of key to callable would do:

converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

Then, in the generator function, instead of entry[key] = value, do entry[key] = converters.get(key, lambda v: v)(value).
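
Put together, the converter idea might look like the sketch below. It is not the answer's exact code: the name load_sections_converted is hypothetical, and the function takes any iterable of lines (rather than a filename) so that it can be tried without a file on disk. Keys without a converter keep their string value.

```python
def load_sections_converted(lines, converters):
    """Yield one dict per section, converting values where a converter is known."""
    lines = iter(lines)
    line = ''
    while True:
        # Skip ahead to the next '****' header; stop cleanly at end of input
        while not line.startswith('****'):
            line = next(lines, None)
            if line is None:
                return
        entry = {}
        for line in lines:
            line = line.strip()
            if not line:
                break  # a blank line ends the section
            key, value = map(str.strip, line.split(':', 1))
            # Fall back to the identity function for unknown keys
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry
        line = ''  # forget the old header before searching for the next one


converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

sample = [
    "******** ENTRY 01 ********\n",
    "ID:                  01\n",
    "Data1:               0.1834869385E-002\n",
    "Data2:              10.9598489301\n",
    "Data3:              -0.1091356549E+001\n",
    "Data4:                715\n",
    "\n",
]

sections = list(load_sections_converted(sample, converters))
print(sections)
```

With an open file, the same call would be load_sections_converted(infile, converters), since a file object is itself an iterable of lines.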

Answer by Peter Varo

my_file:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

ID:                  02
Data1:               0.18348674325E-012
Data2:              10.9598489301
Data3:              0.0
Data4:                5748

ID:                  03
Data1:               20.1834869385E-002
Data2:              10.954576354
Data3:              10.13476858762435E+001
Data4:                7456

Python script:

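import re

with open('my_file', 'r') as f:
    data = []
    group = {}
    # Each match is a (key, value) pair; the header line has no colon, so it never matches.
    # The hyphen sits last in the character class so it is a literal '-', not a range.
    for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
        if key in group:
            # A repeated key means a new block has started
            data.append(group)
            group = {}
        group[key] = value
    data.append(group)  # store the last block

print(data)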

Printed output:

[
    {
        'Data4': '715',
        'Data1': '0.1834869385E-002',
        'ID': '01',
        'Data3': '-0.1091356549E+001',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '5748',
        'Data1': '0.18348674325E-012',
        'ID': '02',
        'Data3': '0.0',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '7456',
        'Data1': '20.1834869385E-002',
        'ID': '03',
        'Data3': '10.13476858762435E+001',
        'Data2': '10.954576354'
    }
]
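
To see what the regular expression in this script picks out, here is a small sketch on an in-memory string. The FIELD and sample names are illustrative only, and the hyphen is written last in the character class so that it is read as a literal minus sign.

```python
import re

# Key: everything before the colon; value: a run of digits, 'E', '+', '.' and '-'.
FIELD = re.compile(r'(.*):\s*([\dE+.-]+)')

sample = (
    "******** ENTRY 01 ********\n"
    "ID:                  01\n"
    "Data3:              -0.1091356549E+001\n"
)

# The header line contains no colon, so it produces no match at all.
pairs = FIELD.findall(sample)
print(pairs)
```

Since '.' does not match a newline by default, each match stays within a single line, which is why reading the whole file with f.read() still yields one pair per field.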

Answer by 6502

A very simple approach could be:

all_objects = []

with open("datafile") as f:
    for L in f:
        if L[:3] == "***":
            # Line starts with asterisks, create a new object
            all_objects.append({})
        elif ":" in L:
            # Line is a key/value field, update current object
            k, v = map(str.strip, L.split(":", 1))
            all_objects[-1][k] = v