Parsing data from a text file in Python

Question

Asked by Roman Rdgz

I have a text file that has content like this:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

This is followed by an empty line and then more blocks of the same form, all of them with the same data fields.

I am porting some C++ code to Python. One part of it reads the file line by line, detects the title line, and then matches each field label to extract the data. This doesn't look like smart code at all, and I think Python must have some library to parse data like this easily. After all, it almost looks like a CSV!

Any idea for this?

Answer 1

Accepted answer by Martijn Pieters

It is very far from CSV, actually.

You can use the file as an iterator; the following generator function yields complete sections:

def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # skip lines until the next '****' header is found
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return  # end of file: stop the generator cleanly

            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break  # a blank line ends the section

                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value

            yield entry
            line = ''  # force a fresh header search for the next section

This treats the file as an iterator, meaning that any looping advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work: first skip lines until a **** header line is found (the header itself is otherwise discarded), then loop over all non-empty lines to build a section.
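To see the shared-iterator behaviour in isolation, here is a minimal sketch (Python 3). It uses io.StringIO as a stand-in for an open file; the sample text and the names sample, header, fields and remaining are made up for illustration and are not part of the answer's code.

```python
import io

# io.StringIO stands in for an open text file (an assumption for this demo).
sample = io.StringIO(
    "junk before the header\n"
    "**** ENTRY 01 ****\n"
    "ID: 01\n"
    "Data4: 715\n"
    "\n"
    "**** ENTRY 02 ****\n"
)

# First loop: consume lines until a header is found.
header = None
for line in sample:
    if line.startswith('****'):
        header = line.strip()
        break

# Second loop: resumes on the line right after the header,
# because both loops pull from the same iterator.
fields = []
for line in sample:
    line = line.strip()
    if not line:
        break  # the blank line ends the section
    fields.append(line)

# Whatever neither loop consumed is still waiting in the iterator.
remaining = list(sample)
```

Neither loop rewinds the file, so the second loop never sees the junk line or the header again, and the second header is still available for the next pass.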

Use the function in a loop:

for section in load_sections(filename):
    print(section)

Repeating your sample data in a text file results in the following (the key order shown is from Python 2; Python 3.7+ prints keys in insertion order):

>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
... 
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}

You can add some data converters to that if you want to; a mapping of key to callable would do:

converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}

Then, in the generator function, instead of entry[key] = value, do entry[key] = converters.get(key, lambda v: v)(value).
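Put together, the converter idea might look like the sketch below (Python 3). It is a variant of the generator, not the answer's exact code: the name iter_sections is hypothetical, and it takes any iterable of lines instead of a filename so that it can be tried on an in-memory sample.

```python
import io

# Mapping of field name to converter, as suggested above.
converters = {'ID': int, 'Data1': float, 'Data2': float,
              'Data3': float, 'Data4': int}

def iter_sections(lines, converters=converters):
    """Yield one dict per '****' section from an iterable of lines."""
    lines = iter(lines)
    for line in lines:
        if not line.startswith('****'):
            continue  # skip anything outside a section
        entry = {}
        for line in lines:  # same iterator: resumes after the header
            line = line.strip()
            if not line:
                break  # a blank line ends the section
            key, value = map(str.strip, line.split(':', 1))
            # Keys without a converter keep their raw string value.
            entry[key] = converters.get(key, lambda v: v)(value)
        yield entry

# In-memory stand-in for the data file (an assumption for this demo).
sample = io.StringIO(
    "******** ENTRY 01 ********\n"
    "ID:                  01\n"
    "Data1:               0.1834869385E-002\n"
    "Data2:              10.9598489301\n"
    "Data3:              -0.1091356549E+001\n"
    "Data4:                715\n"
)
sections = list(iter_sections(sample))
```

Python's float() accepts the exponent notation used in the file directly, and int() accepts the zero-padded ID, so no extra cleanup of the values should be needed for this sample.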

Answer 2

Answer by Peter Varo

my_file:

******** ENTRY 01 ********
ID:                  01
Data1:               0.1834869385E-002
Data2:              10.9598489301
Data3:              -0.1091356549E+001
Data4:                715

ID:                  02
Data1:               0.18348674325E-012
Data2:              10.9598489301
Data3:              0.0
Data4:                5748

ID:                  03
Data1:               20.1834869385E-002
Data2:              10.954576354
Data3:              10.13476858762435E+001
Data4:                7456

Python script:

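import re

with open('my_file', 'r') as f:
    data  = list()
    group = dict()
    # each match is a (key, value) pair such as ('Data1', '0.1834869385E-002');
    # '-' is placed last in the character class so it is a literal, not a range
    for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
        if key in group:
            # a repeated key marks the start of the next block
            data.append(group)
            group = dict()
        group[key] = value
    data.append(group)

print(data)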

Printed output:

[
    {
        'Data4': '715',
        'Data1': '0.1834869385E-002',
        'ID': '01',
        'Data3': '-0.1091356549E+001',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '5748',
        'Data1': '0.18348674325E-012',
        'ID': '02',
        'Data3': '0.0',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '7456',
        'Data1': '20.1834869385E-002',
        'ID': '03',
        'Data3': '10.13476858762435E+001',
        'Data2': '10.954576354'
    }
]

Answer 3

Answer by 6502

A very simple approach could be

all_objects = []

with open("datafile") as f:
    for L in f:
        if L[:3] == "***":
            # Line starts with asterisks, create a new object
            all_objects.append({})
        elif ":" in L:
            # Line is a key/value field, update current object
            k, v = map(str.strip, L.split(":", 1))
            all_objects[-1][k] = v
