Python: parsing data from a text file
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/17105456/
Parsing data from text file
Asked by Roman Rdgz
I have a text file that has content like this:
******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
After that comes an empty line, followed by more blocks like this one, all with the same data fields.
I am porting some C++ code to Python. One part of it reads the file line by line, detects the title line, and then matches each field label to extract the data. That does not look like smart code at all, and I think Python must have some library that parses data like this easily. After all, it almost looks like CSV!
Any ideas?
Accepted answer by Martijn Pieters
It is very far from CSV, actually.
You can use the file as an iterator; the following generator function yields complete sections:
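def load_sections(filename):
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # Skip lines until the next '****' header is found.
            while not line.startswith('****'):
                # A bare next(infile) would raise StopIteration at end of file,
                # which Python 3.7+ turns into a RuntimeError inside a generator.
                line = next(infile, None)
                if line is None:
                    return  # end of file ends the generator
            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break  # a blank line closes the section
                key, value = map(str.strip, line.split(':', 1))
                entry[key] = value
            yield entry
            line = ''  # restart the header search after each section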
This treats the file as an iterator, meaning that any looping advances the file to the next line. The outer loop only serves to move from section to section; the inner while and for loops do all the real work: first skip lines until a **** header line is found (the skipped lines are discarded), then loop over all non-empty lines to build a section.
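To illustrate the point about shared iteration, here is a minimal sketch. It uses an in-memory io.StringIO with made-up contents in place of a real file (an assumption made only to keep the snippet self-contained). Both loops pull from the same iterator, so the second loop continues where the first one stopped.

```python
import io

# Stand-in for an open text file; the contents are invented for illustration.
infile = io.StringIO("skip me\n**** HEADER ****\nA: 1\nB: 2\n\ntrailing\n")

# First loop: consume lines until the header is found.
for line in infile:
    if line.startswith('****'):
        break

# Second loop: resumes right after the header, not at the start of the file.
section_lines = []
for line in infile:
    line = line.strip()
    if not line:
        break  # the blank line ends the section
    section_lines.append(line)

remaining = infile.read()  # whatever neither loop consumed

print(section_lines)
print(repr(remaining))
```

Neither loop rewinds the file, which is why the generator above never needs to track a position.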
Use the function in a loop:
for section in load_sections(filename):
    print(section)
Repeating your sample data in a text file results in the following (output recorded on Python 2; on Python 3.7+ the keys print in insertion order):
>>> for section in load_sections('/tmp/test.txt'):
...     print(section)
...
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
{'Data4': '715', 'Data1': '0.1834869385E-002', 'ID': '01', 'Data3': '-0.1091356549E+001', 'Data2': '10.9598489301'}
You can add some data converters to that if you want to; a mapping of key to callable would do:
converters = {'ID': int, 'Data1': float, 'Data2': float, 'Data3': float, 'Data4': int}
Then, in the generator function, instead of entry[key] = value, do entry[key] = converters.get(key, lambda v: v)(value).
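Put together, the converter idea might look like the sketch below. The function name load_typed_sections and the temporary file are assumptions made for illustration and are not part of the original answer; the end-of-file handling uses next(infile, None) so that it also runs on Python 3.7+.

```python
import os
import tempfile

# Mapping of field name to a callable that converts the raw string.
converters = {'ID': int, 'Data1': float, 'Data2': float,
              'Data3': float, 'Data4': int}


def load_typed_sections(filename):
    """Yield one dict per '****' section, converting values via `converters`."""
    with open(filename, 'r') as infile:
        line = ''
        while True:
            # Skip ahead to the next header line; stop cleanly at end of file.
            while not line.startswith('****'):
                line = next(infile, None)
                if line is None:
                    return
            entry = {}
            for line in infile:
                line = line.strip()
                if not line:
                    break  # a blank line closes the section
                key, value = map(str.strip, line.split(':', 1))
                # Keys without a converter fall back to the identity function.
                entry[key] = converters.get(key, lambda v: v)(value)
            yield entry
            line = ''  # restart the header search after each section


sample = """******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
"""

# Write two copies of the sample block to a temporary file and parse it.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample + "\n" + sample)
try:
    sections = list(load_typed_sections(tmp.name))
finally:
    os.remove(tmp.name)

print(sections[0])
```

Each section now holds int and float values instead of strings, and any field missing from converters is kept as the original string.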
Answer by Peter Varo
my_file:
******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715
ID: 02
Data1: 0.18348674325E-012
Data2: 10.9598489301
Data3: 0.0
Data4: 5748
ID: 03
Data1: 20.1834869385E-002
Data2: 10.954576354
Data3: 10.13476858762435E+001
Data4: 7456
Python script:
import re

with open('my_file', 'r') as f:
    data = list()
    group = dict()
    # The hyphen goes last in the character class so it is read literally.
    for key, value in re.findall(r'(.*):\s*([\dE+.-]+)', f.read()):
        if key in group:
            # A repeated key means a new block has started.
            data.append(group)
            group = dict()
        group[key] = value
    data.append(group)

print(data)
Printed output:
[
    {
        'Data4': '715',
        'Data1': '0.1834869385E-002',
        'ID': '01',
        'Data3': '-0.1091356549E+001',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '5748',
        'Data1': '0.18348674325E-012',
        'ID': '02',
        'Data3': '0.0',
        'Data2': '10.9598489301'
    },
    {
        'Data4': '7456',
        'Data1': '20.1834869385E-002',
        'ID': '03',
        'Data3': '10.13476858762435E+001',
        'Data2': '10.954576354'
    }
]
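As a variation on the same regex idea, the sketch below converts the values to numbers while grouping. It is only an illustration under stated assumptions: the to_number helper and the line-anchored pattern are hypothetical additions rather than part of the original answer, and the sample text is embedded as a string instead of being read from my_file.

```python
import re

text = """******** ENTRY 01 ********
ID: 01
Data1: 0.1834869385E-002
Data2: 10.9598489301
Data3: -0.1091356549E+001
Data4: 715

ID: 02
Data1: 0.18348674325E-012
Data2: 10.9598489301
Data3: 0.0
Data4: 5748
"""


def to_number(raw):
    """Return an int when possible, otherwise a float (hypothetical helper)."""
    try:
        return int(raw)
    except ValueError:
        return float(raw)


# With re.MULTILINE, ^ and $ match at each line; keys are word characters only,
# so the asterisk header line is never matched.
pattern = re.compile(r'^(\w+):[ \t]*([\dE+.-]+)[ \t]*$', re.MULTILINE)

data = []
group = {}
for key, value in pattern.findall(text):
    if key in group:
        # A repeated key means a new block has started.
        data.append(group)
        group = {}
    group[key] = to_number(value)
if group:
    data.append(group)

print(data)
```

Like the original, this relies on a key repeating to detect the start of a new block, so it assumes every block carries the same fields.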
Answer by 6502
A very simple approach could be:
all_objects = []
with open("datafile") as f:
    for L in f:
        if L[:3] == "***":
            # Line starts with asterisks, create a new object
            all_objects.append({})
        elif ":" in L:
            # Line is a key/value field, update current object
            k, v = map(str.strip, L.split(":", 1))
            all_objects[-1][k] = v
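The same approach can be wrapped in a function that accepts any iterable of lines, which makes it easy to try without a file on disk. This is a sketch under assumptions: the name parse_blocks, the in-memory sample, and the extra guard that ignores key/value lines appearing before the first header are additions for illustration.

```python
import io


def parse_blocks(lines):
    """Collect one dict per block from an iterable of text lines."""
    all_objects = []
    for raw in lines:
        if raw.startswith("***"):
            # Header line: start a new object.
            all_objects.append({})
        elif ":" in raw and all_objects:
            # Key/value line: update the current object. Lines that appear
            # before any header are skipped instead of raising IndexError.
            k, v = map(str.strip, raw.split(":", 1))
            all_objects[-1][k] = v
    return all_objects


# Invented sample data; an open file object could be passed in the same way.
sample = io.StringIO(
    "stray: ignored\n"
    "******** ENTRY 01 ********\n"
    "ID: 01\n"
    "Data4: 715\n"
    "\n"
    "******** ENTRY 02 ********\n"
    "ID: 02\n"
    "Data4: 5748\n"
)

blocks = parse_blocks(sample)
print(blocks)
```

Unlike the regex version, this one depends on every block having its own asterisk header line.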

