How to parse complex text files using Python (pandas)?
Note: this page mirrors a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/47982949/
Asked by bluprince13
I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my current method.
Is there any way to make it more concise/faster/more pythonic/more readable?
I've also put this question on Code Review.
I eventually wrote a blog article to explain this to beginners.
Here is a sample file:
Sample text

A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.

School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0
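To run the snippets below against this file, save the block above verbatim as sample.txt; a minimal sketch, where sample_text is assumed to hold the block above copied into a Python string:

from pathlib import Path

# sample_text is assumed to contain the sample block above, copied verbatim
Path('sample.txt').write_text(sample_text)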
Here is what I want the result to look like after parsing:
                                      Name  Score
School         Grade Student number
Hogwarts       1     0                   Ginny      8
                     1                    Luna      7
               2     0                   Harry      5
                     1                Hermione     10
               3     0                    Fred      0
                     1                  George      0
Riverdale High 1     0                  Phoebe      3
                     1                  Rachel      7
               2     0                  Angela      6
                     1                 Tristan      3
                     2                  Aurora      9
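This target shape is just a flat table of records moved onto a three-level MultiIndex; given a flat DataFrame df with the five columns used throughout the answers below, it could be produced with a sketch like:

result = (df.set_index(['School', 'Grade', 'Student number'])
            .sort_index())
print(result)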
Here is how I currently parse it:
import re
import pandas as pd


def parse(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file to be parsed

    Returns
    -------
    data : pd.DataFrame
        Parsed data
    """
    data = []
    with open(filepath, 'r') as file:
        line = file.readline()
        while line:
            reg_match = _RegExLib(line)

            if reg_match.school:
                school = reg_match.school.group(1)

            if reg_match.grade:
                grade = reg_match.grade.group(1)
                grade = int(grade)

            if reg_match.name_score:
                value_type = reg_match.name_score.group(1)
                line = file.readline()
                while line.strip():
                    number, value = line.strip().split(',')
                    value = value.strip()
                    dict_of_data = {
                        'School': school,
                        'Grade': grade,
                        'Student number': number,
                        value_type: value
                    }
                    data.append(dict_of_data)
                    line = file.readline()

            line = file.readline()

    data = pd.DataFrame(data)
    data.set_index(['School', 'Grade', 'Student number'], inplace=True)
    # consolidate df to remove nans
    data = data.groupby(level=data.index.names).first()
    # upgrade Score from float to integer
    data = data.apply(pd.to_numeric, errors='ignore')
    return data


class _RegExLib:
    """Set up regular expressions"""
    # use https://regexper.com to visualise these if required
    _reg_school = re.compile('School = (.*)\n')
    _reg_grade = re.compile('Grade = (.*)\n')
    _reg_name_score = re.compile('(Name|Score)')

    def __init__(self, line):
        # check whether line has a positive match with all of the regular expressions
        self.school = self._reg_school.match(line)
        self.grade = self._reg_grade.match(line)
        self.name_score = self._reg_name_score.search(line)


if __name__ == '__main__':
    filepath = 'sample.txt'
    data = parse(filepath)
    print(data)
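Running this against the sample reproduces the frame above, and individual groups can then be pulled straight out of the MultiIndex, for example:

df = parse('sample.txt')
# select a single school/grade slice from the MultiIndex
print(df.loc[('Hogwarts', 3)])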
Answered by Jan
Update 2019 (PEG parser):
This answer has received quite some attention, so I felt I should add another possibility, namely a parsing option. Here we could use a PEG parser instead (e.g. parsimonious) in combination with a NodeVisitor class:
from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import pandas as pd

grammar = Grammar(
    r"""
    schools         = (school_block / ws)+
    school_block    = school_header ws grade_block+
    grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws?

    school_header   = ~"^School = (.*)"m
    grade_header    = ~"^Grade = (\d+)"m
    name_header     = "Student number, Name"
    score_header    = "Student number, Score"

    number_name     = index comma name ws
    number_score    = index comma score ws

    comma           = ws? "," ws?

    index           = number+
    score           = number+
    number          = ~"\d+"
    name            = ~"[A-Z]\w+"
    ws              = ~"\s*"
    """
)

# data is assumed to hold the raw contents of the sample file as a single string
tree = grammar.parse(data)

class SchoolVisitor(NodeVisitor):
    output, names = ([], [])
    current_school, current_grade = None, None

    def _getName(self, idx):
        for index, name in self.names:
            if index == idx:
                return name

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_school_header(self, node, children):
        self.current_school = node.match.group(1)

    def visit_grade_header(self, node, children):
        self.current_grade = node.match.group(1)
        self.names = []

    def visit_number_name(self, node, children):
        index, name = None, None
        for child in node.children:
            if child.expr.name == 'name':
                name = child.text
            elif child.expr.name == 'index':
                index = child.text
        self.names.append((index, name))

    def visit_number_score(self, node, children):
        index, score = None, None
        for child in node.children:
            if child.expr.name == 'index':
                index = child.text
            elif child.expr.name == 'score':
                score = child.text
        name = self._getName(index)
        # build the entire entry
        entry = (self.current_school, self.current_grade, index, name, score)
        self.output.append(entry)

sv = SchoolVisitor()
sv.visit(tree)

df = pd.DataFrame.from_records(sv.output, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)
Regex option (original answer)
Well then, watching Lord of the Rings for the xth time, I had to bridge some time until the very finale:
Breaking it down, the idea is to split the problem up into several smaller problems:
- Separate each school
- ... each grade
- ... student and scores
- ... bind them together in a dataframe afterwards
The school part (see a demo on regex101.com)
^
School\s*=\s*(?P<school_name>.+)
(?P<school_content>[\s\S]+?)
(?=^School|\Z)
The grade part (another demo on regex101.com)
^
Grade\s*=\s*(?P<grade>.+)
(?P<students>[\s\S]+?)
(?=^Grade|\Z)
The student/score part (last demo on regex101.com):
^
Student\ number,\ Name[\n\r]
(?P<student_names>(?:^\d+.+[\n\r])+)
\s*
^
Student\ number,\ Score[\n\r]
(?P<student_scores>(?:^\d+.+[\n\r])+)
The rest is a generator expression which is then fed into the DataFrame constructor (along with the column names).
The code:
import pandas as pd, re

rx_school = re.compile(r'''
    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_grade = re.compile(r'''
    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_student_score = re.compile(r'''
    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)

# string is assumed to hold the raw contents of the sample file as a single string
result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
          for school in rx_school.finditer(string)
          for grade in rx_grade.finditer(school.group('school_content'))
          for student_score in rx_student_score.finditer(grade.group('students'))
          for student in zip(student_score.group('student_names')[:-1].split("\n"),
                             student_score.group('student_scores')[:-1].split("\n"))
          for student_number in [student[0].split(", ")[0]]
          for name in [student[0].split(", ")[1]]
          for score in [student[1].split(", ")[1]]
          )

df = pd.DataFrame(result, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)
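To sanity-check the outermost pattern on its own (again, string is assumed to hold the raw text of the sample file):

for school in rx_school.finditer(string):
    print(school.group('school_name'))
# Riverdale High
# Hogwarts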
Condensed:
rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)
This yields:
School Grade Student number Name Score
0 Riverdale High 1 0 Phoebe 3
1 Riverdale High 1 1 Rachel 7
2 Riverdale High 2 0 Angela 6
3 Riverdale High 2 1 Tristan 3
4 Riverdale High 2 2 Aurora 9
5 Hogwarts 1 0 Ginny 8
6 Hogwarts 1 1 Luna 7
7 Hogwarts 2 0 Harry 5
8 Hogwarts 2 1 Hermione 10
9 Hogwarts 3 0 Fred 0
10 Hogwarts 3 1 George 0
As for the timing, this is the result of running it ten thousand times:
import timeit

# makedf is assumed to wrap the DataFrame construction above in a zero-argument function
print(timeit.timeit(makedf, number=10**4))
# 11.918397722000009 s
Answered by Ezer K
Here is my suggestion using split and pd.concat ("txt" stands for a copy of the original text in the question). Basically, the idea is to split on the group keywords and then concat the pieces into data frames; the innermost parsing takes advantage of the fact that the names and scores are in a CSV-like format. Here goes:
import pandas as pd
from io import StringIO

schools = txt.lower().split('school = ')
schools_dfs = []
for school in schools[1:]:
    grades = school.split('grade = ')
    grades_dfs = []
    for grade in grades[1:]:
        features = grade.split('student number,')
        feature_dfs = []
        for feature in features[1:]:
            feature_dfs.append(pd.read_csv(StringIO(feature)))
        feature_df = pd.concat(feature_dfs, axis=1)
        feature_df['grade'] = features[0].replace('\n', '')
        grades_dfs.append(feature_df)
    grades_df = pd.concat(grades_dfs)
    grades_df['school'] = grades[0].replace('\n', '')
    schools_dfs.append(grades_df)
schools_df = pd.concat(schools_dfs)
# set_index returns a new DataFrame; assign the result if you want to keep the index
schools_df.set_index(['school', 'grade'])
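Getting txt in the first place is just a matter of reading the file, e.g. assuming the sample has been saved as sample.txt:

with open('sample.txt') as fh:
    txt = fh.read()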
Answered by spookylukey
I would suggest using a parser combinator library like parsy. Compared to using regexes, the result will not be as concise, but it will be much more readable and robust, while still being relatively light-weight.
Parsing is in general quite a hard task, and an approach that is suitable for beginners in general programming can be hard to find.
EDIT:
Here is some actual example code that does minimal parsing of your supplied example. It does not pass the data to pandas, or even match up names to scores, or students to grades, etc. - it just returns a hierarchy of objects starting with School at the top, with the relevant attributes as you would expect:
from parsy import string, regex, seq
import attr


@attr.s
class Student():
    name = attr.ib()
    number = attr.ib()


@attr.s
class Score():
    score = attr.ib()
    number = attr.ib()


@attr.s
class Grade():
    grade = attr.ib()
    students = attr.ib()
    scores = attr.ib()


@attr.s
class School():
    name = attr.ib()
    grades = attr.ib()


integer = regex(r"\d+").map(int)

student_number = integer
score = integer
student_name = regex(r"[^\n]+")

student_def = seq(student_number.tag('number') << string(", "),
                  student_name.tag('name') << string("\n")).combine_dict(Student)
student_def_list = string("Student number, Name\n") >> student_def.many()

score_def = seq(student_number.tag('number') << string(", "),
                score.tag('score') << string("\n")).combine_dict(Score)
score_def_list = string("Student number, Score\n") >> score_def.many()

grade_value = integer
grade_def = string("Grade = ") >> grade_value << string("\n")
school_grade = seq(grade_def.tag('grade'),
                   student_def_list.tag('students') << regex(r"\n*"),
                   score_def_list.tag('scores') << regex(r"\n*")
                   ).combine_dict(Grade)

school_name = regex(r"[^\n]+")
school_def = string("School = ") >> school_name << string("\n")
school = seq(school_def.tag('name'),
             school_grade.many().tag('grades')
             ).combine_dict(School)


def parse(text):
    return school.many().parse(text)
This is much more verbose than a regex solution, but much closer to a declarative definition of your file format.
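A quick way to exercise it, assuming the two descriptive preamble lines have been stripped from the sample so that the text starts at the first "School = " line and ends with a trailing newline (the filename here is hypothetical):

with open('schools_only.txt') as fh:
    schools = parse(fh.read())

print(schools[0].name)                     # 'Riverdale High'
print(schools[0].grades[0].grade)          # 1
print(schools[0].grades[0].students[0])    # Student(name='Phoebe', number=0)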
Answered by Mike Robins
In a similar manner to your original code, I define the parsing regexes:
import re
from functools import reduce

import pandas as pd

parse_re = {
    'school': re.compile(r'School = (?P<school>.*)$'),
    'grade': re.compile(r'Grade = (?P<grade>\d+)'),
    'student': re.compile(r'Student number, (?P<info>\w+)'),
    'data': re.compile(r'(?P<number>\d+), (?P<value>.*)$'),
}


def parse(line):
    '''parse the line by regex search against possible line formats
    returning the id and match result of first matching regex,
    or None if no match is found'''
    return reduce(lambda acc, item: acc if acc[1] else (item[0], item[1].search(line)),
                  parse_re.items(), (None, None))
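For a single input line, parse returns the key of the first pattern that matches together with its match object, for example:

id, match = parse('School = Hogwarts\n')
print(id, match.group('school'))   # school Hogwarts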
Then loop through the lines, gathering the information about each student. Once the record is complete (that is, once we have the Score), we append the record to a list.
A small state machine, driven by the line-by-line regex matches, collates each record. In particular, we have to save the students in a grade by number, as their Name and Score are provided separately in the input file.
results = []
with open('sample.txt') as f:
    record = {}
    for line in f:
        id, match = parse(line)
        if match is None:
            continue
        if id == 'school':
            record['School'] = match.group('school')
        elif id == 'grade':
            record['Grade'] = int(match.group('grade'))
            names = {}  # names is a number indexed dictionary of student names
        elif id == 'student':
            info = match.group('info')
        elif id == 'data':
            number = int(match.group('number'))
            value = match.group('value')
            if info == 'Name':
                names[number] = value
            elif info == 'Score':
                record['Student number'] = number
                record['Name'] = names[number]
                record['Score'] = int(value)
                results.append(record.copy())
Finally, the list of records is converted to a DataFrame.
df = pd.DataFrame(results, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)
Outputs:
School Grade Student number Name Score
0 Riverdale High 1 0 Phoebe 3
1 Riverdale High 1 1 Rachel 7
2 Riverdale High 2 0 Angela 6
3 Riverdale High 2 1 Tristan 3
4 Riverdale High 2 2 Aurora 9
5 Hogwarts 1 0 Ginny 8
6 Hogwarts 1 1 Luna 7
7 Hogwarts 2 0 Harry 5
8 Hogwarts 2 1 Hermione 10
9 Hogwarts 3 0 Fred 0
10 Hogwarts 3 1 George 0
Some optimizations would be to try the most common regexes first and to explicitly skip blank lines. Building the dataframe as we go would avoid extra copies of the data, but I gather that appending to a dataframe row by row is an expensive operation.
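To illustrate the first point: since dictionaries preserve insertion order in Python 3.7+, trying the most common pattern first is just a matter of reordering parse_re; a sketch (the four patterns are mutually exclusive on this input, so the parsed result is unchanged):

# the '0, Phoebe' style data lines are by far the most common, so test them first
parse_re = {
    'data': re.compile(r'(?P<number>\d+), (?P<value>.*)$'),
    'student': re.compile(r'Student number, (?P<info>\w+)'),
    'grade': re.compile(r'Grade = (?P<grade>\d+)'),
    'school': re.compile(r'School = (?P<school>.*)$'),
}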