pandas 如何使用 Python 解析复杂的文本文件?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/47982949/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 04:58:09  来源:igfitidea点击:

How to parse complex text files using Python?

pythonregexpandasparsing

提问by bluprince13

I'm looking for a simple way of parsing complex text files into a pandas DataFrame. Below is a sample file, what I want the result to look like after parsing, and my current method.

我正在寻找一种将复杂文本文件解析为 Pandas DataFrame 的简单方法。下面是一个示例文件,我希望解析后的结果是什么样的,以及我当前的方法。

Is there any way to make it more concise/faster/more pythonic/more readable?

有没有办法让它更简洁/更快/更pythonic/更易读?

I've also put this question on Code Review.

我也把这个问题放在Code Review 上

I eventually wrote a blog article to explain this to beginners.

我最终写了一篇博客文章向初学者解释这一点

Here is a sample file:

这是一个示例文件:

Sample text

A selection of students from Riverdale High and Hogwarts took part in a quiz. This is a record of their scores.

School = Riverdale High
Grade = 1
Student number, Name
0, Phoebe
1, Rachel

Student number, Score
0, 3
1, 7

Grade = 2
Student number, Name
0, Angela
1, Tristan
2, Aurora

Student number, Score
0, 6
1, 3
2, 9

School = Hogwarts
Grade = 1
Student number, Name
0, Ginny
1, Luna

Student number, Score
0, 8
1, 7

Grade = 2
Student number, Name
0, Harry
1, Hermione

Student number, Score
0, 5
1, 10

Grade = 3
Student number, Name
0, Fred
1, George

Student number, Score
0, 0
1, 0

Here is what I want the result to look like after parsing:

这是我希望解析后的结果:

                                         Name  Score
School         Grade Student number                 
Hogwarts       1     0                  Ginny      8
                     1                   Luna      7
               2     0                  Harry      5
                     1               Hermione     10
               3     0                   Fred      0
                     1                 George      0
Riverdale High 1     0                 Phoebe      3
                     1                 Rachel      7
               2     0                 Angela      6
                     1                Tristan      3
                     2                 Aurora      9

Here is how I currently parse it:

这是我目前解析它的方式:

import re
import pandas as pd


def parse(filepath):
    """
    Parse text at given filepath

    Parameters
    ----------
    filepath : str
        Filepath for file to be parsed

    Returns
    -------
    data : pd.DataFrame
        Parsed data

    """

    data = []
    with open(filepath, 'r') as file:
        line = file.readline()
        while line:
            reg_match = _RegExLib(line)

            if reg_match.school:
                school = reg_match.school.group(1)

            if reg_match.grade:
                grade = reg_match.grade.group(1)
                grade = int(grade)

            if reg_match.name_score:
                value_type = reg_match.name_score.group(1)
                line = file.readline()
                while line.strip():
                    number, value = line.strip().split(',')
                    value = value.strip()
                    dict_of_data = {
                        'School': school,
                        'Grade': grade,
                        'Student number': number,
                        value_type: value
                    }
                    data.append(dict_of_data)
                    line = file.readline()

            line = file.readline()

        data = pd.DataFrame(data)
        data.set_index(['School', 'Grade', 'Student number'], inplace=True)
        # consolidate df to remove nans
        data = data.groupby(level=data.index.names).first()
        # upgrade Score from float to integer
        data = data.apply(pd.to_numeric, errors='ignore')
    return data


class _RegExLib:
    """Set up regular expressions"""
    # use https://regexper.com to visualise these if required
    _reg_school = re.compile('School = (.*)\n')
    _reg_grade = re.compile('Grade = (.*)\n')
    _reg_name_score = re.compile('(Name|Score)')

    def __init__(self, line):
        # check whether line has a positive match with all of the regular expressions
        self.school = self._reg_school.match(line)
        self.grade = self._reg_grade.match(line)
        self.name_score = self._reg_name_score.search(line)


if __name__ == '__main__':
    filepath = 'sample.txt'
    data = parse(filepath)
    print(data)

回答by Jan

Update 2019 (PEG parser):

2019 年更新(PEG 解析器):

This answer has received quite some attention so I felt to add another possibility, namely a parsing option. Here we could use a PEGparser instead (e.g. parsimonious) in combination with a NodeVisitorclass:

这个答案受到了相当多的关注,所以我想添加另一种可能性,即解析选项。在这里,我们可以使用PEG解析器代替(例如parsimonious)与NodeVisitor类结合使用:

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor
import pandas as pd
grammar = Grammar(
    r"""
    schools         = (school_block / ws)+

    school_block    = school_header ws grade_block+ 
    grade_block     = grade_header ws name_header ws (number_name)+ ws score_header ws (number_score)+ ws? 

    school_header   = ~"^School = (.*)"m
    grade_header    = ~"^Grade = (\d+)"m
    name_header     = "Student number, Name"
    score_header    = "Student number, Score"

    number_name     = index comma name ws
    number_score    = index comma score ws

    comma           = ws? "," ws?

    index           = number+
    score           = number+

    number          = ~"\d+"
    name            = ~"[A-Z]\w+"
    ws              = ~"\s*"
    """
)

tree = grammar.parse(data)

class SchoolVisitor(NodeVisitor):
    output, names = ([], [])
    current_school, current_grade = None, None

    def _getName(self, idx):
        for index, name in self.names:
            if index == idx:
                return name

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_school_header(self, node, children):
        self.current_school = node.match.group(1)

    def visit_grade_header(self, node, children):
        self.current_grade = node.match.group(1)
        self.names = []

    def visit_number_name(self, node, children):
        index, name = None, None
        for child in node.children:
            if child.expr.name == 'name':
                name = child.text
            elif child.expr.name == 'index':
                index = child.text

        self.names.append((index, name))

    def visit_number_score(self, node, children):
        index, score = None, None
        for child in node.children:
            if child.expr.name == 'index':
                index = child.text
            elif child.expr.name == 'score':
                score = child.text

        name = self._getName(index)

        # build the entire entry
        entry = (self.current_school, self.current_grade, index, name, score)
        self.output.append(entry)

sv = SchoolVisitor()
sv.visit(tree)

df = pd.DataFrame.from_records(sv.output, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)


Regex option (original answer)

正则表达式选项(原始答案)

Well then, watching Lord of the Rings the xth time, I had to bridge some time to the very finale:

那么,第 x 次观看指环王,我不得不为大结局留出一些时间:



分解,我们的想法是将问题分解为几个较小的问题:

  1. Separate each school
  2. ... each grade
  3. ... student and scores
  4. ... bind them together in a dataframe afterwards
  1. 每个学校分开
  2. ...每个年级
  3. ...学生和分数
  4. ... 之后将它们绑定到一个数据框中



学校部分(见 a demo on regex101.comregex101.com 上的演示

^
School\s*=\s*(?P<school_name>.+)
(?P<school_content>[\s\S]+?)
(?=^School|\Z)



成绩部分(another demo on regex101.comregex101.com 上的另一个演示

^
Grade\s*=\s*(?P<grade>.+)
(?P<students>[\s\S]+?)
(?=^Grade|\Z)



学生/分数部分(last demo on regex101.comregex101.com 上的最后一个演示):

^
Student\ number,\ Name[\n\r]
(?P<student_names>(?:^\d+.+[\n\r])+)
\s*
^
Student\ number,\ Score[\n\r]
(?P<student_scores>(?:^\d+.+[\n\r])+)

The rest is a generator expression which is then fed into the DataFrameconstructor (along with the column names).

其余的是一个生成器表达式,然后将其送入DataFrame构造函数(连同列名)。



编码:

import pandas as pd, re

rx_school = re.compile(r'''
    ^
    School\s*=\s*(?P<school_name>.+)
    (?P<school_content>[\s\S]+?)
    (?=^School|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_grade = re.compile(r'''
    ^
    Grade\s*=\s*(?P<grade>.+)
    (?P<students>[\s\S]+?)
    (?=^Grade|\Z)
''', re.MULTILINE | re.VERBOSE)

rx_student_score = re.compile(r'''
    ^
    Student\ number,\ Name[\n\r]
    (?P<student_names>(?:^\d+.+[\n\r])+)
    \s*
    ^
    Student\ number,\ Score[\n\r]
    (?P<student_scores>(?:^\d+.+[\n\r])+)
''', re.MULTILINE | re.VERBOSE)


result = ((school.group('school_name'), grade.group('grade'), student_number, name, score)
    for school in rx_school.finditer(string)
    for grade in rx_grade.finditer(school.group('school_content'))
    for student_score in rx_student_score.finditer(grade.group('students'))
    for student in zip(student_score.group('student_names')[:-1].split("\n"), student_score.group('student_scores')[:-1].split("\n"))
    for student_number in [student[0].split(", ")[0]]
    for name in [student[0].split(", ")[1]]
    for score in [student[1].split(", ")[1]]
)

df = pd.DataFrame(result, columns = ['School', 'Grade', 'Student number', 'Name', 'Score'])
print(df)



浓缩:

rx_school = re.compile(r'^School\s*=\s*(?P<school_name>.+)(?P<school_content>[\s\S]+?)(?=^School|\Z)', re.MULTILINE)
rx_grade = re.compile(r'^Grade\s*=\s*(?P<grade>.+)(?P<students>[\s\S]+?)(?=^Grade|\Z)', re.MULTILINE)
rx_student_score = re.compile(r'^Student number, Name[\n\r](?P<student_names>(?:^\d+.+[\n\r])+)\s*^Student number, Score[\n\r](?P<student_scores>(?:^\d+.+[\n\r])+)', re.MULTILINE)



这产生

            School Grade Student number      Name Score
0   Riverdale High     1              0    Phoebe     3
1   Riverdale High     1              1    Rachel     7
2   Riverdale High     2              0    Angela     6
3   Riverdale High     2              1   Tristan     3
4   Riverdale High     2              2    Aurora     9
5         Hogwarts     1              0     Ginny     8
6         Hogwarts     1              1      Luna     7
7         Hogwarts     2              0     Harry     5
8         Hogwarts     2              1  Hermione    10
9         Hogwarts     3              0      Fred     0
10        Hogwarts     3              1    George     0



至于 timing时间,这是运行一万次的结果:

import timeit
print(timeit.timeit(makedf, number=10**4))
# 11.918397722000009 s


回答by Ezer K

here is my suggestion using split and pd.concat ("txt" stands for a copy of the original text in the question), basicly the idea is to split by the group words and then concat into data frames, the most inner parsing takes advantage of the fact that the names and grades are in a csv like format. here goes:

这是我的建议使用 split 和 pd.concat (“txt”代表问题中原始文本的副本),基本上的想法是按组词拆分然后连接到数据帧中,最内部的解析利用事实上,姓名和成绩采用类似 csv 的格式。开始:

import pandas as pd
from io import StringIO

schools = txt.lower().split('school = ')
schools_dfs = []
for school in schools[1:]:
    grades = school.split('grade = ') 
    grades_dfs = []
    for grade in grades[1:]:
        features = grade.split('student number,')
        feature_dfs = []
        for feature in features[1:]:
            feature_dfs.append(pd.read_csv(StringIO(feature)))
        feature_df = pd.concat(feature_dfs, axis=1)
        feature_df['grade'] = features[0].replace('\n','')
        grades_dfs.append(feature_df)
    grades_df = pd.concat(grades_dfs)
    grades_df['school'] = grades[0].replace('\n','')
    schools_dfs.append(grades_df)
schools_df = pd.concat(schools_dfs)

schools_df.set_index(['school', 'grade'])

enter image description here

在此处输入图片说明

回答by spookylukey

I would suggest using a parser combinator library like parsy. Compared to using regexes, the result will not be as concise, but it will be much more readable and robust, while still being relatively light-weight.

我建议使用像parsy这样的解析器组合器库。与使用正则表达式相比,结果不会那么简洁,但会更具可读性和健壮性,同时仍然相对轻量级。

Parsing is in general quite a hard task, and an approach that is good for people at beginner level for general programming might be hard to find.

解析通常是一项非常艰巨的任务,并且可能很难找到适合初学者级别的通用编程人员的方法。

EDIT: Some actual example code that does minimal parsing of your supplied example. It does not pass to pandas, or even match up names to scores, or students to grades etc. - it just returns a hierarchy of objects starting with Schoolat the top, with the relevant attributes as you would expect:

编辑:一些实际的示例代码,对您提供的示例进行最少的解析。它不会传递给Pandas,甚至不会将姓名与分数、学生与成绩等进行匹配——它只返回从School顶部开始的对象层次结构,以及您所期望的相关属性:

from parsy import string, regex, seq
import attr


@attr.s
class Student():
    name = attr.ib()
    number = attr.ib()


@attr.s
class Score():
    score = attr.ib()
    number = attr.ib()


@attr.s
class Grade():
    grade = attr.ib()
    students = attr.ib()
    scores = attr.ib()


@attr.s
class School():
    name = attr.ib()
    grades = attr.ib()


integer = regex(r"\d+").map(int)
student_number = integer
score = integer
student_name = regex(r"[^\n]+")
student_def = seq(student_number.tag('number') << string(", "),
                  student_name.tag('name') << string("\n")).combine_dict(Student)
student_def_list = string("Student number, Name\n") >> student_def.many()
score_def = seq(student_number.tag('number') << string(", "),
                score.tag('score') << string("\n")).combine_dict(Score)
score_def_list = string("Student number, Score\n") >> score_def.many()
grade_value = integer
grade_def = string("Grade = ") >> grade_value << string("\n")
school_grade = seq(grade_def.tag('grade'),
                   student_def_list.tag('students') << regex(r"\n*"),
                   score_def_list.tag('scores') << regex(r"\n*")
                   ).combine_dict(Grade)

school_name = regex(r"[^\n]+")
school_def = string("School = ") >> school_name << string("\n")
school = seq(school_def.tag('name'),
             school_grade.many().tag('grades')
             ).combine_dict(School)


def parse(text):
    return school.many().parse(text)

This is much more verbose than a regex solution, but much closer to a declarative definition of your file format.

这比正则表达式解决方案要冗长得多,但更接近于文件格式的声明性定义。

回答by Mike Robins

In a similar manner to your original code I define the parsing regex's

以与您的原始代码类似的方式,我定义了解析正则表达式

import re
import pandas as pd

parse_re = {
    'school': re.compile(r'School = (?P<school>.*)$'),
    'grade': re.compile(r'Grade = (?P<grade>\d+)'),
    'student': re.compile(r'Student number, (?P<info>\w+)'),
    'data': re.compile(r'(?P<number>\d+), (?P<value>.*)$'),
}

def parse(line):
    '''parse the line by regex search against possible line formats
       returning the id and match result of first matching regex,
       or None if no match is found'''
    return reduce(lambda (i,m),(id,rx): (i,m) if m else (id, rx.search(line)), 
                  parse_re.items(), (None,None))

then loop through the lines gathering the information about each student. Once the record is complete (when we have Scorethe record is complete) we append the record to a list.

然后遍历收集每个学生信息的行。一旦记录完成(当我们有Score记录完成时),我们将记录附加到列表中。

A small state machine that is driven by the line by line regex matches collates each record. In particular we have to save the students in a grade by number as their Score and Name are provided separately in the input file.

由逐行正则表达式匹配驱动的小型状态机整理每条记录。特别是我们必须按编号保存学生的成绩,因为他们的分数和姓名在输入文件中单独提供。

results = []
with open('sample.txt') as f:
    record = {}
    for line in f:
        id, match = parse(line)

        if match is None:
            continue

        if id == 'school':
            record['School'] = match.group('school')
        elif id == 'grade':
            record['Grade'] = int(match.group('grade'))
            names = {}  # names is a number indexed dictionary of student names
        elif id == 'student':
            info = match.group('info')
        elif id == 'data':
            number = int(match.group('number'))
            value = match.group('value')
            if info == 'Name':
                names[number] = value
            elif info == 'Score':
                record['Student number'] = number
                record['Name'] = names[number]
                record['Score'] = int(value)
                results.append(record.copy())

Finally the list of records is converted to a DataFrame.

最后,记录列表被转换为DataFrame.

df = pd.DataFrame(results, columns=['School', 'Grade', 'Student number', 'Name', 'Score'])
print df

Outputs:

输出:

            School  Grade  Student number      Name  Score
0   Riverdale High      1               0    Phoebe      3
1   Riverdale High      1               1    Rachel      7
2   Riverdale High      2               0    Angela      6
3   Riverdale High      2               1   Tristan      3
4   Riverdale High      2               2    Aurora      9
5         Hogwarts      1               0     Ginny      8
6         Hogwarts      1               1      Luna      7
7         Hogwarts      2               0     Harry      5
8         Hogwarts      2               1  Hermione     10
9         Hogwarts      3               0      Fred      0
10        Hogwarts      3               1    George      0

Some optimizations would be to compare the most common regex's first and to explicitly skip blank lines. Building the dataframe as we go would avoid extra copies of the data but I gather that appending to a dataframe is an expensive operation.

一些优化将比较最常见的正则表达式的第一个并显式跳过空行。在我们进行时构建数据帧将避免额外的数据副本,但我认为附加到数据帧是一项昂贵的操作。