Parsing unstructured text in Python
Original question: http://stackoverflow.com/questions/1419653/
Note: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Asked by Francis
I wanted to parse a text file that contains unstructured text. I need to get the address, date of birth, name, sex, and ID.
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
In the example above, the first 3 lines are the address, the line with just an "F" is the sex, the DOB is the line after "F", the name comes after the DOB, the ID after the name, and the number 12 under the ID is the index/record no.
However, the format is not consistent. In the second group, the address is 4 lines instead of 3 and the index/record no. is appended after the name (if the person doesn't have an ID field).
I wanted to rewrite the text into the following format:
name, ID, address, sex, DOB
Answer by PaulMcG
Here is a first stab at a pyparsing solution (easy-to-copy code at the pyparsing pastebin). Walk through the separate parts, according to the interleaved comments.
data = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""
from pyparsing import LineEnd, oneOf, Word, nums, Combine, restOfLine, \
    alphanums, Suppress, empty, originalTextFor, OneOrMore, alphas, \
    Group, ZeroOrMore
NL = LineEnd().suppress()
gender = oneOf("M F")
integer = Word(nums)
date = Combine(integer + '/' + integer + '/' + integer)
# define the simple line definitions
gender_line = gender("sex") + NL
dob_line = date("DOB") + NL
name_line = restOfLine("name") + NL
id_line = Word(alphanums+"-")("ID") + NL
recnum_line = integer("recnum") + NL
# define forms of address lines
first_addr_line = Suppress('.') + empty + restOfLine + NL
# a subsequent address line is any line that is not a gender definition
subsq_addr_line = ~(gender_line) + restOfLine + NL
# a line with a name and a recnum combined, if there is no ID
name_recnum_line = originalTextFor(OneOrMore(Word(alphas+',')))("name") + \
    integer("recnum") + NL
# defining the form of an overall record, either with or without an ID
record = Group((first_addr_line + ZeroOrMore(subsq_addr_line))("address") +
               gender_line +
               dob_line +
               ((name_line +
                 id_line +
                 recnum_line) |
                name_recnum_line))
# parse data
records = OneOrMore(record).parseString(data)
# output the desired results (note that address is actually a list of lines)
for rec in records:
    if rec.ID:
        print("%(name)s, %(ID)s, %(address)s, %(sex)s, %(DOB)s" % rec)
    else:
        print("%(name)s, , %(address)s, %(sex)s, %(DOB)s" % rec)
print()
# how to access the individual fields of the parsed record
for rec in records:
    print(rec.dump())
    print(rec.name, 'is', rec.sex)
    print()
Prints:
ALOMO, TERESITA CABALLES, 3412-00000-A1652TCA2, ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], F, 01/16/1952
AMURAO, CALIXTO MANALO, , ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'], M, 10/14/1967
['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'F', '01/16/1952', 'ALOMO, TERESITA CABALLES', '3412-00000-A1652TCA2', '12']
- DOB: 01/16/1952
- ID: 3412-00000-A1652TCA2
- address: ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: ALOMO, TERESITA CABALLES
- recnum: 12
- sex: F
ALOMO, TERESITA CABALLES is F
['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS', 'M', '10/14/1967', 'AMURAO, CALIXTO MANALO', '13']
- DOB: 10/14/1967
- address: ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,', 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS']
- name: AMURAO, CALIXTO MANALO
- recnum: 13
- sex: M
AMURAO, CALIXTO MANALO is M
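The output above still shows the address as a list of lines, while the question asks for one "name, ID, address, sex, DOB" row per person. A minimal sketch of that last step follows, assuming the parsed records have been copied into plain dicts (the `records` list and the `to_csv` helper are illustrative stand-ins, not part of the answer). Because names and addresses contain commas themselves, the sketch uses the standard `csv` module so those fields get quoted.

```python
import csv
import io

# Stand-ins for the parsed records: plain dicts holding the same fields
# that the parser output shows (hypothetical structure, for illustration).
records = [
    {'name': 'ALOMO, TERESITA CABALLES', 'ID': '3412-00000-A1652TCA2',
     'address': ['55 MORILLO ZONE VIII,', 'BARANGAY ZONE VIII',
                 '(POB.), LUISIANA, LAGROS'],
     'sex': 'F', 'DOB': '01/16/1952'},
    {'name': 'AMURAO, CALIXTO MANALO',
     'address': ['22 FABRICANTE ST. ZONE', 'VIII LUISIANA LAGROS,',
                 'BARANGAY ZONE VIII', '(POB.), LUISIANA, LAGROS'],
     'sex': 'M', 'DOB': '10/14/1967'},
]


def to_csv(records):
    """Return the records as CSV text in the order name, ID, address, sex, DOB."""
    out = io.StringIO()
    writer = csv.writer(out)
    for rec in records:
        writer.writerow([
            rec['name'],
            rec.get('ID', ''),            # empty column when there is no ID
            ' '.join(rec['address']),     # collapse the address lines into one field
            rec['sex'],
            rec['DOB'],
        ])
    return out.getvalue()


print(to_csv(records))
```

Reading the text back with `csv.reader` recovers the five columns even though the name and address contain commas.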
Answer by Nathan
You have to exploit whatever regularity and structure the text does have.
I suggest you read one line at a time and match it against regular expressions to determine its type, then fill in the appropriate field in a person object. Write out that object and start a new one whenever you get a field that you have already filled in.
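The line-by-line idea could look roughly like the sketch below. The patterns, the dict used as the "person object", and the `parse_people` name are all assumptions inferred from the two sample records in the question; real data will likely need looser patterns. A new person is started when a line arrives after the record number has already been filled in.

```python
import re

# Hypothetical patterns, inferred only from the two sample records.
SEX_RE = re.compile(r'^[MF]$')
DOB_RE = re.compile(r'^\d{2}/\d{2}/\d{4}$')
ID_RE = re.compile(r'^\d{4}-\d{5}-\w+$')
RECNUM_RE = re.compile(r'^\d+$')
NAME_RECNUM_RE = re.compile(r'^(?P<name>\D+?)(?P<recnum>\d+)$')

SAMPLE = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""


def parse_people(lines):
    """Yield one dict per person, classifying each line by regex."""
    person = {'address': []}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if 'recnum' in person:
            # The record number was the last field, so this line starts a new person.
            yield person
            person = {'address': []}
        if 'sex' not in person:
            if SEX_RE.match(line):
                person['sex'] = line
            else:
                # Everything before the sex line is address; drop the leading ". "
                person['address'].append(line.lstrip('. '))
        elif 'DOB' not in person and DOB_RE.match(line):
            person['DOB'] = line
        elif 'name' not in person:
            m = NAME_RECNUM_RE.match(line)
            if m:  # no ID: the record number is glued to the name
                person['name'] = m.group('name').strip()
                person['recnum'] = m.group('recnum')
            else:
                person['name'] = line
        elif ID_RE.match(line):
            person['ID'] = line
        elif RECNUM_RE.match(line):
            person['recnum'] = line
    if 'sex' in person:
        yield person


for p in parse_people(SAMPLE.splitlines()):
    print(p['name'], p.get('ID', ''), ' '.join(p['address']), p['sex'], p['DOB'], sep=' | ')
```

An address line consisting of a single "M" or "F" would be mistaken for the sex line; that limitation comes from the data format rather than from the approach.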
Answer by Tristan
It may be overkill, but the leading edge machine learning algorithms for this type of problem are based on conditional random fields. For example, Accurate Information Extraction from Research Papers using Conditional Random Fields.
There is software out there that makes training these models relatively easy. See Mallet or CRF++.
Answer by twneale
You can probably do this with regular expressions without too much difficulty. If you have never used them before, check out the Python documentation, then fire up redemo.py (on my computer, it's in c:\python26\Tools\scripts).
The first task is to split the flat file into a list of entities (one chunk of text per record). From the snippet of text you gave, you could split the file with a pattern matching the beginning of a line, where the first character is a dot:
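import re
# re.MULTILINE makes ^ match at the start of every line, not only at the start of the file
re_entity_splitter = re.compile(r'^\.', re.MULTILINE)
# textfile is the path to the flat file; the text before the first dot is empty, so drop blank chunks
entities = [e for e in re_entity_splitter.split(open(textfile).read()) if e.strip()]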
Note that the dot must be escaped (by default it is a wildcard character). Note also the r before the pattern: it denotes a raw string, which spares you from having to escape the backslashes themselves and so avoids the so-called 'backslash plague.'
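A small sketch of both points, using a made-up two-line string (the `text` value and the variable names are illustrative only): the same pattern written with and without the r prefix, and the difference between an escaped and an unescaped dot.

```python
import re

text = "line one\n. record starts here"

# Same pattern written twice: without the r prefix the backslash must be doubled.
plain = re.compile('^\\.', re.MULTILINE)
raw = re.compile(r'^\.', re.MULTILINE)
print(plain.pattern == raw.pattern)

# Escaped dot: only a literal "." at the start of a line matches.
literal_hits = raw.findall(text)
# Unescaped dot: the first character of every line matches, whatever it is.
wildcard_hits = re.findall(r'^.', text, re.MULTILINE)
print(literal_hits, wildcard_hits)
```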
Once you have the file split into individual people, picking out the gender and birthdate is a snap. Use these:
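# the sex is a line containing only M or F, so anchor both ends and match per line
re_gender = re.compile(r'^[MF]$', re.MULTILINE)
# dates in the sample look like 01/16/1952, so the year needs four digits
re_birth_Date = re.compile(r'\d\d/\d\d/\d{4}')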
And away you go. You can paste the flat file into the redemo GUI and experiment with patterns until they match what you need. You'll have it parsed in no time. Once you get good at this, you can use symbolic group names (see docs) to pick out individual elements quickly and cleanly.
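As a rough illustration of symbolic group names, the sketch below matches one whole record chunk with a single verbose pattern. `RECORD_RE`, `parse_chunk`, and the group names are assumptions built from the two sample records in the question, not a pattern given in the answer; the second record-number group has its own name (`recnum2`) because Python does not allow two groups with the same name.

```python
import re

# Hypothetical pattern for one record chunk (the text between two leading dots).
RECORD_RE = re.compile(
    r'''
    (?P<address>.+?)\n            # address lines, as few as possible
    (?P<sex>[MF])\n               # a line holding only M or F
    (?P<dob>\d\d/\d\d/\d{4})\n    # date of birth
    (?P<name>[^\d\n]+?)           # name: no digits, stays on one line
    (?:\n(?P<id>[\w-]+)\n         # either an ID line followed by ...
       (?P<recnum>\d+)            #   ... a record-number line
     |(?P<recnum2>\d+))           # or a record number glued to the name
    \s*$
    ''',
    re.VERBOSE | re.DOTALL,
)

SAMPLE = """\
. 55 MORILLO ZONE VIII,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
F
01/16/1952
ALOMO, TERESITA CABALLES
3412-00000-A1652TCA2
12
. 22 FABRICANTE ST. ZONE
VIII LUISIANA LAGROS,
BARANGAY ZONE VIII
(POB.), LUISIANA, LAGROS
M
10/14/1967
AMURAO, CALIXTO MANALO13
"""


def parse_chunk(chunk):
    """Return the fields of one record chunk as a dict, or None if it does not match."""
    m = RECORD_RE.match(chunk.strip())
    if m is None:
        return None
    fields = m.groupdict()
    return {
        'name': fields['name'].strip(),
        'ID': fields['id'],                               # None when the record has no ID
        'address': ' '.join(fields['address'].split()),   # join the address lines
        'sex': fields['sex'],
        'DOB': fields['dob'],
        'recnum': fields['recnum'] or fields['recnum2'],
    }


# Split on a dot at the start of a line, then parse each non-empty chunk.
chunks = re.split(r'^\.', SAMPLE, flags=re.MULTILINE)
records = [parse_chunk(c) for c in chunks if c.strip()]
for rec in records:
    print(rec)
```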
Answer by Unknown
Here's a quick hack job.
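import re

def process(file):
    """Read one record from file and print it; return False at end of file."""
    address_lines = []
    while True:
        line = file.readline()
        if line == '':  # readline() returns '' only at end of file
            return False
        line = line.rstrip()  # to ignore \n
        if line in ('M', 'F'):
            sex = line
            break
        address_lines.append(line)
    address = ' '.join(address_lines)  # keep a space between the address lines
    DOB = file.readline().rstrip()  # to ignore \n
    name = file.readline().rstrip()
    if name[-1].isdigit():
        # no ID field: the record number is appended to the name, so cut it off
        name = re.match(r'^([^\d]+)\d+', name).group(1)
        ID = None
    else:
        ID = file.readline().rstrip()
        file.readline()  # ignore the record #
    print(name, ID, address, sex, DOB)
    return True

with open('data.txt') as f:
    # keep reading records until process() reports the end of the file
    while process(f):
        pass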