从具有特定模式的 txt 文件创建 Pandas DataFrame

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/41386443/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 02:41:24  来源:igfitidea点击:

Create Pandas DataFrame from txt file with specific pattern

pythonregexpandastextextract

提问by Peter Wilson

I need to create a Pandas DataFrame based on a text file based on the following structure:

我需要基于基于以下结构的文本文件创建一个 Pandas DataFrame:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Hymansonville (Hymansonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

The rows with "[edit]" are States and the rows [number] are Regions. I need to split the following and repeat the State name for each Region Name thereafter.

带有“[edit]”的行是States,行[number] 是Regions。我需要拆分以下内容,然后为每个区域名称重复州名称。

Index          State          Region Name
0              Alabama        Aurburn...
1              Alabama        Florence...
2              Alabama        Hymansonville...
...
9              Alaska         Fairbanks...
10             Alaska         Arizona...
11             Alaska         Flagstaff...

Pandas DataFrame

Pandas数据框

I not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State Name for each Region Name. Please can anyone give me a starting point to begin with to accomplish the following.

我不确定如何将基于“[edit]”和“[number]”或“(characters)”的文本文件拆分为相应的列,并为每个区域名称重复 State Name。请任何人都可以给我一个起点来开始完成以下工作。

采纳答案by jezrael

You can first read_csvwith parameter namefor create DataFramewith column Region Name, separator is value which is NOT in values (like ;):

您可以首先read_csv使用参数name创建DataFrameRegion Name,分隔符是不在值中的值(如;):

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

Then insertnew column Statewith extractrows where text [edit]and replaceall values from (to the end to column Region Name.

然后 insert新列Stateextract行,其中文本[edit]replace所有值从(到结束到列Region Name

df.insert(0, 'State', df['Region Name'].str.extract('(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '')

Last remove rows where text [edit]by boolean indexing, mask is created by str.contains:

最后删除文本[edit]boolean indexing,掩码创建的行str.contains

df = df[~df['Region Name'].str.contains('\[edit\]')].reset_index(drop=True)
print (df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Hymansonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson

If need all values solution is easier:

如果需要所有值的解决方案更容易:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract('(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains('\[edit\]')].reset_index(drop=True)
print (df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Hymansonville (Hymansonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)

回答by ultra909

You could parse the file into tuples first:

您可以先将文件解析为元组:

import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []

with open('unis.txt') as f: 
    for line in f:
        l = line.rstrip('\n') 
        if l.endswith('[edit]'):
            state = l.rstrip('[edit]')
        else:            
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])

print df

output:

输出:

      State          Area
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Hymansonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson

回答by MaxU

Assuming you have the following DF:

假设您有以下 DF:

In [73]: df
Out[73]:
                                                 text
0                                       Alabama[edit]
1                       Auburn (Auburn University)[1]
2              Florence (University of North Alabama)
3     Hymansonville (Hymansonville State University)[2]
4          Livingston (University of West Alabama)[2]
5            Montevallo (University of Montevallo)[2]
6                           Troy (Troy University)[2]
7   Tuscaloosa (University of Alabama, Stillman Co...
8                   Tuskegee (Tuskegee University)[5]
9                                        Alaska[edit]
10      Fairbanks (University of Alaska Fairbanks)[2]
11                                      Arizona[edit]
12         Flagstaff (Northern Arizona University)[6]
13                   Tempe (Arizona State University)
14                     Tucson (University of Arizona)
15                                     Arkansas[edit]

you can use Series.str.extract()method:

您可以使用Series.str.extract()方法:

In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)

In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)

In [120]: df.State = df.State.ffill()

In [121]: df
Out[121]:
                                                 text     State   Region Name
0                                       Alabama[edit]   Alabama           NaN
1                       Auburn (Auburn University)[1]   Alabama        Auburn
2              Florence (University of North Alabama)   Alabama      Florence
3     Hymansonville (Hymansonville State University)[2]   Alabama  Hymansonville
4          Livingston (University of West Alabama)[2]   Alabama    Livingston
5            Montevallo (University of Montevallo)[2]   Alabama    Montevallo
6                           Troy (Troy University)[2]   Alabama          Troy
7   Tuscaloosa (University of Alabama, Stillman Co...   Alabama    Tuscaloosa
8                   Tuskegee (Tuskegee University)[5]   Alabama      Tuskegee
9                                        Alaska[edit]    Alaska           NaN
10      Fairbanks (University of Alaska Fairbanks)[2]    Alaska     Fairbanks
11                                      Arizona[edit]   Arizona           NaN
12         Flagstaff (Northern Arizona University)[6]   Arizona     Flagstaff
13                   Tempe (Arizona State University)   Arizona         Tempe
14                     Tucson (University of Arizona)   Arizona        Tucson
15                                     Arkansas[edit]  Arkansas           NaN

In [122]: df = df.dropna()

In [123]: df
Out[123]:
                                                 text    State   Region Name
1                       Auburn (Auburn University)[1]  Alabama        Auburn
2              Florence (University of North Alabama)  Alabama      Florence
3     Hymansonville (Hymansonville State University)[2]  Alabama  Hymansonville
4          Livingston (University of West Alabama)[2]  Alabama    Livingston
5            Montevallo (University of Montevallo)[2]  Alabama    Montevallo
6                           Troy (Troy University)[2]  Alabama          Troy
7   Tuscaloosa (University of Alabama, Stillman Co...  Alabama    Tuscaloosa
8                   Tuskegee (Tuskegee University)[5]  Alabama      Tuskegee
10      Fairbanks (University of Alaska Fairbanks)[2]   Alaska     Fairbanks
12         Flagstaff (Northern Arizona University)[6]  Arizona     Flagstaff
13                   Tempe (Arizona State University)  Arizona         Tempe
14                     Tucson (University of Arizona)  Arizona        Tucson

回答by piRSquared

TL;DR
s.groupby(s.str.extract('(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]

TL; 博士
s.groupby(s.str.extract('(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]



regex = '(?P<State>.*?)\[edit\]'  # pattern to match
print(s.groupby(
    # will get nulls where we don't have "[edit]"
    # forward fill fills in the most recent line
    # where we did have an "[edit]"
    s.str.extract(regex, expand=False).ffill()  
).apply(
    # I still have all the original values
    # If I group by the forward filled rows
    # I'll want to drop the first one within each group
    pd.Series.tail, n=-1
).reset_index(
    # munge the dataframe to get columns sorted
    name='Region_Name'
)[['State', 'Region_Name']])

      State                                        Region_Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Hymansonville (Hymansonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)


setup

设置

txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Hymansonville (Hymansonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""

s = pd.read_csv(StringIO(txt), sep='|', header=None, squeeze=True)

回答by Brian Leach

You will probably need to perform some additional manipulation on the file before getting it into a dataframe.

在将文件放入数据帧之前,您可能需要对文件执行一些额外的操作。

A starting point would be to split the file into lines, search for the string [edit]in each line, put the string name as the key of a dictionary when it is there...

一个起点是将文件分成几行,[edit]在每一行中搜索字符串,将字符串名称作为字典的键,当它存在时......

I do not think that Pandas has any built in methods that would handle a file in this format.

我认为 Pandas 没有任何内置方法可以处理这种格式的文件。