Create a Pandas DataFrame from a txt file with a specific pattern

Question

Asked by Peter Wilson

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Hymansonville (Hymansonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

The rows with "[edit]" are States and the rows with "[number]" are Regions. I need to split these into columns and repeat the State name for each Region Name that follows it.

Index          State          Region Name
0              Alabama        Auburn...
1              Alabama        Florence...
2              Alabama        Hymansonville...
...
9              Alaska         Fairbanks...
10             Alaska         Arizona...
11             Alaska         Flagstaff...

Pandas DataFrame

I am not sure how to split the text file on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State name for each Region Name. Can anyone give me a starting point for this?

Answer 1

Accepted answer by jezrael

You can first call read_csv with the parameter names to create a DataFrame with a single column Region Name. The separator must be a character that does NOT occur in the values (like ;):

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

Then insert a new column State, built with extract from the rows that contain the text [edit] and forward-filled, and use replace to remove everything from ( to the end of the value in column Region Name.

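df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required on pandas 2.0+, where str.replace matches literally by default
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)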

Finally, remove the rows that contain the text [edit] by boolean indexing; the mask is created by str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Hymansonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson

If you need to keep the full values, the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Hymansonville (Hymansonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)
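
The steps of this answer can be collected into one self-contained sketch. It assumes a recent pandas version and reads a shortened copy of the question's data from a string instead of filename.txt, purely so that it runs as-is; the variable names txt and state are illustrative.

```python
from io import StringIO

import pandas as pd

# Shortened sample copied from the question; a real file path works the same way.
txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Tempe (Arizona State University)
Arkansas[edit]"""

# ';' never occurs in the data, so every line lands in a single column.
df = pd.read_csv(StringIO(txt), sep=';', names=['Region Name'])

# State names sit on the '[edit]' rows; forward-fill them onto the rows below.
state = df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill()
df.insert(0, 'State', state)

# Drop the header rows, then cut everything from ' (' to the end of the value.
df = df[~df['Region Name'].str.contains('[edit]', regex=False)].reset_index(drop=True)
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)

print(df)
```

Arkansas has no region rows in the sample, so it does not appear in the result.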

Answer 2

Answer by ultra909

You could parse the file into tuples first:

import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []

with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith('[edit]'):
            # slice off the suffix; rstrip('[edit]') would strip a set of characters
            state = l[:-len('[edit]')]
        else:
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])

print(df)

output:

      State          Area
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Hymansonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
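
One pitfall when removing the [edit] marker by hand: str.rstrip treats its argument as a set of characters, not as a suffix, so it can eat into state names that end in one of the letters e, d, i or t. The short sketch below shows the effect and a suffix-safe alternative; strip_edit is a hypothetical helper name, not part of the answer.

```python
def strip_edit(line, marker='[edit]'):
    """Return line without a trailing marker (hypothetical helper)."""
    return line[:-len(marker)] if line.endswith(marker) else line


# rstrip removes any trailing run of the characters '[', 'e', 'd', 'i', 't', ']'.
print('Alabama[edit]'.rstrip('[edit]'))      # Alabama
print('Connecticut[edit]'.rstrip('[edit]'))  # Connecticu
print('Hawaii[edit]'.rstrip('[edit]'))       # Hawa

# Slicing off the suffix leaves the state name intact.
print(strip_edit('Connecticut[edit]'))       # Connecticut
print(strip_edit('Hawaii[edit]'))            # Hawaii
```

On Python 3.9 and later, the built-in str.removesuffix('[edit]') does the same job as the helper.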

Answer 3

Answer by MaxU

Assuming you have the following DF:

In [73]: df
Out[73]:
                                                 text
0                                       Alabama[edit]
1                       Auburn (Auburn University)[1]
2              Florence (University of North Alabama)
3     Hymansonville (Hymansonville State University)[2]
4          Livingston (University of West Alabama)[2]
5            Montevallo (University of Montevallo)[2]
6                           Troy (Troy University)[2]
7   Tuscaloosa (University of Alabama, Stillman Co...
8                   Tuskegee (Tuskegee University)[5]
9                                        Alaska[edit]
10      Fairbanks (University of Alaska Fairbanks)[2]
11                                      Arizona[edit]
12         Flagstaff (Northern Arizona University)[6]
13                   Tempe (Arizona State University)
14                     Tucson (University of Arizona)
15                                     Arkansas[edit]

you can use the Series.str.extract() method:

In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)

In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)

In [120]: df.State = df.State.ffill()

In [121]: df
Out[121]:
                                                 text     State   Region Name
0                                       Alabama[edit]   Alabama           NaN
1                       Auburn (Auburn University)[1]   Alabama        Auburn
2              Florence (University of North Alabama)   Alabama      Florence
3     Hymansonville (Hymansonville State University)[2]   Alabama  Hymansonville
4          Livingston (University of West Alabama)[2]   Alabama    Livingston
5            Montevallo (University of Montevallo)[2]   Alabama    Montevallo
6                           Troy (Troy University)[2]   Alabama          Troy
7   Tuscaloosa (University of Alabama, Stillman Co...   Alabama    Tuscaloosa
8                   Tuskegee (Tuskegee University)[5]   Alabama      Tuskegee
9                                        Alaska[edit]    Alaska           NaN
10      Fairbanks (University of Alaska Fairbanks)[2]    Alaska     Fairbanks
11                                      Arizona[edit]   Arizona           NaN
12         Flagstaff (Northern Arizona University)[6]   Arizona     Flagstaff
13                   Tempe (Arizona State University)   Arizona         Tempe
14                     Tucson (University of Arizona)   Arizona        Tucson
15                                     Arkansas[edit]  Arkansas           NaN

In [122]: df = df.dropna()

In [123]: df
Out[123]:
                                                 text    State   Region Name
1                       Auburn (Auburn University)[1]  Alabama        Auburn
2              Florence (University of North Alabama)  Alabama      Florence
3     Hymansonville (Hymansonville State University)[2]  Alabama  Hymansonville
4          Livingston (University of West Alabama)[2]  Alabama    Livingston
5            Montevallo (University of Montevallo)[2]  Alabama    Montevallo
6                           Troy (Troy University)[2]  Alabama          Troy
7   Tuscaloosa (University of Alabama, Stillman Co...  Alabama    Tuscaloosa
8                   Tuskegee (Tuskegee University)[5]  Alabama      Tuskegee
10      Fairbanks (University of Alaska Fairbanks)[2]   Alaska     Fairbanks
12         Flagstaff (Northern Arizona University)[6]  Arizona     Flagstaff
13                   Tempe (Arizona State University)  Arizona         Tempe
14                     Tucson (University of Arizona)  Arizona        Tucson
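
A variation on the same idea, offered only as a sketch: build the boolean mask of header rows once and reuse it for both the State column and the final row filter. It assumes the column is named text as above and uses a shortened copy of the data; is_state and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
    'Arkansas[edit]',
]})

# True on the state header rows, False on the region rows.
is_state = df['text'].str.endswith('[edit]')

# Keep the header text (minus the marker) only on header rows, then forward-fill.
df['State'] = (
    df['text'].str.replace('[edit]', '', regex=False).where(is_state).ffill()
)

# The region name is everything before the first ' (' on a region row.
df['Region Name'] = df['text'].str.extract(r'^(.*?)\s*\(', expand=False)

# Selecting by the mask avoids dropna(), which would also drop rows
# that are missing values for unrelated reasons.
result = df.loc[~is_state, ['State', 'Region Name']].reset_index(drop=True)
print(result)
```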

Answer 4

Answer by piRSquared

TL;DR
s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]

regex = r'(?P<State>.*?)\[edit\]'  # pattern to match
print(s.groupby(
    # will get nulls where we don't have "[edit]"
    # forward fill fills in the most recent line
    # where we did have an "[edit]"
    s.str.extract(regex, expand=False).ffill()  
).apply(
    # I still have all the original values
    # If I group by the forward filled rows
    # I'll want to drop the first one within each group
    pd.Series.tail, n=-1
).reset_index(
    # munge the dataframe to get columns sorted
    name='Region_Name'
)[['State', 'Region_Name']])

      State                                        Region_Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Hymansonville (Hymansonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)

setup

txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Hymansonville (Hymansonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""

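from io import StringIO

# the squeeze=True keyword was removed in pandas 2.0; DataFrame.squeeze('columns') replaces it
s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')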

Answer 5

Answer by Brian Leach

You will probably need to perform some additional manipulation on the file before getting it into a dataframe.

A starting point would be to split the file into lines, search for the string [edit] in each line, and use the state name as the key of a dictionary when it is there...

I do not think that Pandas has any built-in methods that would handle a file in this format.
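
The dictionary idea described above could look roughly like the sketch below. It is one possible reading of the answer, not the author's code: parse_regions and the sample list are hypothetical, and region names are assumed to end at the first " (".

```python
import pandas as pd


def parse_regions(lines):
    """Group region names under the most recent '[edit]' state line."""
    regions_by_state = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.endswith('[edit]'):
            # A new state starts here; slice off the marker to get its name.
            current = line[:-len('[edit]')]
            regions_by_state[current] = []
        elif current is not None:
            # Keep only the text before the first ' (' as the region name.
            regions_by_state[current].append(line.split(' (')[0])
    return regions_by_state


# With a real file this would be: parse_regions(open('filename.txt'))
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
    'Arkansas[edit]',
]

regions_by_state = parse_regions(sample)

# Flatten the dictionary into (state, region) rows for the DataFrame.
rows = [(state, region)
        for state, regions in regions_by_state.items()
        for region in regions]
df = pd.DataFrame(rows, columns=['State', 'Region Name'])
print(df)
```

States without any region lines, such as Arkansas in the sample, stay in the dictionary with an empty list and contribute no rows to the DataFrame.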
