Python: How do you split one column into two columns?

Question

Asked by a k

I have a data frame with one column and I'd like to split it into two columns, with one column header as 'fips' and the other 'row'.

My dataframe df looks like this:

          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas? This is the result I want:

         fips       row
0    00000 UNITED STATES
1    01000 ALABAMA 
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

Answer 1

Accepted answer by root

There might be a better way, but here's one approach:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL

df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

Answer 2

Answered by Andy Hayden

You can extract the different parts quite neatly using a regex pattern:

In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]: 
    fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]

To explain the somewhat long regex:

(?P<fips>\d{5})

Matches the five digits (\d) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

Does either (|) of two things:

(?P<state>[A-Z ]*$)

Matches any number (*) of capital letters or spaces ([A-Z ]) before the end of the string ($) and names this "state",

or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$)

matches anything else (.*?), then
a comma and a space, then
matches the two-letter state_code before the end of the string ($).

In the example:
Note that the first two rows hit the "state" branch (leaving NaN in the county and state_code columns), whilst the last three hit the county and state_code branch (leaving NaN in the state column).
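The two branches of the pattern can also be checked outside pandas. The following is a minimal sketch using Python's standard re module on two of the sample strings; the names PATTERN, samples and results are illustrative only. Note that the captured state and county values keep the leading space that follows the five digits.

```python
import re

# The pattern from the answer, written as a raw string and split over two lines.
PATTERN = re.compile(
    r'(?P<fips>\d{5})'
    r'((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))'
)

samples = ['01000 ALABAMA', '01001 Autauga County, AL']

# groupdict() maps each named group to its captured text (None if unused).
results = [PATTERN.match(s).groupdict() for s in samples]

for s, groups in zip(samples, results):
    print(s, '->', groups)
```

The first string takes the "state" branch, so its county and state_code entries are None; the second takes the other branch, so its state entry is None.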

Answer 3

Answered by keberwein

If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

# n=1 splits only on the first whitespace, so every row yields exactly two pieces
df["fips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
del df["row"]
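The zip(*...) idiom transposes a list of per-row pieces into per-column tuples. Here is a plain-Python sketch of that step alone, with made-up variable names, splitting each string only once so that every row yields exactly two pieces:

```python
rows = ['00000 UNITED STATES', '01000 ALABAMA', '01001 Autauga County, AL']

# Split each string once, on the first run of whitespace.
pairs = [r.split(None, 1) for r in rows]

# zip(*pairs) turns the list of [code, name] pairs into two tuples,
# one holding all the codes and one holding all the names.
fips, names = zip(*pairs)

print(fips)
print(names)
```

If the rows were split on every whitespace instead, the pieces would have different lengths and zip would silently stop at the shortest row, which is why the single split matters here.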

Answer 4

Answered by LeoRochael

TL;DR version:

For the simple case of:

I have a text column with a delimiter and I want two columns

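The simplest solution is:

df['A'], df['B'] = df['AB'].str.split(' ', n=1).str

(Unpacking the .str accessor like this relies on iterating over it, which pandas 2.0 and later no longer allow; on those versions use the expand=True form.)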

Or you can automatically create a DataFrame with one column for each entry of the split with:

df['AB'].str.split(' ', n=1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
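As a minimal sketch of the expand=True form assigned straight to two new columns (the sample data and column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'AB': ['A1 B1', 'A2 B2 extra']})

# n=1 limits the split to the first separator;
# expand=True returns a DataFrame with one column per piece.
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)

print(df)
```

Because n=1 stops after the first space, the second row keeps 'B2 extra' together in column B.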

In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough¹. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

¹ If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
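For reference, this is roughly how the plain Python str.split() behaves with its separator and maximum-split arguments (the pandas method calls the second one n). The sample string and variable names are illustrative:

```python
s = '01001 Autauga County, AL'

all_parts = s.split(' ')      # split on every single space
two_parts = s.split(' ', 1)   # at most one split, so at most two pieces
ws_parts = s.split()          # no separator given: split on runs of whitespace

print(all_parts)
print(two_parts)
print(ws_parts)
```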

But how do you go from:

a column containing two-element lists

to:

two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

>>> df['AB'].str.split('-', n=1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', n=1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it's a simple matter of taking advantage of Python's tuple unpacking of iterables (which needs a pandas version older than 2.0, where .str can still be iterated over) to do:

>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df['AB'].str.split('-', n=1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[['AB']]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>>

But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3

Answer 5

Answered by Jasmine

If you want to split a string into more than two columns based on a delimiter you can omit the 'maximum splits' parameter.
You can use:

df['column_name'].str.split('/', expand=True)

This will automatically create as many columns as the maximum number of fields included in any of your initial strings.
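A small sketch of that behaviour, using a made-up Series of '/'-separated strings with different numbers of fields:

```python
import pandas as pd

paths = pd.Series(['a/b', 'a/b/c/d', 'a'])

# No maximum-split argument: every '/' is a split point.
parts = paths.str.split('/', expand=True)

print(parts)
print(parts.shape)
```

The longest string has four fields, so the result has four columns, and the shorter rows are padded with missing values.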

Answer 6

Answered by jezrael

You can use str.split by whitespace (the default separator) with the parameter expand=True to get a DataFrame, and assign it to new columns:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL']})
print (df)
                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL



df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
                        row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

A modification, if you need to remove the original column, uses DataFrame.pop:

df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

Which is the same as:

df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you get an error:

# without n=1, the split happens on every whitespace
df[['a','b']] = df['row'].str.split(expand=True)

ValueError: Columns must be same length as key

You can check that it returns a 4-column DataFrame, not only 2:

print (df['row'].str.split(expand=True))
       0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL

Then the solution is to append the new DataFrame with join:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL'],
                    'a':range(5)})
print (df)
   a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

   a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

With the original column removed (if there are also other columns):

df = df.join(df.pop('row').str.split(expand=True))
print (df)
   a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL

Answer 7

Answered by mcchran

I prefer extracting the corresponding pandas Series (i.e. the columns I need), using the apply function to split the column content into multiple Series, and then joining the generated columns to the existing DataFrame. Of course, the source column should be removed.

e.g.

 col1 = df["<col_name>"].apply(<function>)
 col2 = ...
 df = df.join(col1.to_frame(name="<name1>"))
 df = df.join(col2.to_frame(name="<name2>"))
 df = df.drop(["<col_name>"], axis=1)

To split two-word strings, the function should be something like this:

lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
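Filling in the placeholders gives something like the following sketch. The DataFrame contents and the column names 'name', 'first' and 'last' are hypothetical, chosen only to make the template runnable:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing'], 'id': [1, 2]})

# Build one Series per target column with apply.
first = df['name'].apply(lambda x: x.split(' ')[0])   # first word
last = df['name'].apply(lambda x: x.split(' ')[-1])   # last word

# Join the new columns to the existing DataFrame, then drop the source column.
df = df.join(first.to_frame(name='first'))
df = df.join(last.to_frame(name='last'))
df = df.drop(['name'], axis=1)

print(df)
```

Note that with strings of more than two words, the first and last lambdas would skip whatever lies in between.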

Answer 8

Answered by Bhagabat Behera

df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)

Answer 9

Answered by cs95

Surprised I haven't seen this one yet. If you only need two parts, I highly recommend...

`Series.str.partition`

partition performs one split on the separator, and is generally quite performant.

df['row'].str.partition(' ')[[0, 2]]

       0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to rename the columns,

df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to join this back to the original, use join or concat:

df.join(df['row'].str.partition(' ')[[0, 2]])

pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

                        row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL
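Columns 0 and 2 are the ones selected because partition returns three pieces: the text before the separator, the separator itself, and the rest. Plain Python's str.partition behaves the same way, as this small sketch with illustrative sample strings suggests:

```python
s = '01001 Autauga County, AL'

# partition splits once, at the first occurrence of the separator.
head, sep, tail = s.partition(' ')
print(repr(head), repr(sep), repr(tail))

# When the separator is absent, the last two pieces are empty strings.
no_sep = 'ALABAMA'.partition(' ')
print(no_sep)
```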

Answer 10

Answered by Vingt Cent

I saw that no one had used the slice method, so here are my 2 cents.

df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)

Each call returns a new Series; assign the two results to new columns to complete the split.
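A minimal sketch of the slice approach applied to the question's data, assuming every value starts with a five-character code followed by exactly one separator character (the column name 'name' is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01001 Autauga County, AL']})

# Characters 0-4 hold the code; character 5 is the space; the rest is the name.
df['fips'] = df['row'].str.slice(stop=5)
df['name'] = df['row'].str.slice(start=6)

print(df)
```

This only works for fixed-width prefixes; for variable-length prefixes one of the split-based approaches is a better fit.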
