Python: how to split a column into two columns?

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14745022/

Time: 2020-08-18 12:18:27  Source: igfitidea

How to split a column into two columns?

Tags: python, dataframe, pandas

Asked by a k

I have a data frame with one column and I'd like to split it into two columns, with one column header as 'fips' and the other 'row'.

My dataframe df looks like this:

          row
0    00000 UNITED STATES
1    01000 ALABAMA
2    01001 Autauga County, AL
3    01003 Baldwin County, AL
4    01005 Barbour County, AL

I do not know how to use df.row.str[:] to achieve my goal of splitting the row cell. I can use df['fips'] = hello to add a new column and populate it with hello. Any ideas?

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

Accepted answer by root

There might be a better way, but here's one approach:

                            row
    0       00000 UNITED STATES
    1             01000 ALABAMA
    2  01001 Autauga County, AL
    3  01003 Baldwin County, AL
    4  01005 Barbour County, AL
# n=1 splits only on the first space, so the name keeps its inner spaces
df = pd.DataFrame(df.row.str.split(' ', n=1).tolist(),
                  columns=['fips', 'row'])
    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL
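
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The sample frame is rebuilt from the question's data, and the variable names parts and result are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'row': ['00000 UNITED STATES',
                           '01000 ALABAMA',
                           '01001 Autauga County, AL']})

# Split each string on the first space only, giving two-element lists.
parts = df['row'].str.split(' ', n=1).tolist()

# Build a new frame from the lists, naming the two columns.
result = pd.DataFrame(parts, columns=['fips', 'row'])
print(result)
```

Note that this builds a brand new DataFrame, so any other columns in the original frame would have to be carried over separately.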

Answer by Andy Hayden

You can extract the different parts out quite neatly using a regex pattern:

In [11]: df.row.str.extract(r'(?P<fips>\d{5})((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))')
Out[11]: 
    fips                    1           state           county state_code
0  00000        UNITED STATES   UNITED STATES              NaN        NaN
1  01000              ALABAMA         ALABAMA              NaN        NaN
2  01001   Autauga County, AL             NaN   Autauga County         AL
3  01003   Baldwin County, AL             NaN   Baldwin County         AL
4  01005   Barbour County, AL             NaN   Barbour County         AL

[5 rows x 5 columns]


To explain the somewhat long regex:

(?P<fips>\d{5})
  • Matches the five digits (\d) and names them "fips".

The next part:

((?P<state>[A-Z ]*$)|(?P<county>.*?), (?P<state_code>[A-Z]{2}$))

Does either (|) one of two things:

(?P<state>[A-Z ]*$)
  • Matches any number (*) of capital letters or spaces ([A-Z ]) before the end of the string ($) and names this "state",

or

(?P<county>.*?), (?P<state_code>[A-Z]{2}$))
  • matches anything else (.*?), then
  • a comma and a space, then
  • matches the two-letter state_code before the end of the string ($).

In the example:
Note that the first two rows hit the "state" (leaving NaN in the county and state_code columns), whilst the last three hit the county, state_code (leaving NaN in the state column).
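
If all you need is the two columns from the question, a much shorter pattern is enough. The following is only a sketch: the group name "name" is my own choice, not something from the answer above, and it assumes every row starts with exactly five digits followed by a single space.

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES',
                           '01000 ALABAMA',
                           '01001 Autauga County, AL']})

# Two named groups: five digits, one space, then everything else.
# A raw string keeps \d from being treated as a string escape.
pattern = r'(?P<fips>\d{5}) (?P<name>.*)'

# With named groups, extract returns a DataFrame whose columns
# are the group names.
result = df['row'].str.extract(pattern)
print(result)
```

Rows that do not match the pattern come back as NaN in every column, which makes malformed input easy to spot.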

Answer by keberwein

If you don't want to create a new dataframe, or if your dataframe has more columns than just the ones you want to split, you could:

# n=1 splits on the first whitespace only, so multi-word names stay intact
df["fips"], df["row_name"] = zip(*df["row"].str.split(n=1).tolist())
del df["row"]

Answer by LeoRochael

TL;DR version:

For the simple case of:

  • I have a text column with a delimiter and I want two columns

The simplest solution is:

df['A'], df['B'] = df['AB'].str.split(' ', n=1).str

Or you can create a DataFrame with one column for each entry of the split automatically with:

df['AB'].str.split(' ', n=1, expand=True)

You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.

Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
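
One caveat worth knowing: unpacking .str relies on iterating over the accessor, and as far as I'm aware recent pandas releases (2.0 and later) no longer allow that. A sketch that should behave the same on old and new versions is to assign the expand=True result to both columns at once. The sample data here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'AB': ['A1 B1', 'A2 B2']})

# expand=True returns a two-column DataFrame, which can be
# assigned to two new column names in a single statement.
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
print(df)
```

This only works when the split produces exactly as many columns as there are names on the left-hand side.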

In detail:

Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.

But for a simple split over a known separator (like splitting by dashes or splitting by whitespace), the .str.split() method is enough [1]. It operates on a column (Series) of strings, and returns a column (Series) of lists:

>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df

      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df

      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]

[1]: If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.

But how do you go from:

  • a column containing two-element lists

to:

  • two columns, each containing the respective element of the lists?

Well, we need to take a closer look at the .str attribute of a column.

It's a magical object that is used to collect methods that treat each element in a column as a string, and then apply the respective method to each element as efficiently as possible:

>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df

   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df

   U  L
0  A  a
1  B  b
2  C  c

But it also has an "indexing" interface for getting each element of a string by its index:

>>> df['AB'].str[0]

0    A
1    A
Name: AB, dtype: object

>>> df['AB'].str[1]

0    1
1    2
Name: AB, dtype: object

Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:

>>> df['AB'].str.split('-', n=1).str[0]

0    A1
1    A2
Name: AB, dtype: object

>>> df['AB'].str.split('-', n=1).str[1]

0    B1
1    B2
Name: AB, dtype: object

Then, it's a simple matter of taking advantage of Python's tuple unpacking of iterables to do:

>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df

      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2

Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:

>>> df['AB'].str.split('-', n=1, expand=True)

    0   1
0  A1  B1
1  A2  B2

So, another way of accomplishing what we wanted is to do:

>>> df = df[['AB']]
>>> df

      AB
0  A1-B1
1  A2-B2

>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))

      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2

The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:

>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
  [...]    
ValueError: Length of values does not match length of index
>>> 

But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":

>>> df.join(
...     df['AB'].str.split('-', expand=True).rename(
...         columns={0:'A', 1:'B', 2:'C'}
...     )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3
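
If you would rather always get exactly two columns, however many separators a row contains, one option is to cap the number of splits with n=1 so that everything after the first separator stays in the second column. This is only a sketch; the name capped is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})

# n=1 means "split at most once", so each row yields at most two pieces.
capped = df['AB'].str.split('-', n=1, expand=True)
capped.columns = ['A', 'B']
print(capped)
```

Here the third row keeps 'B3-C3' together in column B instead of spilling into a third column.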

Answer by Jasmine

If you want to split a string into more than two columns based on a delimiter you can omit the 'maximum splits' parameter.
You can use:

df['column_name'].str.split('/', expand=True)

This will automatically create as many columns as the maximum number of fields included in any of your initial strings.
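
The resulting columns are labelled 0, 1, 2 and so on. Here is a small sketch of giving them readable names with add_prefix. The prefix 'part_' and the sample data are assumptions of mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'column_name': ['a/b/c', 'd/e', 'f']})

# Split on every '/', one column per field, then prefix the
# integer column labels so they become part_0, part_1, part_2.
parts = df['column_name'].str.split('/', expand=True).add_prefix('part_')
print(parts)
```

Rows with fewer fields than the widest row get missing values in the trailing columns.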

Answer by jezrael

You can use str.split by whitespace (the default separator) with the parameter expand=True to get a DataFrame, and assign it to new columns:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL']})
print (df)
                        row
0       00000 UNITED STATES
1             01000 ALABAMA
2  01001 Autauga County, AL
3  01003 Baldwin County, AL
4  01005 Barbour County, AL



df[['a','b']] = df['row'].str.split(n=1, expand=True)
print (df)
                        row      a                   b
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL

If you need to remove the original column, use DataFrame.pop:

df[['a','b']] = df.pop('row').str.split(n=1, expand=True)
print (df)
       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

Which is the same as:

df[['a','b']] = df['row'].str.split(n=1, expand=True)
df = df.drop('row', axis=1)
print (df)

       a                   b
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL


If you get this error:

#remove n=1 for split by all whitespaces
df[['a','b']] = df['row'].str.split(expand=True)

ValueError: Columns must be same length as key

You can check that it returns a 4-column DataFrame, not just 2:

print (df['row'].str.split(expand=True))
       0        1        2     3
0  00000   UNITED   STATES  None
1  01000  ALABAMA     None  None
2  01001  Autauga  County,    AL
3  01003  Baldwin  County,    AL
4  01005  Barbour  County,    AL

Then the solution is to append the new DataFrame with join:

df = pd.DataFrame({'row': ['00000 UNITED STATES', '01000 ALABAMA', 
                           '01001 Autauga County, AL', '01003 Baldwin County, AL', 
                           '01005 Barbour County, AL'],
                    'a':range(5)})
print (df)
   a                       row
0  0       00000 UNITED STATES
1  1             01000 ALABAMA
2  2  01001 Autauga County, AL
3  3  01003 Baldwin County, AL
4  4  01005 Barbour County, AL

df = df.join(df['row'].str.split(expand=True))
print (df)

   a                       row      0        1        2     3
0  0       00000 UNITED STATES  00000   UNITED   STATES  None
1  1             01000 ALABAMA  01000  ALABAMA     None  None
2  2  01001 Autauga County, AL  01001  Autauga  County,    AL
3  3  01003 Baldwin County, AL  01003  Baldwin  County,    AL
4  4  01005 Barbour County, AL  01005  Barbour  County,    AL

To also remove the original column (if there are other columns as well):

df = df.join(df.pop('row').str.split(expand=True))
print (df)
   a      0        1        2     3
0  0  00000   UNITED   STATES  None
1  1  01000  ALABAMA     None  None
2  2  01001  Autauga  County,    AL
3  3  01003  Baldwin  County,    AL
4  4  01005  Barbour  County,    AL   

Answer by mcchran

I prefer exporting the corresponding pandas Series (i.e. the columns I need), using the apply function to split the column content into multiple Series, and then joining the generated columns to the existing DataFrame. Of course, the source column should be removed.

e.g.

 col1 = df["<col_name>"].apply(<function>)
 col2 = ...
 df = df.join(col1.to_frame(name="<name1>"))
 df = df.join(col2.to_frame(name="<name2>"))
 df = df.drop(["<col_name>"], axis=1)

To split two-word strings, the function should be something like this:

lambda x: x.split(" ")[0] # for the first element
lambda x: x.split(" ")[-1] # for the last element
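
Putting the template and the lambdas together, a concrete version might look like the sketch below. The column names 'name', 'first' and 'last' and the sample data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Ada Lovelace', 'Alan Turing'],
                   'id': [1, 2]})

# Apply a plain Python function to every cell to pull out each piece.
first = df['name'].apply(lambda x: x.split(' ')[0])   # first word
last = df['name'].apply(lambda x: x.split(' ')[-1])   # last word

# Join the new Series back onto the frame, then drop the source column.
df = df.join(first.to_frame(name='first'))
df = df.join(last.to_frame(name='last'))
df = df.drop(['name'], axis=1)
print(df)
```

Because apply calls a Python function once per row, this is likely to be slower on large frames than the vectorised .str methods shown in the other answers.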

Answer by Bhagabat Behera

df[['fips', 'row']] = df['row'].str.split(' ', n=1, expand=True)

Answer by cs95

Surprised I haven't seen this one yet. If you only need two splits, I highly recommend...

Series.str.partition

partition performs one split on the separator, and is generally quite performant.

df['row'].str.partition(' ')[[0, 2]]

       0                   2
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to rename the columns,

df['row'].str.partition(' ')[[0, 2]].rename({0: 'fips', 2: 'row'}, axis=1)

    fips                 row
0  00000       UNITED STATES
1  01000             ALABAMA
2  01001  Autauga County, AL
3  01003  Baldwin County, AL
4  01005  Barbour County, AL

If you need to join this back to the original, use join or concat:

df.join(df['row'].str.partition(' ')[[0, 2]])

pd.concat([df, df['row'].str.partition(' ')[[0, 2]]], axis=1)

                        row      0                   2
0       00000 UNITED STATES  00000       UNITED STATES
1             01000 ALABAMA  01000             ALABAMA
2  01001 Autauga County, AL  01001  Autauga County, AL
3  01003 Baldwin County, AL  01003  Baldwin County, AL
4  01005 Barbour County, AL  01005  Barbour County, AL
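
One detail worth checking for your own data is what happens when a row has no separator at all. The sketch below uses a made-up second row ('99999') to show it; the column names are my own labels.

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES', '99999']})

# partition always returns three parts: before, separator, after.
parts = df['row'].str.partition(' ')
parts.columns = ['fips', 'sep', 'name']
print(parts)
```

When the separator is missing, the whole string lands in the first column and the other two are empty strings rather than None, which differs from what split(expand=True) gives you.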

Answer by Vingt Cent

I saw that no one had used the slice method, so here are my 2 cents.

df["<col_name>"].str.slice(stop=5)
df["<col_name>"].str.slice(start=6)

These calls return the two parts as new Series, which you can assign to new columns.
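
Applied to the data from the question, that could look like the sketch below. It assumes the code is always exactly five characters followed by one space, and the column name 'name' is my own choice.

```python
import pandas as pd

df = pd.DataFrame({'row': ['00000 UNITED STATES',
                           '01001 Autauga County, AL']})

# Characters 0-4 are the code; character 5 is the space; the rest is the name.
df['fips'] = df['row'].str.slice(stop=5)
df['name'] = df['row'].str.slice(start=6)
print(df)
```

Slicing by position is simple and fast, but it silently gives wrong results if any row deviates from the fixed-width layout.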