Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/46350705/

Time: 2020-09-14 04:30:27  Source: igfitidea

Creating New Column In Pandas Dataframe Using Regex

python regex pandas

Asked by Cam8593

I have a column of type object in a pandas df. I want to parse each string to get the first number in it, and create a new column containing that number as an int.

For example:

Existing df

    col
    'foo 12 bar 8'
    'bar 3 foo'
    'bar 32bar 98'

Desired df

    col               col1
    'foo 12 bar 8'    12
    'bar 3 foo'       3
    'bar 32bar 98'    32

I have code that works on any individual cell in the column:

int(re.search(r'\d+', df.iloc[0]['col']).group())

The above code works fine and returns 12 as it should. But when I try to create a new column using the whole series:

df['col1'] = int(re.search(r'\d+', df['col']).group())

I get the following error:

TypeError: expected string or bytes-like object

I tried wrapping df['col'] in str(), which got rid of the error but yielded all 0's in col1.
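The all-zero result can be reproduced with a small sketch. It assumes a DataFrame built from the example strings above; the names text and first_digits are illustrative only. Calling str() on a whole Series returns its printed representation, row labels included, so the first run of digits the regex finds is likely the row label 0 rather than a number from the data, and that single value is then assigned to every row.

```python
import re
import pandas as pd

# Assumed sample data, taken from the example table in the question.
df = pd.DataFrame({'col': ['foo 12 bar 8', 'bar 3 foo', 'bar 32bar 98']})

# str() on a Series gives one string: the printed Series, row labels included.
text = str(df['col'])

# The first run of digits in that text is the label of the first row,
# not the 12 inside 'foo 12 bar 8'.
first_digits = re.search(r'\d+', text).group()

# Assigning a single scalar to a column broadcasts it to every row.
df['col1'] = int(first_digits)
print(df)
```

With the default integer index, every row of col1 ends up holding the same value, which matches the symptom described.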

I've also tried converting col to a list of strings and iterating through the list, which only yields the same error. Does anyone know what I'm doing wrong? Help would be much appreciated.

Answered by Albo

This will do the trick:

import re

# Take the first run of digits from each string and convert it to int.
search = []
for value in df['col']:
    search.append(int(re.search(r'\d+', value).group()))

df['col1'] = search

The output looks like this:

            col    col1
0  foo 12 bar 8      12
1     bar 3 foo       3
2  bar 32bar 98      32
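
As a possible alternative to the explicit loop (not part of the original answer), pandas' own string methods can apply the regex to the whole column at once. The sketch below assumes the same example data and that every row contains at least one digit; first_number is an illustrative name.

```python
import pandas as pd

# Assumed sample data, taken from the example table in the question.
df = pd.DataFrame({'col': ['foo 12 bar 8', 'bar 3 foo', 'bar 32bar 98']})

# Capture the first run of digits in each string.
# expand=False returns a Series instead of a one-column DataFrame.
first_number = df['col'].str.extract(r'(\d+)', expand=False)

# Rows without any digits would be NaN here, and astype(int) would then raise,
# so this conversion relies on the assumption that every row has a number.
df['col1'] = first_number.astype(int)
print(df)
```

If some rows might have no digits, the conversion step would need to handle missing values first, for example by dropping or filling them.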