Creating New Column In Pandas Dataframe Using Regex
Notice: this page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors on StackOverflow.
Original: http://stackoverflow.com/questions/46350705/
Asked by Cam8593
I have a column of type object in a pandas df that I want to parse to get the first number in each string, and create a new column containing that number as an int.
For example:
Existing df
col
'foo 12 bar 8'
'bar 3 foo'
'bar 32bar 98'
Desired df
col             col1
'foo 12 bar 8'  12
'bar 3 foo'     3
'bar 32bar 98'  32
I have code that works on any individual cell in the column:
int(re.search(r'\d+', df.iloc[0]['col']).group())
The above code works fine and returns 12, as it should. But when I try to create a new column using the whole series:
df['col1'] = int(re.search(r'\d+', df['col']).group())
I get the following error:
TypeError: expected string or bytes-like object
I tried wrapping str() around df['col'], which got rid of the error but yielded all 0's in col1.
I've also tried converting col to a list of strings and iterating through the list, which only yields the same error. Does anyone know what I'm doing wrong? Help would be much appreciated.
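The first two symptoms can be reproduced with a small sketch. It assumes the three example strings from the question and a default integer index, which may not match the real data. re.search accepts a single string, so passing a whole Series raises the TypeError. str() applied to a Series returns its printed form, which starts with the index label 0, so that label is the first run of digits the pattern finds.

```python
import re
import pandas as pd

# Assumed sample data, copied from the example in the question.
df = pd.DataFrame({"col": ["foo 12 bar 8", "bar 3 foo", "bar 32bar 98"]})

# re.search needs one string; handing it a whole Series raises TypeError.
try:
    re.search(r"\d+", df["col"])
    series_error = None
except TypeError as exc:
    series_error = exc

# str() of a Series is its printed representation, index labels included,
# so the first digits found belong to the index label, not to the data.
rendered = str(df["col"])
first_digits = re.search(r"\d+", rendered).group()

# Assigning the resulting scalar broadcasts the same value to every row.
df["col1"] = int(first_digits)
```

Under these assumptions, every row of col1 receives the same scalar, which is consistent with the column of zeros described above.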
Answered by Albo
This will do the trick:
import re

search = []
for values in df['col']:
    # take the first run of digits in each string (still a str at this point)
    search.append(re.search(r'\d+', values).group())
df['col1'] = search
The output looks like this:
            col col1
0  foo 12 bar 8   12
1     bar 3 foo    3
2  bar 32bar 98   32
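The loop above stores the matches as strings, while the question asks for an int column. One possible alternative, not part of the original answer, is pandas' vectorized Series.str.extract followed by astype(int). The sketch below assumes the same three example strings and that every row contains at least one digit.

```python
import pandas as pd

# Assumed sample data, copied from the example in the question.
df = pd.DataFrame({"col": ["foo 12 bar 8", "bar 3 foo", "bar 32bar 98"]})

# Capture the first run of digits in each row; with a single capture group,
# expand=False returns a Series rather than a one-column DataFrame.
first_number = df["col"].str.extract(r"(\d+)", expand=False)

# astype(int) assumes every row matched; a row without digits would come
# back as a missing value and make this conversion fail.
df["col1"] = first_number.astype(int)
```

If some rows might contain no digits, pd.to_numeric(first_number) keeps those rows as NaN instead of raising, at the cost of a float column.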