Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors. Original: http://stackoverflow.com/questions/25292838/

Applying regex to a pandas DataFrame

Tags: python, regex, pandas

Asked by itjcms18

I'm having trouble applying a regex function to a column in a pandas DataFrame. Here is the head of my DataFrame:

               Name   Season          School   G    MP  FGA  3P  3PA    3P%
 74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625   
 84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455   
 176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000   
 177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300   
 243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448  

I thought I had a pretty good grasp of applying functions to DataFrames, so maybe my regex skills are lacking.

Here is what I put together:

import re

def split_it(year):
    return re.findall('(\d\d\d\d)', year)

df['Season2'] = df['Season'].apply(split_it(x))

TypeError: expected string or buffer

The desired output is a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.

Thanks for any help in advance.

Accepted answer by DSM

When I try (a variant of) your code, I get NameError: name 'x' is not defined, which it isn't. The expression split_it(x) is evaluated before apply is called, and no variable x exists at that point.
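
To illustrate the difference between calling a function and passing it, here is a minimal sketch. The small DataFrame is a stand-in that reproduces only the Season column from the question, and the names error_name and result are illustrative.

```python
import re

import pandas as pd

# Stand-in data: only the Season column from the question's DataFrame.
df = pd.DataFrame({"Season": ["1982-83", "1982-83", "1983-84"]})


def split_it(year):
    return re.findall(r"(\d\d\d\d)", year)


# split_it(x) is evaluated immediately, before apply runs. Since x is not
# defined anywhere, Python raises NameError.
try:
    df["Season"].apply(split_it(x))
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

# Passing the function object itself lets apply call it once per value.
result = df["Season"].apply(split_it)
```

Here error_name ends up as "NameError", while result holds one list of matches per row.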

You could use either

df['Season2'] = df['Season'].apply(split_it)

or

df['Season2'] = df['Season'].apply(lambda x: split_it(x))

but the second one is just a longer and slower way to write the first, so there's not much point (unless you have other arguments to handle, which we don't here). Your function will return a list, though:

>>> df["Season"].apply(split_it)
74     [1982]
84     [1982]
176    [1982]
177    [1983]
243    [1982]
Name: Season, dtype: object

You could easily change that, though. FWIW, I'd use vectorized string operations and do something like this:

>>> df["Season"].str[:4].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64

or

>>> df["Season"].str.split("-").str[0].astype(int)
74     1982
84     1982
176    1982
177    1983
243    1982
Name: Season, dtype: int64
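
For anyone who wants to run the two vectorized approaches, here is a self-contained sketch. It reproduces only the index and Season column from the question, the names by_slice and by_split are illustrative, and it assumes every value is a string of the form YYYY-YY.

```python
import pandas as pd

# Reduced copy of the question's data: index labels and Season only.
df = pd.DataFrame(
    {"Season": ["1982-83", "1982-83", "1982-83", "1983-84", "1982-83"]},
    index=[74, 84, 176, 177, 243],
)

# Slice the first four characters of each string, then convert to integers.
by_slice = df["Season"].str[:4].astype(int)

# Split on the hyphen, keep the first piece, then convert to integers.
by_split = df["Season"].str.split("-").str[0].astype(int)
```

Slicing relies on the year always occupying the first four characters, while splitting relies on the hyphen being present, so the choice depends on which assumption is safer for the data at hand.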

Answer by Pratik409

The problem can be solved with the following code:

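import re

def split_it(year):
    # re.search returns a match object (or None), so .group() is valid here;
    # re.findall returns a plain list, which has no .group() method.
    match = re.search(r'\d{4}', year)
    if match:
        return match.group()

df['Season2'] = df['Season'].apply(split_it)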

You were facing this problem because some rows didn't have a year in the string.
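
A TypeError such as "expected string or buffer" is what re raises when it is handed a value that is not a string, for example a NaN standing in for a missing season. A defensive variant might look like the sketch below. The isinstance guard and the sample values are assumptions added for illustration, not part of the answer above.

```python
import re

import pandas as pd


def split_it(year):
    # Non-string values (for example NaN) cannot be searched, so skip them.
    if not isinstance(year, str):
        return None
    match = re.search(r"\d{4}", year)
    # Return the matched text, or None when there is no 4-digit run.
    return match.group() if match else None


# Hypothetical sample: one valid season, one string without a year, one NaN.
seasons = pd.Series(["1982-83", "unknown", float("nan")])
years = seasons.apply(split_it)
```

Rows without a usable year come back as missing values rather than raising an error, so they can be filtered afterwards with isna().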

Answer by Tony Alleven

I had the exact same issue. Thanks for the answers, @DSM. FYI @itjcms, you can improve the function by removing the repetition in '\d\d\d\d'.

def split_it(year):  
    return re.findall('(\d\d\d\d)', year)

Becomes:

def split_it(year):
    return re.findall('(\d{4})', year)
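
A quick check that the two patterns behave identically on a few hypothetical sample strings. Raw strings (the r prefix) are used so that Python does not try to interpret \d as a string escape. That prefix is an addition here, not part of the answer above.

```python
import re

long_form = re.compile(r"(\d\d\d\d)")
short_form = re.compile(r"(\d{4})")

samples = ["1982-83", "1983-84", "no year here", "12345"]

for text in samples:
    # Both patterns should find the same 4-digit groups in every sample.
    assert long_form.findall(text) == short_form.findall(text)

first = short_form.findall("1982-83")
```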

Answer by Gabriel

You can simply use str.extract:

df['Season2'] = df['Season'].str.extract(r'(\d{4})-\d{2}')

This matches \d{4}-\d{2} (for example, 1982-83) but extracts only the captured group in parentheses, \d{4} (for example, 1982).
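
A minimal sketch of str.extract on a stand-in Series. Passing expand=False asks pandas for a Series rather than a one-column DataFrame. That argument, the sample values, and the name start_year are additions for illustration.

```python
import pandas as pd

# Hypothetical sample: two valid seasons and one value that does not match.
seasons = pd.Series(["1982-83", "1983-84", "n/a"])

# Only the parenthesised group (\d{4}) is returned; the rest of the
# pattern just constrains where the match can occur.
start_year = seasons.str.extract(r"(\d{4})-\d{2}", expand=False)
```

Rows that do not match the pattern come back as missing values, so the extracted years stay as strings until the gaps are handled and the column is converted.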