Python: applying regex to a pandas DataFrame
Note: this page reproduces a popular StackOverFlow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/25292838/
applying regex to a pandas dataframe
Asked by itjcms18
I'm having trouble applying a regex function to a column in a pandas DataFrame. Here is the head of my DataFrame:
                   Name   Season          School   G    MP  FGA  3P  3PA    3P%
    74       Joe Dumars  1982-83   McNeese State  29   NaN  487   5    8  0.625
    84      Sam Vincent  1982-83  Michigan State  30  1066  401   5   11  0.455
    176  Gerald Wilkins  1982-83     Chattanooga  30   820  350   0    2  0.000
    177  Gerald Wilkins  1983-84     Chattanooga  23   737  297   3   10  0.300
    243    Delaney Rudd  1982-83     Wake Forest  32  1004  324  13   29  0.448
I thought I had a pretty good grasp of applying functions to DataFrames, so maybe my regex skills are lacking.
Here is what I put together:
    import re

    def split_it(year):
        return re.findall('(\d\d\d\d)', year)

    df['Season2'] = df['Season'].apply(split_it(x))

    TypeError: expected string or buffer
The desired output is a column called Season2 that contains the year before the hyphen. I'm sure there's an easier way to do it without regex, but more importantly, I'm trying to figure out what I did wrong.
Thanks for any help in advance.
Accepted answer by DSM
When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't. The apply method expects the function itself, whereas split_it(x) tries to call the function immediately, using a name x that has not been defined.
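To see that the failure happens before apply ever runs, here is a minimal sketch. The helper first_year and the two sample values are made up for illustration and are not from the question:

```python
import pandas as pd

def first_year(season):
    # Hypothetical helper: first four characters of a "1982-83" style string.
    return season[:4]

seasons = pd.Series(["1982-83", "1983-84"])  # made-up sample data

# Passing the function object: apply calls it once per element.
years = seasons.apply(first_year)

# Calling the function inside the parentheses is evaluated first,
# and fails because no variable named x exists at that point.
try:
    seasons.apply(first_year(x))
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__
```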
You could use either

    df['Season2'] = df['Season'].apply(split_it)

or

    df['Season2'] = df['Season'].apply(lambda x: split_it(x))

but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.) Your function will return a list, though:
    >>> df["Season"].apply(split_it)
    74     [1982]
    84     [1982]
    176    [1982]
    177    [1983]
    243    [1982]
    Name: Season, dtype: object

although you could easily change that. FWIW, I'd use vectorized string operations and do something like
    >>> df["Season"].str[:4].astype(int)
    74     1982
    84     1982
    176    1982
    177    1983
    243    1982
    Name: Season, dtype: int64

or

    >>> df["Season"].str.split("-").str[0].astype(int)
    74     1982
    84     1982
    176    1982
    177    1983
    243    1982
    Name: Season, dtype: int64
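
The approaches above can be compared side by side on a small frame. This is only a sketch, assuming a made-up three-row sample and a hypothetical helper first_year that returns a single int rather than a list:

```python
import re

import pandas as pd

# Made-up sample mirroring the Season column from the question.
df = pd.DataFrame({"Season": ["1982-83", "1983-84", "1982-83"]})

def first_year(season):
    # Hypothetical variant of split_it: returns one int instead of a list.
    match = re.search(r"\d{4}", season)
    return int(match.group()) if match else None

via_apply = df["Season"].apply(first_year)                   # element-wise Python call
via_slice = df["Season"].str[:4].astype(int)                 # vectorized slice
via_split = df["Season"].str.split("-").str[0].astype(int)   # vectorized split
```

All three should produce the same years here. The vectorized versions avoid a Python-level function call per row, but they assume every value has the YYYY-YY shape.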
Answer by Pratik409
The problem can be solved with the following code:
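    import re

    def split_it(year):
        # re.search returns a match object (or None), so .group() is valid;
        # re.findall returns a list, which has no .group() method.
        match = re.search(r'\d{4}', year)
        if match:
            return match.group()

    df['Season2'] = df['Season'].apply(split_it)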
You were facing this problem because some rows didn't have a year in the string.
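
If some rows really do lack a year, or hold NaN instead of a string, the function has to guard against both cases. A minimal sketch, where the helper name extract_year and the inputs are made up for illustration:

```python
import re

YEAR_PATTERN = re.compile(r"\d{4}")

def extract_year(value):
    # Non-string values (for example NaN floats) would make re raise
    # TypeError, so return None for them instead.
    if not isinstance(value, str):
        return None
    match = YEAR_PATTERN.search(value)
    return match.group() if match else None

results = [extract_year(v) for v in ["1982-83", "no year here", float("nan")]]
```

It can be passed to apply in the same way as before, for example df['Season'].apply(extract_year).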
Answer by Tony Alleven
I had the exact same issue. Thanks for the answers @DSM.
FYI @itjcms, you can improve the function by removing the repetition in '\d\d\d\d'.
    def split_it(year):
        return re.findall('(\d\d\d\d)', year)

Becomes:

    def split_it(year):
        return re.findall('(\d{4})', year)
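
As a quick check that the two patterns behave the same, here is a small sketch on a made-up sample value. It uses raw strings, which is an addition not present in the snippets above:

```python
import re

season = "1982-83"  # made-up sample value

# Raw strings keep the backslashes intact for the regex engine.
repeated = re.findall(r"(\d\d\d\d)", season)
counted = re.findall(r"(\d{4})", season)
```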
Answer by Gabriel
You can simply use str.extract:

    df['Season2'] = df['Season'].str.extract(r'(\d{4})-\d{2}')

Here the pattern matches \d{4}-\d{2} (for example 1982-83), but only the captured group between the parentheses, \d{4} (for example 1982), is extracted.
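
A runnable sketch of this approach follows, assuming a made-up sample in which the last row does not match. The expand=False argument is an addition here; it asks pandas for a Series rather than a one-column DataFrame when the pattern has a single group:

```python
import pandas as pd

# Made-up sample; the last value has no YYYY-YY season in it.
df = pd.DataFrame({"Season": ["1982-83", "1983-84", "unknown"]})

# Only the parenthesised group is returned; non-matching rows become NaN.
df["Season2"] = df["Season"].str.extract(r"(\d{4})-\d{2}", expand=False)
```

The extracted values are strings, so a numeric conversion would still be needed if integers are wanted.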

