pandas: remove punctuation for each row in a pandas data frame
原文地址: http://stackoverflow.com/questions/33047818/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
remove punctuation for each row in a pandas data frame
Asked by RJL
I am new to Python, so this may be a very basic question. I am trying to use a lambda to remove punctuation from each row of a pandas DataFrame. I used the following, but received an error. I am trying to avoid having to convert the df into a list, append the cleaned results to a new list, and then convert it back to a df.
Any suggestions would be appreciated!
import string
df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)
Answered by mechanical_meat
You need to iterate over the string in the dataframe, not over string.punctuation. You also need to build the string back up using .join().
df['cleaned'] = df['old'].apply(lambda x: ''.join([i for i in x
                                                   if i not in string.punctuation]))
When a lambda expression gets that long, it can be more readable to write the function definition out separately, e.g. (thanks to @AndyHayden for the optimization tips):
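PUNCTUATION = frozenset(string.punctuation)  # build the set once, not on every character

def remove_punctuation(s):
    # Keep only the characters that are not punctuation.
    return ''.join([i for i in s if i not in PUNCTUATION])

df['cleaned'] = df['old'].apply(remove_punctuation)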
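For anyone who wants to try this end to end, here is a minimal self-contained sketch. The sample DataFrame and the names `PUNCTUATION` and `via_lambda` are made up for illustration; only the column names 'old' and 'cleaned' come from the question. It runs the lambda form and the named-function form side by side to check that they agree.

```python
import string

import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({'old': ['Hello, world!', 'a..b', 'plain text']})

# Build the lookup set once; membership tests on a frozenset are O(1).
PUNCTUATION = frozenset(string.punctuation)


def remove_punctuation(s):
    """Return s without any ASCII punctuation characters."""
    return ''.join(ch for ch in s if ch not in PUNCTUATION)


# The lambda form and the named-function form should give the same column.
via_lambda = df['old'].apply(
    lambda x: ''.join(ch for ch in x if ch not in string.punctuation))
df['cleaned'] = df['old'].apply(remove_punctuation)

print(df['cleaned'].tolist())
print(via_lambda.equals(df['cleaned']))
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes are left in place. A row that is not a string (for example NaN) would raise a TypeError here, so such rows would need to be handled first.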
Answered by Andy Hayden
Using a regex will most likely be faster here:
In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation]) # perhaps this is available in the re/regex library?
In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])
In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)
Out[13]:
0    ab
1    cd
2    ef
dtype: object
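As a standalone script, the same idea might look like the sketch below. Two things here are my assumptions rather than part of the answer: the pattern is written as a single character class instead of a long alternation, and regex=True is passed explicitly, since newer pandas releases treat the pattern as a literal string by default. The names `result` and `stripped` are only for illustration.

```python
import re
import string

import pandas as pd

# A character class that matches any single ASCII punctuation character.
# re.escape keeps characters such as ], ^, - and \ safe inside the brackets.
RE_PUNCTUATION = '[' + re.escape(string.punctuation) + ']'

s = pd.Series(["a..b", "c<=d", "e|}f"])

# regex=True makes pandas interpret the pattern as a regular expression.
result = s.str.replace(RE_PUNCTUATION, "", regex=True)
stripped = result.tolist()
print(stripped)
```

To apply it to the question's frame, the call would be df['cleaned'] = df['old'].str.replace(RE_PUNCTUATION, "", regex=True). Unlike the apply-based versions, the .str accessor passes missing values through as NaN instead of raising.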

