
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/33047818/

Time: 2020-09-14 00:01:13  Source: igfitidea

remove punctuation for each row in a pandas data frame

python pandas dataframe

Asked by RJL

I am new to Python, so this may be a very basic question. I am trying to use a lambda to remove punctuation from each row of a pandas DataFrame. I used the following, but received an error. I am trying to avoid having to convert the df into a list, append the cleaned results to a new list, and then convert it back to a df.

Any suggestions would be appreciated!

import string

df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)

Answered by mechanical_meat

You need to iterate over the string in the dataframe, not over string.punctuation. You also need to build the string back up using .join().

df['cleaned'] = df['old'].apply(lambda x:''.join([i for i in x 
                                                  if i not in string.punctuation]))
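
As a sketch of why the attempt in the question fails: inside the parentheses of a call, `lambda x: ... for c in ...` is parsed as a generator expression that yields one lambda per punctuation character, so `apply` is handed a generator rather than a function. The question does not show the error message, so this is an inference from the code. The snippet below checks this in plain Python and shows a loop-based version of what the question seems to have intended; `strip_by_replace` is a hypothetical helper name.

```python
import string

# The argument from the question, written with explicit parentheses:
# a generator expression that yields one lambda per punctuation character.
attempt = (lambda x: x.replace(c, '') for c in string.punctuation)
is_callable = callable(attempt)  # a generator object is not a function

# Hypothetical helper: apply every replacement in turn to one string.
def strip_by_replace(text):
    for c in string.punctuation:
        text = text.replace(c, '')
    return text

result = strip_by_replace("a..b, c<=d!")
print(is_callable, repr(result))
```

A function like `strip_by_replace` could be passed to `apply` directly, although it scans each string once per punctuation character, whereas the `join` version above scans it once.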

When a lambda expression gets as long as the one passed to apply above, it can be more readable to write the function definition out separately, e.g. (thanks to @AndyHayden for the optimization tips):

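import string

# Build the lookup set once rather than once per character.
PUNCTUATION = frozenset(string.punctuation)

def remove_punctuation(s):
    # Keep only the characters that are not punctuation.
    return ''.join(c for c in s if c not in PUNCTUATION)

df['cleaned'] = df['old'].apply(remove_punctuation)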
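
To make the approach concrete, here is a minimal end-to-end sketch on made-up data. The sample strings are assumptions for illustration; only the column names `old` and `cleaned` come from the question. It also shows `str.translate` with a deletion table as an alternative technique that is not part of the answer above; the column name `cleaned_translate` is hypothetical.

```python
import string
import pandas as pd

PUNCTUATION = frozenset(string.punctuation)

def remove_punctuation(s):
    # Keep only the characters that are not punctuation.
    return ''.join(c for c in s if c not in PUNCTUATION)

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({'old': ["Hello, world!", "it's 5 o'clock...", "no punctuation"]})
df['cleaned'] = df['old'].apply(remove_punctuation)

# Alternative (not from the answer): a translation table that deletes
# every punctuation character, applied with the built-in str.translate.
table = str.maketrans('', '', string.punctuation)
df['cleaned_translate'] = df['old'].apply(lambda s: s.translate(table))

print(df)
```

Both versions assume every value in the column is a string; a missing value such as NaN is a float, so iterating over it or calling `.translate` on it would raise an error.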

Answered by Andy Hayden

Using a regex will most likely be faster here:

In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])  # perhaps this is available in the re/regex library?

In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])

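In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)  # regex=True is required on pandas 2.0+, where the pattern is otherwise treated as a literal string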
Out[13]:
0    ab
1    cd
2    ef
dtype: object
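
The regex approach can also be written as a standalone script. The sketch below is a variation on the answer rather than its exact code: it builds a single character class instead of an alternation, and the name `PUNCTUATION_CLASS` is an assumption for illustration. It passes `regex=True` explicitly so the pattern is interpreted as a regular expression regardless of the pandas default.

```python
import re
import string
import pandas as pd

# One character class that matches any single punctuation character.
# re.escape protects characters such as ']' and '\\' inside the class.
PUNCTUATION_CLASS = '[' + re.escape(string.punctuation) + ']'

s = pd.Series(["a..b", "c<=d", "e|}f"])
cleaned = s.str.replace(PUNCTUATION_CLASS, "", regex=True)

print(cleaned.tolist())
```

Unlike `apply` with a Python function, the `.str` accessor passes missing values through unchanged, which may matter if the column contains NaN.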