Check for words from list and remove those words in pandas dataframe column

Note: this page reproduces a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), citing the original address: http://stackoverflow.com/questions/45447848/

Tags: python, regex, python-2.7, pandas, replace

Asked by haimen

I have a list as follows:

remove_words = ['abc', 'deff', 'pls']

The following is my data frame, with a column named 'string':

     data['string']

0    abc stack overflow
1    abc123
2    deff comedy
3    definitely
4    pls lkjh
5    pls1234

I want to check the pandas dataframe column for words from the remove_words list and remove them. Only words that stand alone should be matched, not words that occur as part of a longer word.

For example, if 'abc' appears as a standalone word in the column, it should be replaced with '', but if it appears inside abc123, the value should be left as it is. The output here should be:

     data['string']

0    stack overflow
1    abc123
2    comedy
3    definitely
4    lkjh
5    pls1234
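
The "standalone word" requirement above corresponds to the regex word boundary \b. A minimal sketch using only Python's built-in re module and two of the sample strings; the helper name strip_word is made up for illustration and is not part of the question:

```python
import re

def strip_word(text, word):
    """Remove `word` from `text` only where it stands alone as a whole word."""
    # \b matches between a word character (letter, digit, underscore)
    # and a non-word character or the start/end of the string.
    pattern = r'\b' + re.escape(word) + r'\b'
    # .strip() removes the space left behind at either end of the string
    return re.sub(pattern, '', text).strip()

print(strip_word('abc stack overflow', 'abc'))  # stack overflow
print(strip_word('abc123', 'abc'))              # abc123
```

Because digits count as word characters, there is no boundary between 'abc' and '123', so abc123 is left untouched.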

In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe, so I am looking for the most efficient way to do this.

I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.

Thanks
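
For the scale mentioned in the question (2000 words, billions of rows), one possible alternative to a large regex alternation is a set lookup per token. The sketch below is an assumption-laden illustration, not something proposed in the question or the answers: it treats whitespace as the only separator, which is stricter than a regex word boundary (a token such as 'abc,' would not be removed), and the name drop_listed_tokens is hypothetical.

```python
# A set gives constant-time membership tests, however many words are listed
remove_words = {'abc', 'deff', 'pls'}

def drop_listed_tokens(text, words=remove_words):
    """Keep only the whitespace-separated tokens that are not in `words`."""
    # Note: rejoining with single spaces also collapses repeated whitespace
    return ' '.join(tok for tok in text.split() if tok not in words)

rows = ['abc stack overflow', 'abc123', 'deff comedy',
        'definitely', 'pls lkjh', 'pls1234']
cleaned = [drop_listed_tokens(r) for r in rows]
print(cleaned)
```

On a pandas column the same function could be applied with data['string'].map(drop_listed_tokens). Whether it actually beats the regex approach would need to be measured on real data.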

Answered by MaxU

Try this:

In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))

In [99]: pat
Out[99]: '\\b(?:abc|deff|pls)\\b'

In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)

In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2         deff comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
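
For anyone who wants to run this outside an IPython session, here is a self-contained sketch of the same idea. Two steps are additions that are not in the answer above: re.escape, which guards against list entries that contain regex metacharacters, and the final whitespace cleanup, since removing a word leaves its neighbouring space behind. regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default.

```python
import re
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# One alternation wrapped in word boundaries; escaping each word is an addition
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words)))

df['new'] = (
    df['string']
    .str.replace(pat, '', regex=True)      # remove whole-word matches
    .str.replace(r'\s+', ' ', regex=True)  # collapse leftover runs of whitespace
    .str.strip()                           # drop leading/trailing spaces
)
print(df)
```

Compiling a single alternation means each string is scanned once for all the words, rather than once per word.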

Answered by piRSquared

Totally taking @MaxU's pattern!

We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies, for each column, the pattern to look for and what to replace it with.

pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])

               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2         deff comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
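
To show how the dictionary-of-dictionaries form scopes the replacement to one column, here is a small self-contained sketch. The extra column named other is made up for illustration, and the trailing .str.strip() is an addition to remove the space left where a word was deleted.

```python
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({
    'string': ['abc stack overflow', 'abc123', 'deff comedy',
               'definitely', 'pls lkjh', 'pls1234'],
    'other': ['abc', 'pls', 'deff', 'x', 'y', 'z'],  # hypothetical second column
})

pat = r'\b(?:{})\b'.format('|'.join(remove_words))

# Outer key: column name. Inner dict: regex pattern -> replacement.
replaced = df.replace({'string': {pat: ''}}, regex=True)

# Select the one column explicitly so this also works on multi-column frames
result = df.assign(new=replaced['string'].str.strip())
print(result)
```

Only the string column is searched; the values in other are left as they were even though they match the pattern.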