Check for words from list and remove those words in pandas dataframe column
Original question: http://stackoverflow.com/questions/45447848/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Asked by haimen
I have a list as follows:
remove_words = ['abc', 'deff', 'pls']
This is my data frame, which has a column named 'string':
data['string']
0 abc stack overflow
1 abc123
2 deff comedy
3 definitely
4 pls lkjh
5 pls1234
I want to remove every word in the remove_words list from the pandas dataframe column, but only where the word occurs on its own, not where it is part of a longer word.
For example, a standalone 'abc' in the column should be replaced with '', but 'abc123' should be left as it is. The output here should be:
data['string']
0 stack overflow
1 abc123
2 comedy
3 definitely
4 lkjh
5 pls1234
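For anyone who wants to reproduce the example, the sample data can be built roughly like this. It is only a setup sketch: the list and the variable name `data` come from the question, while `expected` is a name introduced here to hold the desired output listed above.

```python
import pandas as pd

# Words to remove, copied from the question.
remove_words = ['abc', 'deff', 'pls']

# Sample frame from the question, with a single column named 'string'.
data = pd.DataFrame({
    'string': [
        'abc stack overflow',
        'abc123',
        'deff comedy',
        'definitely',
        'pls lkjh',
        'pls1234',
    ]
})

# Desired result as listed in the question (`expected` is a hypothetical name).
expected = pd.Series(
    ['stack overflow', 'abc123', 'comedy', 'definitely', 'lkjh', 'pls1234'],
    name='string',
)
```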
In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe, so I am looking for the most efficient way to do this.
I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.
Thanks
Answered by MaxU
Try this:
In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))
In [99]: pat
Out[99]: '\\b(?:abc|deff|pls)\\b'
In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)
In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2         deff comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
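Here is a self-contained sketch of the same word-boundary idea, in case it helps to run it end to end. A few things in it are my additions rather than part of the answer: `re.escape` (in case a list entry contains regex metacharacters), the explicit `regex=True` (needed on pandas 2.0 and later, where `str.replace` treats the pattern literally by default), and the final whitespace cleanup so the result matches the output the question asks for.

```python
import re

import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({
    'string': [
        'abc stack overflow',
        'abc123',
        'deff comedy',
        'definitely',
        'pls lkjh',
        'pls1234',
    ]
})

# One alternation wrapped in word boundaries; re.escape protects entries
# that happen to contain regex metacharacters.
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words)))

# Remove the listed words only where they stand alone.
cleaned = df['string'].str.replace(pat, '', regex=True)

# Collapse the whitespace left behind by removed words and trim the ends.
df['new'] = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()

print(df)
```

Note that `\b` only behaves as a word boundary next to letters, digits, or underscores, so list entries that start or end with punctuation would need a different boundary rule.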
Answered by piRSquared
Totally taking @MaxU's pattern!
We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies, for each column, the pattern and what to replace it with.
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2         deff comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
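Since the question mentions about 2000 words and a very large frame, a regex-free variant may also be worth benchmarking. This is a different technique from the two answers above, not a restatement of them: it splits each string on whitespace and drops tokens that are in a set. The names `remove_set` and `drop_listed_words` are made up for this sketch, and it assumes every value in the column is a non-null string. I have not measured it against the regex versions, so treat any speed difference as something to test on your own data.

```python
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({
    'string': [
        'abc stack overflow',
        'abc123',
        'deff comedy',
        'definitely',
        'pls lkjh',
        'pls1234',
    ]
})

# A set gives constant-time membership checks regardless of list length.
remove_set = set(remove_words)


def drop_listed_words(text):
    """Drop whitespace-delimited tokens that appear in remove_set."""
    return ' '.join(tok for tok in text.split() if tok not in remove_set)


# Apply the filter row by row; surviving tokens are rejoined with single spaces.
df['new'] = df['string'].map(drop_listed_words)

print(df)
```

The behaviour differs from the `\b` pattern in one respect: a token such as 'abc,' is kept here because it is not an exact match, whereas the regex would remove the 'abc' part and leave the comma.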