Check for words from a list and remove those words in a pandas dataframe column

Question

Asked by haimen

I have a list as follows:

remove_words = ['abc', 'deff', 'pls']

The following is my data frame, which has a column named 'string':

     data['string']

0    abc stack overflow
1    abc123
2    deff comedy
3    definitely
4    pls lkjh
5    pls1234

I want to check the pandas dataframe column for words from the remove_words list and remove them. A word should be removed only when it occurs on its own, not when it is part of a longer word.

For example, if 'abc' occurs in the column as a separate word, replace it with ''; if it occurs as part of abc123, leave it as it is. The output here should be:

     data['string']

0    stack overflow
1    abc123
2    comedy
3    definitely
4    lkjh
5    pls1234

In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe, so I am looking for the most efficient way to do this.

I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.

Thanks

Answer 1

Answered by MaxU

Try this:

In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))

In [99]: pat
Out[99]: '\\b(?:abc|def|pls)\\b'

In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)

In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
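
The transcript above can be sketched as a self-contained script. This is only an illustration under a few assumptions: it uses the question's sample data (with 'deff' rather than 'def'), it escapes each word with re.escape in case the real list contains regex metacharacters, and it adds a whitespace cleanup step that is not part of the original answer. The explicit regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal string by default.

```python
import re

import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# Escape each word so regex metacharacters in the list are matched literally,
# then wrap the alternation in word boundaries to match whole words only.
pat = r'\b(?:{})\b'.format('|'.join(map(re.escape, remove_words)))

df['new'] = (
    df['string']
    .str.replace(pat, '', regex=True)       # drop whole-word matches
    .str.replace(r'\s+', ' ', regex=True)   # collapse the gaps left behind (added step)
    .str.strip()                            # trim leading/trailing spaces (added step)
)

print(df)
```

Without the last two steps, a row such as 'abc stack overflow' keeps the space that followed the removed word.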

Answer 2

Answered by piRSquared

Totally taking @MaxU's pattern!

We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies, for each column, the pattern and what to replace it with.

pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])

df.assign(new=df.replace(dict(string={pat: ''}), regex=True))

               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
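
Since the question mentions about 2000 words and asks about efficiency, one possible alternative to a large regex alternation is to split each string on whitespace and drop tokens found in a set. This is a sketch that neither answer proposes and that has not been benchmarked here; drop_listed_words is a hypothetical helper name. It also behaves differently from the \b pattern: only whitespace-delimited tokens count as words, so 'abc,' would be left in place, and runs of whitespace are collapsed to single spaces.

```python
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
remove_set = frozenset(remove_words)  # constant-time membership test per token

def drop_listed_words(text):
    """Hypothetical helper: drop whitespace-delimited tokens found in remove_set."""
    return ' '.join(tok for tok in text.split() if tok not in remove_set)

df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# Apply the helper row by row; tokens like 'abc123' are not in the set and survive.
df['new'] = df['string'].map(drop_listed_words)

print(df)
```

Whether this is faster than the regex approach depends on the data, so it is worth timing both on a sample before committing to either.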
