Fast punctuation removal with pandas

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me): StackOverflow.

Original source: http://stackoverflow.com/questions/50444346/
Asked by cs95
This is a self-answered post. Below I outline a common problem in the NLP domain and propose a few performant methods to solve it.
Oftentimes the need arises to remove punctuation during text cleaning and pre-processing. Punctuation is defined as any character in string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
This is a common enough problem and has been asked ad nauseam before. The most idiomatic solution uses pandas' str.replace. However, for situations which involve a lot of text, a more performant solution may need to be considered.
What are some good, performant alternatives to str.replace when dealing with hundreds of thousands of records?
Answer by cs95
Setup
For the purpose of demonstration, let's consider this DataFrame.
import pandas as pd

df = pd.DataFrame({'text': ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234']})
df
text
0 a..b?!??
1 %hgh&12
2 abc123!!!
3 $$$1234
Below, I list the alternatives, one by one, in increasing order of performance.
str.replace
This option is included to establish the default method as a benchmark for comparing other, more performant solutions.
This uses pandas' built-in str.replace function, which performs regex-based replacement. (On newer pandas, 2.0 and up, pass regex=True explicitly, since the default changed to literal string replacement.)
df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)
df
text
0 ab
1 hgh12
2 abc123
3 1234
This is very easy to code, and is quite readable, but slow.
regex.sub
This involves using the sub function from the re library. Pre-compile a regex pattern for performance, and call regex.sub inside a list comprehension. Convert df['text'] to a list beforehand if you can spare some memory; you'll get a nice little performance boost out of this.
import re
p = re.compile(r'[^\w\s]+')
df['text'] = [p.sub('', x) for x in df['text'].tolist()]
df
text
0 ab
1 hgh12
2 abc123
3 1234
Note: If your data has NaN values, this (as well as the next method below) will not work as is. See the section on "Other Considerations".
str.translate
Python's str.translate function is implemented in C, and is therefore very fast.
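As a quick illustration of the building block on its own (my example, not from the original answer), str.maketrans maps the given characters to None, and translate drops them:

>>> 'a..b?!??'.translate(str.maketrans('', '', '.?!'))
'ab'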
How this works is:
- First, join all your strings together to form one huge string using a single (or more) character separator that you choose. You must use a character/substring that you can guarantee will not belong inside your data.
- Perform str.translate on the huge string, removing punctuation (with the separator from step 1 excluded from the translation table).
- Split the string on the separator that was used to join in step 1. The resultant list must have the same length as your initial column.
Here, in this example, we consider the pipe separator |. If your data contains the pipe, then you must choose another separator.
import string
punct = '!"#$%&\'()*+,-./:;<=>?@[\]^_`{}~' # `|` is not present here
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['text'] = '|'.join(df['text'].tolist()).translate(transtab).split('|')
df
text
0 ab
1 hgh12
2 abc123
3 1234
Performance
str.translate performs the best, by far. Note that the graph below (not reproduced here; see the original answer) includes another variant, Series.str.translate, from MaxU's answer.
(Interestingly, I reran this a second time, and the results were slightly different from before. During the second run, it seems re.sub was winning out over str.translate for really small amounts of data.)
There is an inherent risk involved with using translate (particularly, the problem of automating the process of deciding which separator to use is non-trivial), but the trade-offs are worth the risk.
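As a rough sketch of what such automation could look like (my own addition, with a hypothetical helper name, not part of the original answer):

def find_free_separator(strings, candidates='|\x00\x01\x02'):
    # Hypothetical helper: return the first candidate character that does
    # not occur anywhere in the data, or None if every candidate appears.
    joined = ''.join(strings)
    for ch in candidates:
        if ch not in joined:
            return ch
    return None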
Other Considerations
Handling NaNs with list comprehension methods: note that this method (and the next) will only work as long as your data does not have NaNs. When handling NaNs, you will have to determine the indices of non-null values and replace those only. Try something like this:
import numpy as np

df = pd.DataFrame({'text': [
    'a..b?!??', np.nan, '%hgh&12', 'abc123!!!', '$$$1234', np.nan]})
idx = np.flatnonzero(df['text'].notna())
col_idx = df.columns.get_loc('text')
df.iloc[idx, col_idx] = [
    p.sub('', x) for x in df.iloc[idx, col_idx].tolist()]  # p compiled earlier
df
text
0 ab
1 NaN
2 hgh12
3 abc123
4 1234
5 NaN
Dealing with DataFrames: if you are dealing with DataFrames where every column requires replacement, the procedure is simple:
v = pd.Series(df.values.ravel())
df[:] = translate(v).values.reshape(df.shape)
Or,
v = df.stack()
v[:] = translate(v)
df = v.unstack()
Note that the translate function is defined below with the benchmarking code. (The benchmarked version operates on a DataFrame with a 'text' column; for the two snippets above you would need a variant that accepts a Series.)
Every solution has tradeoffs, so deciding what solution best fits your needs will depend on what you're willing to sacrifice. Two very common considerations are performance (which we've already seen) and memory usage. str.translate is a memory-hungry solution, so use it with caution.
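To see why, note that the join step materializes the whole column as one extra monster string before translating it (a minimal sketch, my addition):

import sys

strings = ['a..b?!??'] * 100_000
joined = '|'.join(strings)    # a full extra copy of the corpus, in one string
print(sys.getsizeof(joined))  # roughly the size of all the text, held at once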
Another consideration is the complexity of your regex. Sometimes, you may want to remove anything that is not alphanumeric or whitespace. Other times, you will need to retain certain characters, such as hyphens, colons, and sentence terminators [.!?]. Specifying these explicitly adds complexity to your regex, which may in turn impact the performance of these solutions. Make sure you test these solutions on your data before deciding what to use.
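For instance (my own example, not from the original answer), a pattern that strips punctuation while retaining hyphens, colons, and sentence terminators might look like:

import re

# Keep word characters, whitespace, and . ! ? : - ; remove everything else.
p = re.compile(r'[^\w\s.!?:-]+')
p.sub('', 'he@llo, wo#rld')           # -> 'hello world'
p.sub('', 'well-known fact: stop!')   # -> unchanged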
Lastly, unicode characters will be removed with this solution. You may want to tweak your regex (if using a regex-based solution), or just go with str.translate otherwise.
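To make the difference concrete (my own example): \w in Python 3 is unicode-aware, so accented letters survive either way, but the regex strips unicode punctuation such as the ellipsis, while the ASCII-only translation table leaves it untouched:

import re
import string

s = 'café… ¡hola!'
re.sub(r'[^\w\s]+', '', s)                                     # 'café hola'
s.translate(str.maketrans(dict.fromkeys(string.punctuation)))  # 'café… ¡hola'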
For even more performance (for larger N), take a look at this answer by Paul Panzer.
Appendix
Functions
import re
import string

def pd_replace(df):
    # regex=True needed on newer pandas, where the default became literal
    return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))

def re_sub(df):
    p = re.compile(r'[^\w\s]+')
    return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])

def translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))
    return df.assign(
        text='|'.join(df['text'].tolist()).translate(transtab).split('|')
    )

# MaxU's version (https://stackoverflow.com/a/50444659/4909087)
def pd_translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))
    return df.assign(text=df['text'].str.translate(transtab))
Performance Benchmarking Code
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['pd_replace', 're_sub', 'translate', 'pd_translate'],
columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
dtype=float
)
for f in res.index:
for c in res.columns:
        l = ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234'] * c
df = pd.DataFrame({'text' : l})
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=30)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Answer by Paul Panzer
Using numpy we can gain a healthy speedup over the best methods posted so far. The basic strategy is similar: make one big super string. But the processing seems much faster in numpy, presumably because we fully exploit the simplicity of the nothing-for-something replacement op.
For smaller (less than 0x110000 characters total) problems we automatically find a separator; for larger problems we use a slower method that does not rely on str.split.
Note that I have moved all precomputables out of the functions. Also note that translate and pd_translate get to know the only possible separator for the three largest problems for free, whereas np_multi_strat has to compute it or fall back to the separator-less strategy. And finally, note that for the last three data points I switch to a more "interesting" problem; pd_replace and re_sub had to be excluded there because they are not equivalent to the other methods.
On the algorithm:
The basic strategy is actually quite simple. There are only 0x110000 different unicode characters. As OP frames the challenge in terms of huge data sets, it is perfectly worthwhile making a lookup table that has True at the character ids that we want to keep and False at the ones that have to go (the punctuation in our example).
Such a lookup table can be used for bulk lookup via numpy's advanced indexing. As the lookup is fully vectorized and essentially amounts to dereferencing an array of pointers, it is much faster than, for example, dictionary lookup. Here we make use of numpy view casting, which allows us to reinterpret unicode characters as integers essentially for free.
Using the data array, which contains just one monster string reinterpreted as a sequence of numbers, to index into the lookup table results in a boolean mask. This mask can then be used to filter out the unwanted characters. With boolean indexing, this, too, is a single line of code.
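A stripped-down sketch of just these two steps on a single string (my own condensation of the full code below):

import numpy as np
import string

keep = np.ones(0x110000, dtype=bool)
keep[np.array([string.punctuation]).view(np.int32)] = False  # drop punctuation

s = 'a..b?!??'
codes = np.array([s]).view(np.int32)   # reinterpret UTF-32 code units as ints
mask = keep[codes]                     # vectorized table lookup -> boolean mask
kept = codes[mask]                     # boolean indexing filters the characters
print(kept.view(f'U{kept.size}').item(0))  # 'ab' (assumes some chars survive)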
So far so simple. The tricky bit is chopping the monster string back up into its parts. If we have a separator, i.e. one character that does not occur in the data or the punctuation list, then it is still easy: use this character to join and re-split. However, automatically finding a separator is challenging, and indeed accounts for half the lines of code in the implementation below.
Alternatively, we can keep the split points in a separate data structure, track how they move as a consequence of deleting unwanted characters, and then use them to slice the processed monster string. As chopping up into parts of uneven length is not numpy's strongest suit, this method is slower than str.split and is only used as a fallback when a separator would be too expensive to calculate, if it existed in the first place.
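A tiny sketch of that bookkeeping idea (my own illustration, not the full fallback below): np.add.reduceat counts how many characters of each original string survive the mask, which directly yields the new slice boundaries:

import numpy as np

lengths = np.array([3, 5, 2])   # lengths of the original strings
starts = np.concatenate(([0], lengths.cumsum()[:-1]))
keep = np.array([1,1,0, 1,0,1,1,1, 0,1], dtype=bool)  # mask over the joined text
survivors = np.add.reduceat(keep, starts)             # kept per string: [2, 4, 1]
bounds = np.concatenate(([0], survivors.cumsum()))    # new boundaries: [0, 2, 6, 7]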
Code (timing/plotting heavily based on @COLDSPEED's post):
import numpy as np
import pandas as pd
import string
import re
spct = np.array([string.punctuation]).view(np.int32)
lookup = np.zeros((0x110000,), dtype=bool)
lookup[spct] = True
invlookup = ~lookup
OSEP = spct[0]
SEP = chr(OSEP)
while SEP in string.punctuation:
OSEP = np.random.randint(0, 0x110000)
SEP = chr(OSEP)
def find_sep_2(letters):
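    # Mark every codepoint that occurs in the data as unusable, then take the
    # first codepoint that is neither punctuation nor present in the data;
    # return None if no such codepoint exists.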
letters = np.array([letters]).view(np.int32)
msk = invlookup.copy()
msk[letters] = False
sep = msk.argmax()
if not msk[sep]:
return None
return sep
def find_sep(letters, sep=0x88000):
letters = np.array([letters]).view(np.int32)
cmp = np.sign(sep-letters)
cmpf = np.sign(sep-spct)
if cmp.sum() + cmpf.sum() >= 1:
left, right, gs = sep+1, 0x110000, -1
else:
left, right, gs = 0, sep, 1
idx, = np.where(cmp == gs)
idxf, = np.where(cmpf == gs)
sep = (left + right) // 2
while True:
cmp = np.sign(sep-letters[idx])
cmpf = np.sign(sep-spct[idxf])
if cmp.all() and cmpf.all():
return sep
if cmp.sum() + cmpf.sum() >= (left & 1 == right & 1):
left, sep, gs = sep+1, (right + sep) // 2, -1
else:
right, sep, gs = sep, (left + sep) // 2, 1
idx = idx[cmp == gs]
idxf = idxf[cmpf == gs]
def np_multi_strat(df):
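    # Join everything into one monster string; if the preferred separator
    # already occurs in the data, either fall back to the separator-less
    # bookkeeping method (huge inputs) or search for a free separator.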
L = df['text'].tolist()
all_ = ''.join(L)
sep = 0x088000
if chr(sep) in all_: # very unlikely ...
if len(all_) >= 0x110000: # fall back to separator-less method
# (finding separator too expensive)
LL = np.array((0, *map(len, L)))
LLL = LL.cumsum()
all_ = np.array([all_]).view(np.int32)
pnct = invlookup[all_]
NL = np.add.reduceat(pnct, LLL[:-1])
NLL = np.concatenate([[0], NL.cumsum()]).tolist()
all_ = all_[pnct]
all_ = all_.view(f'U{all_.size}').item(0)
return df.assign(text=[all_[NLL[i]:NLL[i+1]]
for i in range(len(NLL)-1)])
elif len(all_) >= 0x22000: # use mask
sep = find_sep_2(all_)
else: # use bisection
sep = find_sep(all_)
all_ = np.array([chr(sep).join(L)]).view(np.int32)
pnct = invlookup[all_]
all_ = all_[pnct]
all_ = all_.view(f'U{all_.size}').item(0)
return df.assign(text=all_.split(chr(sep)))
def pd_replace(df):
    return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))
p = re.compile(r'[^\w\s]+')
def re_sub(df):
return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])
punct = string.punctuation.replace(SEP, '')
transtab = str.maketrans(dict.fromkeys(punct, ''))
def translate(df):
return df.assign(
text=SEP.join(df['text'].tolist()).translate(transtab).split(SEP)
)
# MaxU's version (https://stackoverflow.com/a/50444659/4909087)
def pd_translate(df):
return df.assign(text=df['text'].str.translate(transtab))
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['translate', 'pd_replace', 're_sub', 'pd_translate', 'np_multi_strat'],
columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000,
1000000],
dtype=float
)
for c in res.columns:
if c >= 100000: # stress test the separator finder
all_ = np.r_[:OSEP, OSEP+1:0x110000].repeat(c//10000)
np.random.shuffle(all_)
split = np.arange(c-1) + \
np.sort(np.random.randint(0, len(all_) - c + 2, (c-1,)))
l = [x.view(f'U{x.size}').item(0) for x in np.split(all_, split)]
else:
        l = ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234'] * c
df = pd.DataFrame({'text' : l})
for f in res.index:
if f == res.index[0]:
ref = globals()[f](df).text
elif not (ref == globals()[f](df).text).all():
res.at[f, c] = np.nan
print(f, 'disagrees at', c)
continue
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=16)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Answer by MaxU
Interestingly enough, the vectorized Series.str.translate method is still slightly slower compared to vanilla Python str.translate():
# transtab is the translation table built from string.punctuation (minus the
# separator), as in cs95's answer above
def pd_translate(df):
    return df.assign(text=df['text'].str.translate(transtab))