Fast punctuation removal with pandas

Disclaimer: this page is a translation of a popular StackOverflow Q&A, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not me): StackOverflow.

Original source: http://stackoverflow.com/questions/50444346/
Asked by cs95
This is a self-answered post. Below I outline a common problem in the NLP domain and propose a few performant methods to solve it.
Oftentimes the need arises to remove punctuation during text cleaning and pre-processing. Punctuation is defined as any character in string.punctuation:
>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
This is a common enough problem and has been asked ad nauseam before. The most idiomatic solution uses pandas' str.replace. However, for situations which involve a lot of text, a more performant solution may need to be considered.
What are some good, performant alternatives to str.replace when dealing with hundreds of thousands of records?
Answer by cs95
Setup
For the purpose of demonstration, let's consider this DataFrame.
import pandas as pd

df = pd.DataFrame({'text': ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234']})
df
text
0 a..b?!??
1 %hgh&12
2 abc123!!!
3 $$$1234
Below, I list the alternatives, one by one, in increasing order of performance.
str.replace
This option is included to establish the default method as a benchmark for comparing other, more performant solutions.
This uses pandas' built-in str.replace function, which performs regex-based replacement. (On newer pandas, 2.0 and up, pass regex=True explicitly, since the default changed to literal string replacement.)
df['text'] = df['text'].str.replace(r'[^\w\s]+', '', regex=True)
df
text
0 ab
1 hgh12
2 abc123
3 1234
This is very easy to code, and is quite readable, but slow.
regex.sub
This involves using the sub function from the re library. Pre-compile a regex pattern for performance, and call regex.sub inside a list comprehension. Convert df['text'] to a list beforehand if you can spare some memory; you'll get a nice little performance boost out of this.
import re
p = re.compile(r'[^\w\s]+')
df['text'] = [p.sub('', x) for x in df['text'].tolist()]
df
text
0 ab
1 hgh12
2 abc123
3 1234
Note: If your data has NaN values, this (as well as the next method below) will not work as is. See the section on "Other Considerations".
str.translate
Python's str.translate function is implemented in C, and is therefore very fast.
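As a quick illustration of the building block on its own (my example, not from the original answer), str.maketrans maps the given characters to None, and translate drops them:

>>> 'a..b?!??'.translate(str.maketrans('', '', '.?!'))
'ab'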
How this works is:
- First, join all your strings together to form one huge string using a single (or more) character separator that you choose. You must use a character/substring that you can guarantee will not belong inside your data.
- Perform str.translate on the huge string, removing punctuation (with the separator from step 1 excluded from the translation table).
- Split the string on the separator that was used to join in step 1. The resultant list must have the same length as your initial column.
Here, in this example, we consider the pipe separator |. If your data contains the pipe, then you must choose another separator.
import string
punct = '!"#$%&\'()*+,-./:;<=>?@[\]^_`{}~' # `|` is not present here
transtab = str.maketrans(dict.fromkeys(punct, ''))
df['text'] = '|'.join(df['text'].tolist()).translate(transtab).split('|')
df
text
0 ab
1 hgh12
2 abc123
3 1234
Performance
str.translate performs the best, by far. Note that the graph below (not reproduced here; see the original answer) includes another variant, Series.str.translate, from MaxU's answer.
(Interestingly, I reran this a second time, and the results were slightly different from before. During the second run, it seems re.sub was winning out over str.translate for really small amounts of data.)
There is an inherent risk involved with using translate (particularly, the problem of automating the process of deciding which separator to use is non-trivial), but the trade-offs are worth the risk.
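As a rough sketch of what such automation could look like (my own addition, with a hypothetical helper name, not part of the original answer):

def find_free_separator(strings, candidates='|\x00\x01\x02'):
    # Hypothetical helper: return the first candidate character that does
    # not occur anywhere in the data, or None if every candidate appears.
    joined = ''.join(strings)
    for ch in candidates:
        if ch not in joined:
            return ch
    return None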
Other Considerations
Handling NaNs with list comprehension methods: note that this method (and the next) will only work as long as your data does not have NaNs. When handling NaNs, you will have to determine the indices of non-null values and replace those only. Try something like this:
import numpy as np

df = pd.DataFrame({'text': [
    'a..b?!??', np.nan, '%hgh&12', 'abc123!!!', '$$$1234', np.nan]})
idx = np.flatnonzero(df['text'].notna())
col_idx = df.columns.get_loc('text')
df.iloc[idx, col_idx] = [
    p.sub('', x) for x in df.iloc[idx, col_idx].tolist()]  # p compiled earlier
df
text
0 ab
1 NaN
2 hgh12
3 abc123
4 1234
5 NaN
Dealing with DataFrames: if you are dealing with DataFrames where every column requires replacement, the procedure is simple:
v = pd.Series(df.values.ravel())
df[:] = translate(v).values.reshape(df.shape)
Or,
v = df.stack()
v[:] = translate(v)
df = v.unstack()
Note that the translate function is defined below with the benchmarking code. (The benchmarked version operates on a DataFrame with a 'text' column; for the two snippets above you would need a variant that accepts a Series.)
Every solution has tradeoffs, so deciding what solution best fits your needs will depend on what you're willing to sacrifice. Two very common considerations are performance (which we've already seen) and memory usage. str.translate is a memory-hungry solution, so use it with caution.
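To see why, note that the join step materializes the whole column as one extra monster string before translating it (a minimal sketch, my addition):

import sys

strings = ['a..b?!??'] * 100_000
joined = '|'.join(strings)    # a full extra copy of the corpus, in one string
print(sys.getsizeof(joined))  # roughly the size of all the text, held at once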
Another consideration is the complexity of your regex. Sometimes, you may want to remove anything that is not alphanumeric or whitespace. Other times, you will need to retain certain characters, such as hyphens, colons, and sentence terminators [.!?]. Specifying these explicitly adds complexity to your regex, which may in turn impact the performance of these solutions. Make sure you test these solutions on your data before deciding what to use.
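For instance (my own example, not from the original answer), a pattern that strips punctuation while retaining hyphens, colons, and sentence terminators might look like:

import re

# Keep word characters, whitespace, and . ! ? : - ; remove everything else.
p = re.compile(r'[^\w\s.!?:-]+')
p.sub('', 'he@llo, wo#rld')           # -> 'hello world'
p.sub('', 'well-known fact: stop!')   # -> unchanged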
Lastly, unicode characters will be removed with this solution. You may want to tweak your regex (if using a regex-based solution), or just go with str.translate otherwise.
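To make the difference concrete (my own example): \w in Python 3 is unicode-aware, so accented letters survive either way, but the regex strips unicode punctuation such as the ellipsis, while the ASCII-only translation table leaves it untouched:

import re
import string

s = 'café… ¡hola!'
re.sub(r'[^\w\s]+', '', s)                                     # 'café hola'
s.translate(str.maketrans(dict.fromkeys(string.punctuation)))  # 'café… ¡hola'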
For even more performance (for larger N), take a look at this answer by Paul Panzer.
Appendix
Functions
import re
import string

def pd_replace(df):
    # regex=True needed on newer pandas, where the default became literal
    return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))

def re_sub(df):
    p = re.compile(r'[^\w\s]+')
    return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])

def translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))
    return df.assign(
        text='|'.join(df['text'].tolist()).translate(transtab).split('|')
    )

# MaxU's version (https://stackoverflow.com/a/50444659/4909087)
def pd_translate(df):
    punct = string.punctuation.replace('|', '')
    transtab = str.maketrans(dict.fromkeys(punct, ''))
    return df.assign(text=df['text'].str.translate(transtab))
Performance Benchmarking Code
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['pd_replace', 're_sub', 'translate', 'pd_translate'],
columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000],
dtype=float
)
for f in res.index:
for c in res.columns:
        l = ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234'] * c
df = pd.DataFrame({'text' : l})
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=30)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Answer by Paul Panzer
Using numpy we can gain a healthy speedup over the best methods posted so far. The basic strategy is similar: make one big super string. But the processing seems much faster in numpy, presumably because we fully exploit the simplicity of the nothing-for-something replacement op.
For smaller (less than 0x110000 characters total) problems we automatically find a separator; for larger problems we use a slower method that does not rely on str.split.
Note that I have moved all precomputables out of the functions. Also note that translate and pd_translate get to know the only possible separator for the three largest problems for free, whereas np_multi_strat has to compute it or fall back to the separator-less strategy. And finally, note that for the last three data points I switch to a more "interesting" problem; pd_replace and re_sub had to be excluded there because they are not equivalent to the other methods.
On the algorithm:
The basic strategy is actually quite simple. There are only 0x110000 different unicode characters. As OP frames the challenge in terms of huge data sets, it is perfectly worthwhile making a lookup table that has True at the character ids that we want to keep and False at the ones that have to go (the punctuation in our example).
Such a lookup table can be used for bulk lookup via numpy's advanced indexing. As the lookup is fully vectorized and essentially amounts to dereferencing an array of pointers, it is much faster than, for example, dictionary lookup. Here we make use of numpy view casting, which allows us to reinterpret unicode characters as integers essentially for free.
Using the data array, which contains just one monster string reinterpreted as a sequence of numbers, to index into the lookup table results in a boolean mask. This mask can then be used to filter out the unwanted characters. With boolean indexing, this, too, is a single line of code.
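A stripped-down sketch of just these two steps on a single string (my own condensation of the full code below):

import numpy as np
import string

keep = np.ones(0x110000, dtype=bool)
keep[np.array([string.punctuation]).view(np.int32)] = False  # drop punctuation

s = 'a..b?!??'
codes = np.array([s]).view(np.int32)   # reinterpret UTF-32 code units as ints
mask = keep[codes]                     # vectorized table lookup -> boolean mask
kept = codes[mask]                     # boolean indexing filters the characters
print(kept.view(f'U{kept.size}').item(0))  # 'ab' (assumes some chars survive)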
So far so simple. The tricky bit is chopping the monster string back up into its parts. If we have a separator, i.e. one character that does not occur in the data or the punctuation list, then it is still easy: use this character to join and re-split. However, automatically finding a separator is challenging, and indeed accounts for half the lines of code in the implementation below.
Alternatively, we can keep the split points in a separate data structure, track how they move as a consequence of deleting unwanted characters, and then use them to slice the processed monster string. As chopping up into parts of uneven length is not numpy's strongest suit, this method is slower than str.split and is only used as a fallback when a separator would be too expensive to calculate, if it existed in the first place.
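A tiny sketch of that bookkeeping idea (my own illustration, not the full fallback below): np.add.reduceat counts how many characters of each original string survive the mask, which directly yields the new slice boundaries:

import numpy as np

lengths = np.array([3, 5, 2])   # lengths of the original strings
starts = np.concatenate(([0], lengths.cumsum()[:-1]))
keep = np.array([1,1,0, 1,0,1,1,1, 0,1], dtype=bool)  # mask over the joined text
survivors = np.add.reduceat(keep, starts)             # kept per string: [2, 4, 1]
bounds = np.concatenate(([0], survivors.cumsum()))    # new boundaries: [0, 2, 6, 7]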
Code (timing/plotting heavily based on @COLDSPEED's post):
import numpy as np
import pandas as pd
import string
import re
spct = np.array([string.punctuation]).view(np.int32)
lookup = np.zeros((0x110000,), dtype=bool)
lookup[spct] = True
invlookup = ~lookup
OSEP = spct[0]
SEP = chr(OSEP)
while SEP in string.punctuation:
OSEP = np.random.randint(0, 0x110000)
SEP = chr(OSEP)
def find_sep_2(letters):
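    # Mark every codepoint that occurs in the data as unusable, then take the
    # first codepoint that is neither punctuation nor present in the data;
    # return None if no such codepoint exists.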
letters = np.array([letters]).view(np.int32)
msk = invlookup.copy()
msk[letters] = False
sep = msk.argmax()
if not msk[sep]:
return None
return sep
def find_sep(letters, sep=0x88000):
letters = np.array([letters]).view(np.int32)
cmp = np.sign(sep-letters)
cmpf = np.sign(sep-spct)
if cmp.sum() + cmpf.sum() >= 1:
left, right, gs = sep+1, 0x110000, -1
else:
left, right, gs = 0, sep, 1
idx, = np.where(cmp == gs)
idxf, = np.where(cmpf == gs)
sep = (left + right) // 2
while True:
cmp = np.sign(sep-letters[idx])
cmpf = np.sign(sep-spct[idxf])
if cmp.all() and cmpf.all():
return sep
if cmp.sum() + cmpf.sum() >= (left & 1 == right & 1):
left, sep, gs = sep+1, (right + sep) // 2, -1
else:
right, sep, gs = sep, (left + sep) // 2, 1
idx = idx[cmp == gs]
idxf = idxf[cmpf == gs]
def np_multi_strat(df):
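    # Join everything into one monster string; if the preferred separator
    # already occurs in the data, either fall back to the separator-less
    # bookkeeping method (huge inputs) or search for a free separator.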
L = df['text'].tolist()
all_ = ''.join(L)
sep = 0x088000
if chr(sep) in all_: # very unlikely ...
if len(all_) >= 0x110000: # fall back to separator-less method
# (finding separator too expensive)
LL = np.array((0, *map(len, L)))
LLL = LL.cumsum()
all_ = np.array([all_]).view(np.int32)
pnct = invlookup[all_]
NL = np.add.reduceat(pnct, LLL[:-1])
NLL = np.concatenate([[0], NL.cumsum()]).tolist()
all_ = all_[pnct]
all_ = all_.view(f'U{all_.size}').item(0)
return df.assign(text=[all_[NLL[i]:NLL[i+1]]
for i in range(len(NLL)-1)])
elif len(all_) >= 0x22000: # use mask
sep = find_sep_2(all_)
else: # use bisection
sep = find_sep(all_)
all_ = np.array([chr(sep).join(L)]).view(np.int32)
pnct = invlookup[all_]
all_ = all_[pnct]
all_ = all_.view(f'U{all_.size}').item(0)
return df.assign(text=all_.split(chr(sep)))
def pd_replace(df):
    return df.assign(text=df['text'].str.replace(r'[^\w\s]+', '', regex=True))
p = re.compile(r'[^\w\s]+')
def re_sub(df):
return df.assign(text=[p.sub('', x) for x in df['text'].tolist()])
punct = string.punctuation.replace(SEP, '')
transtab = str.maketrans(dict.fromkeys(punct, ''))
def translate(df):
return df.assign(
text=SEP.join(df['text'].tolist()).translate(transtab).split(SEP)
)
# MaxU's version (https://stackoverflow.com/a/50444659/4909087)
def pd_translate(df):
return df.assign(text=df['text'].str.translate(transtab))
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
index=['translate', 'pd_replace', 're_sub', 'pd_translate', 'np_multi_strat'],
columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000,
1000000],
dtype=float
)
for c in res.columns:
if c >= 100000: # stress test the separator finder
all_ = np.r_[:OSEP, OSEP+1:0x110000].repeat(c//10000)
np.random.shuffle(all_)
split = np.arange(c-1) + \
np.sort(np.random.randint(0, len(all_) - c + 2, (c-1,)))
l = [x.view(f'U{x.size}').item(0) for x in np.split(all_, split)]
else:
        l = ['a..b?!??', '%hgh&12', 'abc123!!!', '$$$1234'] * c
df = pd.DataFrame({'text' : l})
for f in res.index:
if f == res.index[0]:
ref = globals()[f](df).text
elif not (ref == globals()[f](df).text).all():
res.at[f, c] = np.nan
print(f, 'disagrees at', c)
continue
stmt = '{}(df)'.format(f)
setp = 'from __main__ import df, {}'.format(f)
res.at[f, c] = timeit(stmt, setp, number=16)
ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N");
ax.set_ylabel("time (relative)");
plt.show()
Answer by MaxU
Interestingly enough, the vectorized Series.str.translate method is still slightly slower compared to vanilla Python str.translate():
# transtab is the translation table built from string.punctuation (minus the
# separator), as in cs95's answer above
def pd_translate(df):
    return df.assign(text=df['text'].str.translate(transtab))