Replacing punctuation in a data frame based on punctuation list
This page reproduces a popular StackOverflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/21672514/
Asked by BernardL
Using Canopy and Pandas, I have a data frame a, which is defined by:
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]
test.txt is a single-column file containing a list of strings made up of text, digits, and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation  # import punctuation list from python itself
a=pd.read_csv('text.txt')
df=pd.DataFrame(a)
df.columns=["test"]  # define the dataframe
for p in list(punctuation):
    df2=df.med.str.replace(p,'')
    df2=pd.DataFrame(df2);
    df2
The command above basically just returns the same data set. I would appreciate any leads.
Edit: The reason I am using Pandas is that the data is huge, about 1M rows, and the code will later be applied to lists of up to 30M rows. Long story short, I need to clean data very efficiently for big data sets.
Accepted answer by EdChum
Using replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries
[4 rows x 1 columns]
Use a regex pattern that matches anything that is not alphanumeric or whitespace:
In [49]:
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries
[4 rows x 1 columns]
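On recent pandas releases the regex argument of Series.str.replace has to be passed explicitly for a pattern to be treated as a regular expression (the default became regex=False in pandas 2.0). Below is a minimal sketch of the same idea, assuming a column named text. The "snake_case" row is a made-up addition to show that \w also matches the underscore, so "_" survives the first pattern and needs to be listed separately if it should be removed:

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical row containing an underscore.
df = pd.DataFrame({"text": ["test", "%hgh&12", "abc123!!!", "porkyfries", "snake_case"]})

# Remove everything that is neither a word character nor whitespace.
# regex=True is spelled out because newer pandas versions default to literal matching.
cleaned = df["text"].str.replace(r"[^\w\s]", "", regex=True)

# \w includes "_", so match it explicitly if underscores should be dropped too.
cleaned_no_underscore = df["text"].str.replace(r"[^\w\s]|_", "", regex=True)

print(cleaned.tolist())
print(cleaned_no_underscore.tolist())
```

Whether underscores count as punctuation depends on the data, so the second pattern is optional.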
Answer by Aakash Saxena
To remove punctuation from a text column in your dataframe:
In:
import re
import string
import pandas as pd
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick % 
In:
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df
You can also replace the matched punctuation with a character of your choice instead of removing it, e.g. replace(pattern, '$').
Out:
        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick  
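Pasting string.punctuation directly between square brackets works for the rows above, but it relies on how the characters happen to be ordered: the backslash, for instance, ends up escaping the "]" that follows it and is not itself part of the set. A slightly more defensive sketch, which is a variation and not this answer's own code, escapes the list with re.escape first. The sample rows are made up for illustration:

```python
import re
import string

import pandas as pd

# Escape every punctuation character so none is read as regex syntax.
pattern = "[{}]".format(re.escape(string.punctuation))

# Hypothetical rows, including a backslash and a hyphen/underscore mix.
df = pd.DataFrame({"text": ["book...regh", '"rope"', "rick % ", "back\\slash", "a-b_c"]})

# regex=True makes pandas treat the pattern as a regular expression.
df["text"] = df["text"].str.replace(pattern, "", regex=True)

print(df["text"].tolist())
```

Unlike the \w-based pattern, this one removes underscores as well, because "_" is part of string.punctuation.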
Answer by philshem
translate is often considered the cleanest and fastest way to remove punctuation (source):
import string
# Python 2 form of str.translate: removes all punctuation except the double quote
text = text.translate(None, string.punctuation.translate(None, '"'))
You may find that it works better to remove the punctuation from 'a' before loading it into pandas.
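The two-argument translate(None, ...) call is the Python 2 form. On Python 3 the same idea goes through a translation table built with str.maketrans, and pandas exposes it per column as Series.str.translate. Below is a minimal sketch, assuming Python 3 and a column named text. The sample string is illustrative; like the snippet in this answer, the table leaves the double quote in place:

```python
import string

import pandas as pd

# Punctuation to drop: everything except the double quote.
to_remove = string.punctuation.replace('"', "")

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", to_remove)

# Plain Python string.
text = 'abc123!!! "quoted"'
print(text.translate(table))

# Whole pandas column, using the rows from the question.
df = pd.DataFrame({"text": ["%hgh&12", "abc123!!!", "porkyfries"]})
df["text"] = df["text"].str.translate(table)
print(df["text"].tolist())
```

To delete the double quote as well, pass string.punctuation itself as the third argument to str.maketrans.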

