pandas: replacing punctuation in a data frame based on a punctuation list

Question

Asked by BernardL

Using Canopy and Pandas, I have data frame a which is defined by:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

test.txt is a single-column file containing a list of strings, each made up of text, digits and punctuation.

Assuming df looks like:

test
%hgh&12
abc123!!!
porkyfries

I want my results to be:

test
hgh12
abc123
porkyfries

Effort so far:

from string import punctuation  # import the punctuation list from Python itself

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]  # define the dataframe


for p in list(punctuation):
    df2=df["test"].str.replace(p,'')
    df2=pd.DataFrame(df2)
    df2

The loop above basically just gives me back the same data set. I would appreciate any leads.

Edit: The reason I am using Pandas is that the data is huge, about 1M rows, and the same code will later be applied to lists of up to 30M rows. Long story short, I need to clean large data sets very efficiently.

Answer 1

Accepted answer by EdChum

Using str.replace with the correct regex would be easier:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

Use a regex whose pattern matches anything that is not alphanumeric or whitespace:

In [49]:

df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
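
One detail worth checking when using this pattern: \w also matches the underscore, so "_" survives the replacement. Here is a minimal sketch of both variants, assuming a made-up sample value "porky_fries" that is not in the original data. regex=True is passed explicitly because the default for that parameter has changed between pandas versions.

```python
import pandas as pd

# "porky_fries" is a hypothetical value added to show the underscore case
df = pd.DataFrame({"text": ["test", "%hgh&12", "abc123!!!", "porky_fries"]})

# Remove everything that is not a word character or whitespace
no_punct = df["text"].str.replace(r"[^\w\s]", "", regex=True)

# \w matches "_", so name it explicitly if underscores should be removed too
no_punct_or_underscore = df["text"].str.replace(r"[^\w\s]|_", "", regex=True)

print(no_punct.tolist())
print(no_punct_or_underscore.tolist())
```

The first result keeps "porky_fries" intact, while the second turns it into "porkyfries".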

Answer 2

Answer by Aakash Saxena

To remove punctuation from a text column in your dataframe:

In:

import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)

pattern

Out:

'[!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]'

In:

df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df

Out:

        text
0  book...regh
1      book...
2         boo,
3       book. 
4       ball, 
5   ballnroll"
6       "rope"
7      rick %

In:

df['text'] = df['text'].str.replace(pattern, '', regex=True)
df

You can also replace the matched punctuation with a character of your choice instead of removing it, e.g. replace(pattern, '$').

Out:

        text
0   bookregh
1       book
2        boo
3      book 
4      ball 
5  ballnroll
6       rope
7     rick
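
A caveat about building the pattern with a plain format call: inside [...] the sequence \] is read as an escaped bracket, so the backslash in string.punctuation never becomes part of the character class and backslashes in the data are left alone. A hedged sketch of one way around this is to pass the list through re.escape first. The names raw_pattern and safe_pattern and the sample value with a backslash are assumptions made for illustration.

```python
import re
import string

import pandas as pd

# Unescaped: "\]" is parsed as an escaped "]", so "\" itself is not in the class
raw_pattern = "[{}]".format(string.punctuation)
# re.escape escapes each special character, so every mark is matched literally
safe_pattern = "[{}]".format(re.escape(string.punctuation))

# "back\\slash" is a hypothetical row containing a single backslash
df = pd.DataFrame({"text": ["book...regh", '"rope"', "rick % ", "back\\slash"]})

raw_cleaned = df["text"].str.replace(raw_pattern, "", regex=True)
# .str.strip() also drops the whitespace left behind once the punctuation is gone
safe_cleaned = df["text"].str.replace(safe_pattern, "", regex=True).str.strip()

print(raw_cleaned.tolist())
print(safe_cleaned.tolist())
```

With the unescaped pattern the backslash row comes back unchanged. The escaped pattern removes the backslash as well.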

Answer 3

Answer by philshem

str.translate is often considered the cleanest and fastest way to remove punctuation (source)

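import string
# Python 3 form of the original Python 2 call text.translate(None, ...):
# delete every punctuation character except the double quote
table = str.maketrans('', '', string.punctuation.replace('"', ''))
text = text.translate(table)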

You may find that it works better to remove punctuation in 'a' before loading it into pandas.
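
If the data is already in a DataFrame, the same kind of translation table can be applied to a column through Series.str.translate. This is a minimal Python 3 sketch, assuming the sample values from the question. Unlike the snippet above, it removes every punctuation character, including the double quote. No speed measurement is implied, so it is worth timing against the regex approach on your own data.

```python
import string

import pandas as pd

df = pd.DataFrame({"text": ["%hgh&12", "abc123!!!", "porkyfries"]})

# Map every punctuation character to None, which deletes it
table = str.maketrans("", "", string.punctuation)

# Apply the table to each string in the column
df["text"] = df["text"].str.translate(table)

print(df["text"].tolist())
```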
