Python: remove Unicode from text in pandas

Question

Asked by user2476665

For one string, the code below removes Unicode characters and newlines/carriage returns:

t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"

t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
import sys
sys.stdout.write(t2.strip('\n\r'))

But when I try to write a function in pandas to apply this to every cell of a column, it either fails with an attribute error or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame.

def clean_text(row):
    row= row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row

Applied to my DataFrame:

df["text"] = df.apply(clean_text, axis=1)

How can I apply this code to each element of a Series?

Answer 1

Answered by maxymoo

I actually can't reproduce your error: the following code runs for me without an error or warning.

df = pd.DataFrame([t,t,t],columns = ['text'])
df["text"] = df.apply(clean_text, axis=1)

If it helps, I think a more "pandas" way to approach this type of problem might be to use a regex with one of the Series.str methods, which are available on a DataFrame column, for example:

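# regex=True is needed in recent pandas, where str.replace treats the pattern as a literal string by default
df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)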
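
A self-contained sketch of the regex approach may make it easier to try out. The sample data and the `lang` column are made up for illustration. The `.copy()` is there because one common source of the "copy of a slice" warning is assigning into a frame that was obtained by filtering another one; whether that was the cause in the question is only an assumption.

```python
import pandas as pd

# Hypothetical sample data; the "lang" column exists only for this illustration.
raw = pd.DataFrame({
    "lang": ["en", "fr", "en"],
    "text": ["caf\xe9 \n", "na\xefve", "plain"],
})

# Taking an explicit copy makes the filtered frame an independent object,
# so assigning a column to it does not write into a slice of `raw`.
df = raw[raw["lang"] == "en"].copy()

# Drop every character outside the ASCII range, then trim surrounding whitespace.
df["text"] = (
    df["text"]
    .str.replace(r"[^\x00-\x7F]", "", regex=True)
    .str.strip()
)
```

The original `raw` frame is left untouched; only the copied subset is cleaned.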

Answer 2

Answered by Alexander

Something like this, where column_to_convert is the column you'd like to convert:

series = df['column_to_convert']
df["text"] =  [s.encode('ascii', 'ignore').strip()
               for s in series.str.decode('unicode_escape')]
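
The snippet above relies on Python 2 byte strings. Under Python 3, where the cells are already `str`, a roughly equivalent vectorized version might look like the sketch below. It assumes the column holds ordinary text strings, and the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the real column.
df = pd.DataFrame({
    "column_to_convert": [" caf\xe9 \n", "\x89\xdb\xcfquoted\x89\xdb\x9d", "plain"],
})

df["text"] = (
    df["column_to_convert"]
    .str.encode("ascii", errors="ignore")  # str -> bytes, non-ASCII characters dropped
    .str.decode("ascii")                   # bytes -> str
    .str.strip()                           # trim surrounding whitespace and newlines
)
```

Because every step goes through the `.str` accessor, no explicit Python loop over the cells is needed.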

Answer 3

Answered by Anzel

The problem seems to be that you are trying to access and alter row['text'] and then return the row itself inside the applied function. When you call apply on a DataFrame, the function is applied to each Series, so changing it to this should help:

import pandas as pd

df = pd.DataFrame([t for _ in range(5)], columns=['text'])

df 
                                                text
0  We've??????been invited to attend TEDxTeen, an ind...
1  We've??????been invited to attend TEDxTeen, an ind...
2  We've??????been invited to attend TEDxTeen, an ind...
3  We've??????been invited to attend TEDxTeen, an ind...
4  We've??????been invited to attend TEDxTeen, an ind...

def clean_text(row):
    # return the list of decoded cell in the Series instead 
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]

df['text'] = df.apply(clean_text)

df
                                                text
0  We'vebeen invited to attend TEDxTeen, an indep...
1  We'vebeen invited to attend TEDxTeen, an indep...
2  We'vebeen invited to attend TEDxTeen, an indep...
3  We'vebeen invited to attend TEDxTeen, an indep...
4  We'vebeen invited to attend TEDxTeen, an indep...
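
To make the point about what `apply` passes to the function more concrete, here is a small sketch with made-up data and column names. With the default `axis=0` the function receives each column as a Series; with `axis=1` it receives each row as a Series indexed by column name.

```python
import pandas as pd

# Hypothetical two-column frame, only to show what apply() hands to the function.
df = pd.DataFrame({"text": ["a", "b"], "other": ["c", "d"]})

# Default axis=0: called once per column; `col` is that whole column.
per_column = df.apply(lambda col: col.name)

# axis=1: called once per row; `row` is indexed by the column names.
per_row = df.apply(lambda row: row["text"] + row["other"], axis=1)

print(per_column.tolist())
print(per_row.tolist())
```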

Alternatively, you might use a lambda as below and apply it directly to only the text column:

df['text'] = df['text'].apply(lambda x: x.decode('unicode_escape').\
                                          encode('ascii', 'ignore').\
                                          strip())
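
Note that calling `.decode` on a plain string only works in Python 2. For Python 3, a per-element helper along the following lines could be used instead. The function name `strip_non_ascii` is made up for this sketch, and it assumes each cell is a `str`.

```python
def strip_non_ascii(text):
    """Drop every non-ASCII character, then trim surrounding whitespace."""
    # encode(..., 'ignore') discards characters that ASCII cannot represent;
    # decode('ascii') turns the remaining bytes back into a str.
    return text.encode("ascii", "ignore").decode("ascii").strip()


sample = "We've\xe5\xcabeen invited to attend TEDxTeen.\r\n"
print(strip_non_ascii(sample))
```

The helper can then be passed to the column in place of the lambda, for example `df['text'] = df['text'].apply(strip_non_ascii)`.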
