Python: removing Unicode from text in pandas
Note: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original: http://stackoverflow.com/questions/30337402/
removing unicode from text in pandas
Asked by user2476665
For a single string, the code below removes Unicode (non-ASCII) characters and newlines/carriage returns:
t = "We've\xe5\xcabeen invited to attend TEDxTeen, an independently organized TED event focused on encouraging youth to find \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions to the complex issues we face every day.,"
t2 = t.decode('unicode_escape').encode('ascii', 'ignore').strip()
import sys
sys.stdout.write(t2.strip('\n\r'))
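The snippet above relies on Python 2, where a plain string has a .decode method. As a minimal sketch, assuming Python 3 (where str is already Unicode), the same cleanup might look like the following. The helper name strip_non_ascii and the shortened sample string are illustrative only, not part of the original question.

```python
# Python 3 sketch: str is already Unicode, so no initial .decode() step is needed.
t = "We've\xe5\xcabeen invited to attend \x89\xdb\xcfsimply irresistible\x89\xdb\x9d solutions.\r\n"


def strip_non_ascii(text):
    """Drop every non-ASCII character, then trim surrounding whitespace."""
    # encode(..., "ignore") discards characters that ASCII cannot represent;
    # decode("ascii") turns the resulting bytes back into a str.
    return text.encode("ascii", "ignore").decode("ascii").strip()


cleaned = strip_non_ascii(t)
print(cleaned)
```

Note that strip() with no arguments already removes leading and trailing newlines and carriage returns, so a separate strip('\n\r') is not required here.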
But when I try to write a function that applies this to every cell of a pandas column, it either fails with an AttributeError or I get a warning that a value is trying to be set on a copy of a slice from a DataFrame:
def clean_text(row):
    row = row["text"].decode('unicode_escape').encode('ascii', 'ignore')#.strip()
    import sys
    sys.stdout.write(row.strip('\n\r'))
    return row
Applied to my DataFrame:
df["text"] = df.apply(clean_text, axis=1)
How can I apply this code to each element of a Series?
Answered by maxymoo
I actually can't reproduce your error: the following code runs for me without an error or warning.
import pandas as pd
df = pd.DataFrame([t, t, t], columns=['text'])
df["text"] = df.apply(clean_text, axis=1)
If it helps, I think a more "pandas" way to approach this type of problem might be to use a regex with one of the Series.str methods, for example:
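# regex=True is required in pandas 2.0 and later, where str.replace treats the pattern as a literal by default
df["text"] = df["text"].str.replace(r'[^\x00-\x7F]', '', regex=True)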
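As a runnable illustration of the regex approach, here is a small sketch on made-up sample data (the three sample values are assumptions, not from the original post). It passes regex=True explicitly so the pattern is treated as a regular expression regardless of the pandas version's default, and it shows that missing values pass through untouched.

```python
import pandas as pd

# Hypothetical sample data: two strings containing non-ASCII characters and one missing value.
df = pd.DataFrame(
    {"text": ["We've\xe5\xcabeen invited", "caf\xe9 \x89\xdb\xcfquoted\x89\xdb\x9d", None]}
)

# Remove every character outside the ASCII range 0x00-0x7F, then trim whitespace.
df["text"] = df["text"].str.replace(r"[^\x00-\x7F]", "", regex=True).str.strip()

print(df["text"].tolist())
```

Because the .str accessor skips missing values, no explicit check for None or NaN is needed with this approach.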
Answered by Alexander
Something like this, where column_to_convert is the column you'd like to convert:
series = df['column_to_convert']
df["text"] = [s.encode('ascii', 'ignore').strip()
              for s in series.str.decode('unicode_escape')]
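The list comprehension above assumes the column holds byte strings, as in Python 2. A hedged Python 3 sketch of the same idea, assuming the column already holds str values, can stay entirely inside the .str accessor. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample column; 'column_to_convert' follows the name used in the answer.
df = pd.DataFrame({"column_to_convert": ["We've\xe5\xcabeen invited ", " caf\xe9"]})

series = df["column_to_convert"]
df["text"] = (
    series.str.encode("ascii", errors="ignore")  # str -> bytes, dropping non-ASCII characters
    .str.decode("ascii")                         # bytes -> str
    .str.strip()                                 # trim leading/trailing whitespace
)

print(df["text"].tolist())
```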
Answered by Anzel
The problem seems to be that you access and alter row['text'] and then return the row itself inside the applied function. When you call apply on a DataFrame, the function receives each Series in turn (each column by default), so changing it to the following should help:
import pandas as pd
df = pd.DataFrame([t for _ in range(5)], columns=['text'])
df
text
0 We've??????been invited to attend TEDxTeen, an ind...
1 We've??????been invited to attend TEDxTeen, an ind...
2 We've??????been invited to attend TEDxTeen, an ind...
3 We've??????been invited to attend TEDxTeen, an ind...
4 We've??????been invited to attend TEDxTeen, an ind...
def clean_text(row):
    # return the list of decoded cells in the Series instead
    return [r.decode('unicode_escape').encode('ascii', 'ignore') for r in row]
df['text'] = df.apply(clean_text)
df
text
0 We'vebeen invited to attend TEDxTeen, an indep...
1 We'vebeen invited to attend TEDxTeen, an indep...
2 We'vebeen invited to attend TEDxTeen, an indep...
3 We'vebeen invited to attend TEDxTeen, an indep...
4 We'vebeen invited to attend TEDxTeen, an indep...
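To make the point about what apply passes to the function more concrete, here is a small sketch. The helper record and the two-column sample frame are hypothetical; the helper only notes the type and name of each object it receives.

```python
import pandas as pd

# Hypothetical two-column frame, used only to inspect what apply hands to the function.
df = pd.DataFrame({"text": ["a", "b"], "n": [1, 2]})

seen = []


def record(obj):
    """Remember the type and name of whatever DataFrame.apply passes in."""
    seen.append((type(obj).__name__, obj.name))
    return 0


df.apply(record)  # default axis=0: the function receives one column at a time
column_calls = set(seen)

seen.clear()
df.apply(record, axis=1)  # axis=1: the function receives one row at a time
row_calls = set(seen)

print(column_calls)
print(row_calls)
```

In both cases the function receives a Series: with the default axis its name is the column label, and with axis=1 its name is the row's index label. This is why a function written for row["text"] needs axis=1, while a function that iterates over cells works with the default.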
Alternatively, you might use a lambda as below and apply it directly to the text column only:
df['text'] = df['text'].apply(
    lambda x: x.decode('unicode_escape').encode('ascii', 'ignore').strip()
)
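For Python 3, a comparable sketch uses Series.apply with a small named function instead of the lambda. The helper clean_cell and the sample data are assumptions for illustration. The isinstance guard is a design choice: it lets missing values or numbers pass through unchanged rather than raising an AttributeError.

```python
import pandas as pd


def clean_cell(value):
    """Strip non-ASCII characters and surrounding whitespace from one cell."""
    if not isinstance(value, str):
        # Leave missing values and other non-string cells untouched.
        return value
    return value.encode("ascii", "ignore").decode("ascii").strip()


# Hypothetical sample data, including a missing value.
df = pd.DataFrame({"text": ["We've\xe5\xcabeen invited\r\n", None, "plain"]})

# Series.apply calls clean_cell once per element of the column.
df["text"] = df["text"].apply(clean_cell)

print(df["text"].tolist())
```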