Original question: http://stackoverflow.com/questions/45596529/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me):
StackOverflow
Replacing special characters in pandas dataframe
Asked by Raphael Hernandes
So, I have this huge DF which is encoded in iso8859_15.
I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "?".
I have the mapping for replacing them in a dictionary: {'í':'i', 'á':'a', ...}
I tried replacing it a couple of ways (below), but none of them worked.
df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex and inplace
Also:
df.update(pd.Series(dic))
None of them had the expected output, which would be for strings such as "NíCOLAS" to become "NICOLAS".
Help?
Accepted answer by randomir
The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, for which you have to provide a second dictionary with substitution pairs.
So, this should work:
>>> import pandas as pd
>>> df = pd.DataFrame({'a': ['NíCOLAS', 'asdç'], 'b': [3, 4]})
>>> df
         a  b
0  NíCOLAS  3
1     asdç  4
>>> df.replace({'a': {'ç': 'c', 'í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4
Edit: It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:
dictionary = {u'í': 'i', u'á': 'a'}
If you have a definition like this (and using Python 2):
dictionary = {'í': 'i', 'á': 'a'}
then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the actual source file character encoding used, but presuming you use UTF-8, you'll get:
dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}
And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.
On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
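The byte sequences quoted in this answer can be checked from Python 3 by encoding explicitly. This is only a rough analogue of the Python 2 situation, sketched under the assumption that the file data is iso8859_15 and the source file is UTF-8; the sample bytes `b'N\xedCOLAS'` are made up for illustration.

```python
# Python 3: str holds Unicode code points, bytes holds encoded data.
key = 'í'                           # one code point, U+00ED
utf8_key = key.encode('utf-8')      # the bytes a Python 2 'í' literal would hold in a UTF-8 source file
print(len(key), utf8_key)           # 1 b'\xc3\xad'

# Illustrative raw bytes as they would sit in an iso8859_15 file.
raw = b'N\xedCOLAS'

# The UTF-8 byte key never occurs in the iso8859_15 data, so nothing is replaced.
unchanged = raw.replace(utf8_key, b'i')
print(unchanged == raw)             # True

# After decoding, the Unicode key matches the single character as expected.
cell = raw.decode('iso8859_15')
fixed = cell.replace(key, 'i')
print(fixed)                        # NiCOLAS
```

The point is that the key and the data have to be in the same representation (both decoded text) before any substitution can match.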
Answer by OverflowingTheGlass
replace works out of the box in Python 3, without specifying a specific column.
Load Data:
df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df
Result:
      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes
Create Dictionary:
dictionary = {'í':'i', 'á':'a'}
Replace:
df.replace(dictionary, regex=True, inplace=True)
Result:
      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
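For anyone who wants to try these steps without a CSV file on disk, here is a minimal self-contained sketch. The in-memory bytes stand in for test.csv and are an assumption made for illustration, as is the name `cleaned`; the dictionary and the replace call are the ones from this answer, except that the result is returned instead of being modified in place.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: the same rows, encoded as iso8859_15 bytes.
raw = 'col1,col2\nhe,hello\nNícolas,shárk\nwelcome,yes\n'.encode('iso8859_15')

# Decode with the matching encoding so the cells become proper Unicode strings.
df = pd.read_csv(io.BytesIO(raw), sep=',', encoding='iso8859_15')

dictionary = {'í': 'i', 'á': 'a'}

# regex=True treats each key as a pattern matched inside every string cell,
# so single characters within longer values are substituted.
cleaned = df.replace(dictionary, regex=True)
print(cleaned)
```

Returning a new frame rather than using inplace=True keeps the original data available for comparison.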
Answer by OverflowingTheGlass
If someone gets the following error message:
multiple repeat at position 2
try this:
df.replace(dictionary, regex=False, inplace=True)
instead of:
df.replace(dictionary, regex=True, inplace=True)
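One caveat worth checking on your own data: as far as I can tell, with regex=False a flat dictionary only replaces cells whose whole value equals a key, so characters inside longer strings are left alone. Regex errors of this kind usually come from keys that contain regex metacharacters such as + or *. A possible alternative, sketched below with made-up sample data and a hypothetical '++' key, is to escape the keys with re.escape and keep regex=True.

```python
import re

import pandas as pd

# Made-up sample data; the '++' key is hypothetical and is not a valid regex on its own.
df = pd.DataFrame({'col1': ['Nícolas', 'c++'], 'col2': ['shárk', 'yes']})
dictionary = {'í': 'i', 'á': 'a', '++': 'pp'}

# Escape every key so it is matched literally, then keep substring matching on.
escaped = {re.escape(k): v for k, v in dictionary.items()}
result = df.replace(escaped, regex=True)
print(result)

# For comparison: regex=False only replaces cells that equal a key exactly,
# so none of these values change.
whole = df.replace(dictionary, regex=False)
print(whole)
```

Escaping keeps the substring behaviour the question asked for, while regex=False is the simpler choice when entire cell values are being swapped.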