pandas: Replacing special characters in a pandas DataFrame

Question

Asked by Raphael Hernandes

So, I have this huge DF which is encoded in iso8859_15.

I have a few columns which contain names and places in Brazil, so some of them contain special characters such as "í" or "ç".

I have the key to replace them in a dictionary {'í':'i', 'á':'a', ...}

I tried replacing it a couple of ways (below), but none of them worked.

df.replace(dictionary, regex=True, inplace=True)  # tried both with and without regex and inplace

Also:

df.update(pd.Series(dic))

None of them had the expected output, which would be for strings such as "NíCOLAS" to become "NICOLAS".

Help?

Answer 1

Accepted answer by randomir

The docs on pandas.DataFrame.replace say you have to provide a nested dictionary: the first level is the column name, for which you have to provide a second dictionary with substitution pairs.

So, this should work:

>>> df = pd.DataFrame({'a': ['NíCOLAS', 'asdç'], 'b': [3, 4]})
>>> df
         a  b
0  NíCOLAS  3
1     asdç  4

>>> df.replace({'a': {'ç': 'c', 'í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit. It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the actual source file character encoding used, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
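The mismatch between a Unicode key and a byte-string key can be seen directly in Python 3. The following is only a small sketch; the variable names are made up for illustration and are not from the original post. It encodes the same character in two ways and checks which form actually matches.

```python
# Illustrative sketch (Python 3); all names here are hypothetical.
text = "NíCOLAS"            # a properly decoded Unicode string
unicode_key = "í"           # what a Python 3 literal (or u'í' in Python 2) holds

# What a plain 'í' literal would hold in Python 2 with a UTF-8 source file:
utf8_key = unicode_key.encode("utf-8")

# The raw bytes of the same text as stored in an iso8859_15 file:
latin9_raw = text.encode("iso8859_15")

found_unicode = unicode_key in text      # text key vs. decoded text
found_bytes = utf8_key in latin9_raw     # UTF-8 bytes vs. iso8859_15 bytes

print(utf8_key, latin9_raw, found_unicode, found_bytes)
```

The two-byte UTF-8 form of "í" never appears in the iso8859_15 bytes, which is consistent with the explanation above for why a byte-string dictionary fails to match.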

Answer 2

Answer by OverflowingTheGlass

replace works out of the box without specifying a specific column in Python 3.

Load Data:

df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes

Create Dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes

Answer 3

Answer by OverflowingTheGlass

If someone gets the following error message

multiple repeat at position 2

try this: df.replace(dictionary, regex=False, inplace=True)

instead of df.replace(dictionary, regex=True, inplace=True)
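One caveat worth checking: with regex=False, replace compares each key against the whole cell value, so a key like 'í' no longer replaces the character inside a longer string. Errors such as the one above usually come from keys that contain regex metacharacters (for example + or ?). A possible alternative, sketched here with made-up data and names, is to keep regex=True and escape the keys with re.escape first.

```python
import re

import pandas as pd

# Hypothetical data; "+" and "?" are regex metacharacters.
df = pd.DataFrame({"name": ["NíCOLAS", "a+b", "what??"]})
mapping = {"í": "I", "+": " plus ", "??": "?"}

# Escape each key so it is treated as literal text, not as a pattern.
escaped = {re.escape(key): value for key, value in mapping.items()}

# Substring replacement still works because regex=True is kept.
substring_result = df.replace(escaped, regex=True)

# For comparison: regex=False only matches cells equal to a key.
whole_value_result = df.replace(mapping, regex=False)

print(substring_result)
print(whole_value_result)
```

In this sketch the escaped version rewrites the characters inside each string, while the regex=False version leaves the column unchanged because no cell is exactly equal to one of the keys.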
