Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/213628/
How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?
Asked by Vebjorn Ljosa
I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?
The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes a u'\u201c'.
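(For illustration only, not part of the original question: a minimal sketch of the expected result, using PyUnicode_Decode from the Python 2 C API and the "windows-1252" codec name to decode the byte 0x93 straight into a unicode object.)

#include <Python.h>
#include <stdio.h>

int main(void)
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *u;

    Py_Initialize();
    /* Decode one byte of Windows-1252 data directly into a unicode object. */
    u = PyUnicode_Decode(c_string, 1, "windows-1252", "replace");
    if (u) {
        PyObject_Print(u, stdout, 0);  /* prints u'\u201c' */
        printf("\n");
        Py_DECREF(u);
    } else {
        PyErr_Print();
    }
    Py_Finalize();
    return 0;
}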
I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *py_string;

    Py_Initialize();
    py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
    if (!py_string) {
        PyErr_Print();
        return 1;
    }
    return 0;
}
The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
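(A small diagnostic sketch, added for illustration and not part of the original question: you can ask the interpreter which default encoding that implicit conversion falls back to.)

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();
    /* Reports the codec used for implicit unicode-to-str conversions;
       in a stock Python 2 build this is "ascii". */
    printf("default encoding: %s\n", PyUnicode_GetDefaultEncoding());
    Py_Finalize();
    return 0;
}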
The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *raw, *decoded;

    Py_Initialize();
    raw = PyString_FromString(c_string);
    printf("Undecoded: ");
    PyObject_Print(raw, stdout, 0);
    printf("\n");
    decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
    Py_DECREF(raw);
    printf("Decoded: ");
    PyObject_Print(decoded, stdout, 0);
    printf("\n");
    return 0;
}
Accepted answer by Tony Meyer
PyString_Decode does this:
PyObject *PyString_Decode(const char *s,
                          Py_ssize_t size,
                          const char *encoding,
                          const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}
IOW, it does basically what you're doing in your second example: it converts to a string, then decodes the string. The problem here arises from PyString_AsDecodedString, rather than PyString_AsDecodedObject. PyString_AsDecodedString does PyString_AsDecodedObject, but then tries to convert the resulting unicode object into a string object with the default encoding (for you, it looks like that's ASCII). That's where it fails.
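(For reference, a trimmed sketch of the step described above, adapted from the CPython 2.x sources; details may vary between versions.)

PyObject *PyString_AsDecodedString(PyObject *str,
                                   const char *encoding,
                                   const char *errors)
{
    PyObject *v;

    v = PyString_AsDecodedObject(str, encoding, errors);
    if (v == NULL)
        return NULL;
    /* Convert any resulting unicode object back to a str using the default
       encoding; this is the step that raises the UnicodeEncodeError. */
    if (PyUnicode_Check(v)) {
        PyObject *temp = v;
        v = PyUnicode_AsEncodedString(v, NULL, NULL);
        Py_DECREF(temp);
    }
    return v;
}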
I believe you'll need to do two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *py_string, *py_unicode;

    Py_Initialize();
    py_string = PyString_FromStringAndSize(c_string, 1);
    if (!py_string) {
        PyErr_Print();
        return 1;
    }
    py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
    Py_DECREF(py_string);
    return 0;
}
I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.
Answered by Dan Lenski
You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?
Just use PyString_FromString:
char *cstring;
PyObject *pystring = PyString_FromString(cstring);
That's all. Now you have a Python str() object. See docs here: https://docs.python.org/2/c-api/string.html
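(A caveat worth noting, not from the original answer: PyString_FromString stops at the first NUL byte, so if the C buffer may contain arbitrary bytes and its length is known, PyString_FromStringAndSize is the safer call. A minimal sketch:)

#include <Python.h>
#include <stdio.h>

int main(void)
{
    /* A buffer of known length that happens to contain a NUL byte. */
    char buf[] = { 'a', 0, (char)0x93 };
    PyObject *pystring;

    Py_Initialize();
    pystring = PyString_FromStringAndSize(buf, sizeof(buf));  /* keeps all 3 bytes */
    if (pystring) {
        PyObject_Print(pystring, stdout, 0);  /* prints 'a\x00\x93' */
        printf("\n");
        Py_DECREF(pystring);
    } else {
        PyErr_Print();
    }
    Py_Finalize();
    return 0;
}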
I'm a little bit confused about how to specify "str" or "unicode." They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_DecodeString is a good place to start.
Answered by Alex Coventry
Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.