Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Original question: http://stackoverflow.com/questions/213628/
How to convert a C string (char array) into a Python string when there are non-ASCII characters in the string?
Asked by Vebjorn Ljosa
I have embedded a Python interpreter in a C program. Suppose the C program reads some bytes from a file into a char array and learns (somehow) that the bytes represent text with a certain encoding (e.g., ISO 8859-1, Windows-1252, or UTF-8). How do I decode the contents of this char array into a Python string?
The Python string should in general be of type unicode; for instance, a 0x93 in Windows-1252 encoded input becomes a u'\u201c'.
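(For illustration only, not part of the original question: a minimal sketch of the expected result, using PyUnicode_Decode from the Python 2 C API and the "windows-1252" codec name to decode the byte 0x93 straight into a unicode object.)

#include <Python.h>
#include <stdio.h>

int main(void)
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *u;

    Py_Initialize();
    /* Decode one byte of Windows-1252 data directly into a unicode object. */
    u = PyUnicode_Decode(c_string, 1, "windows-1252", "replace");
    if (u) {
        PyObject_Print(u, stdout, 0);  /* prints u'\u201c' */
        printf("\n");
        Py_DECREF(u);
    } else {
        PyErr_Print();
    }
    Py_Finalize();
    return 0;
}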
I have attempted to use PyString_Decode, but it always fails when there are non-ASCII characters in the string. Here is an example that fails:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *py_string;

    Py_Initialize();
    py_string = PyString_Decode(c_string, 1, "windows_1252", "replace");
    if (!py_string) {
        PyErr_Print();
        return 1;
    }
    return 0;
}
The error message is UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128), which indicates that the ascii encoding is used even though we specify windows_1252 in the call to PyString_Decode.
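(A small diagnostic sketch, added for illustration and not part of the original question: you can ask the interpreter which default encoding that implicit conversion falls back to.)

#include <Python.h>
#include <stdio.h>

int main(void)
{
    Py_Initialize();
    /* Reports the codec used for implicit unicode-to-str conversions;
       in a stock Python 2 build this is "ascii". */
    printf("default encoding: %s\n", PyUnicode_GetDefaultEncoding());
    Py_Finalize();
    return 0;
}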
The following code works around the problem by using PyString_FromString to create a Python string of the undecoded bytes, then calling its decode method:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *raw, *decoded;

    Py_Initialize();
    raw = PyString_FromString(c_string);
    printf("Undecoded: ");
    PyObject_Print(raw, stdout, 0);
    printf("\n");
    decoded = PyObject_CallMethod(raw, "decode", "s", "windows_1252");
    Py_DECREF(raw);
    printf("Decoded: ");
    PyObject_Print(decoded, stdout, 0);
    printf("\n");
    return 0;
}
Accepted answer by Tony Meyer
PyString_Decode does this:
PyObject *PyString_Decode(const char *s,
                          Py_ssize_t size,
                          const char *encoding,
                          const char *errors)
{
    PyObject *v, *str;

    str = PyString_FromStringAndSize(s, size);
    if (str == NULL)
        return NULL;
    v = PyString_AsDecodedString(str, encoding, errors);
    Py_DECREF(str);
    return v;
}
IOW, it does basically what you're doing in your second example: it converts to a string, then decodes the string. The problem here arises from PyString_AsDecodedString, rather than PyString_AsDecodedObject. PyString_AsDecodedString does PyString_AsDecodedObject, but then tries to convert the resulting unicode object into a string object with the default encoding (for you, it looks like that's ASCII). That's where it fails.
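(For reference, a trimmed sketch of the step described above, adapted from the CPython 2.x sources; details may vary between versions.)

PyObject *PyString_AsDecodedString(PyObject *str,
                                   const char *encoding,
                                   const char *errors)
{
    PyObject *v;

    v = PyString_AsDecodedObject(str, encoding, errors);
    if (v == NULL)
        return NULL;
    /* Convert any resulting unicode object back to a str using the default
       encoding; this is the step that raises the UnicodeEncodeError. */
    if (PyUnicode_Check(v)) {
        PyObject *temp = v;
        v = PyUnicode_AsEncodedString(v, NULL, NULL);
        Py_DECREF(temp);
    }
    return v;
}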
I believe you'll need to do two calls, but you can use PyString_AsDecodedObject rather than calling the Python "decode" method. Something like:
#include <Python.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    char c_string[] = { (char)0x93, 0 };
    PyObject *py_string, *py_unicode;

    Py_Initialize();
    py_string = PyString_FromStringAndSize(c_string, 1);
    if (!py_string) {
        PyErr_Print();
        return 1;
    }
    py_unicode = PyString_AsDecodedObject(py_string, "windows_1252", "replace");
    Py_DECREF(py_string);
    return 0;
}
I'm not entirely sure what the reasoning behind PyString_Decode working this way is. A very old thread on python-dev seems to indicate that it has something to do with chaining the output, but since the Python methods don't do the same, I'm not sure if that's still relevant.
Answered by Dan Lenski
You don't want to decode the string into a Unicode representation, you just want to treat it as an array of bytes, right?
Just use PyString_FromString:
char *cstring;
PyObject *pystring = PyString_FromString(cstring);
That's all. Now you have a Python str() object. See docs here: https://docs.python.org/2/c-api/string.html
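(A caveat worth noting, not from the original answer: PyString_FromString stops at the first NUL byte, so if the C buffer may contain arbitrary bytes and its length is known, PyString_FromStringAndSize is the safer call. A minimal sketch:)

#include <Python.h>
#include <stdio.h>

int main(void)
{
    /* A buffer of known length that happens to contain a NUL byte. */
    char buf[] = { 'a', 0, (char)0x93 };
    PyObject *pystring;

    Py_Initialize();
    pystring = PyString_FromStringAndSize(buf, sizeof(buf));  /* keeps all 3 bytes */
    if (pystring) {
        PyObject_Print(pystring, stdout, 0);  /* prints 'a\x00\x93' */
        printf("\n");
        Py_DECREF(pystring);
    } else {
        PyErr_Print();
    }
    Py_Finalize();
    return 0;
}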
I'm a little bit confused about how to specify "str" or "unicode." They are quite different if you have non-ASCII characters. If you want to decode a C string and you know exactly what character set it's in, then yes, PyString_DecodeString is a good place to start.
Answered by Alex Coventry
Try calling PyErr_Print() in the "if (!py_string)" clause. Perhaps the Python exception will give you some more information.