Note: this page reproduces a popular Stack Overflow question under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/13717554/
Weird behaviour initializing a numpy array of string data
Asked by Jim
I am having some seemingly trivial trouble with numpy when the array contains string data. I have the following code:
import numpy
my_array = numpy.empty([1, 2], dtype=str)
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"
Now, when I print it with print my_array[0, :], the response I get is ['C', 'A'], which is clearly not the expected output of Cat and Apple. Why is that, and how can I get the right output?
Thanks!
Accepted answer by BrenBarn
Numpy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this by checking my_array.dtype; it will show "|S1" ("<U1" on Python 3), meaning "one-character string". Subsequent assignments into the array are truncated to fit this structure.
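To make the truncation concrete, here is a minimal sketch that reruns the question's code and inspects the dtype. It assumes Python 3 with numpy installed; the variable names `first` and `second` are only for illustration.

```python
import numpy

# dtype=str with no explicit width yields a one-character string dtype.
my_array = numpy.empty([1, 2], dtype=str)
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"

# Shows a one-character dtype: "|S1" on Python 2, "<U1" on Python 3.
print(my_array.dtype)

# Only the first character of each assigned string survives.
first, second = my_array[0, 0], my_array[0, 1]
print(first, second)
```

The assignment does not raise an error; the extra characters are silently dropped.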
You can pass an explicit datatype with your maximum length by doing, e.g.:
my_array = numpy.empty([1, 2], dtype="S10")
The "S10" will create an array of length-10 strings. You have to decide what length is big enough to hold all the data you want to store.
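If the strings are known before the array is created, one way to pick the width is to measure the longest one rather than guess. This is only a sketch: the `words` list and the `width` variable are hypothetical, and it assumes ASCII-only data, since an "S" dtype counts bytes rather than characters.

```python
import numpy

# Hypothetical sample data; substitute the strings you actually need to store.
words = ["Cat", "Apple"]

# Size the dtype from the longest string (ASCII assumed: 1 character = 1 byte).
width = max(len(word) for word in words)
my_array = numpy.empty([1, len(words)], dtype="S%d" % width)

for column, word in enumerate(words):
    my_array[0, column] = word

print(my_array.dtype)
print(my_array[0, :])
```

On Python 3 the stored elements are byte strings, so they compare equal to b"Cat" rather than "Cat".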
Answer by Johny White
I got a "codec error" when I tried to use a non-ASCII character with dtype="S10".
You also get an array with binary strings, which confused me.
I think it is better to use:
my_array = numpy.empty([1, 2], dtype="<U10")
Here "<U10" translates to "Unicode string of length 10; little endian format".
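The difference between the two dtypes can be seen side by side. The following is a rough sketch assuming Python 3; the names `bytes_array`, `unicode_array` and `encode_failed`, and the sample word, are made up for the illustration.

```python
import numpy

bytes_array = numpy.empty([1, 2], dtype="S10")
unicode_array = numpy.empty([1, 2], dtype="<U10")

# A Unicode dtype stores non-ASCII text directly.
unicode_array[0, 0] = "Käse"

# A byte-string dtype accepts ASCII text, but hands it back as bytes.
bytes_array[0, 0] = "Cat"

# Non-ASCII text cannot be encoded into the byte-string dtype.
try:
    bytes_array[0, 1] = "Käse"
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(repr(unicode_array[0, 0]))
print(repr(bytes_array[0, 0]))
print(encode_failed)
```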
Answer by spinup
The numpy string array is limited by its fixed length (length 1 by default). If you're unsure in advance what length you'll need for your strings, you can use dtype=object and get arbitrary-length strings for your data elements:
my_array = numpy.empty([1, 2], dtype=object)
I understand there may be efficiency drawbacks to this approach, but I don't have a good reference to support that.
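As a small illustration of the object-dtype approach, the sketch below stores two strings of very different lengths. The sample strings and the `lengths` variable are assumptions for the example.

```python
import numpy

# With dtype=object each element is a reference to a Python object,
# so nothing is truncated. Elements start out as None.
my_array = numpy.empty([1, 2], dtype=object)
my_array[0, 0] = "Cat"
my_array[0, 1] = "A considerably longer string than Apple"

print(my_array[0, :])

# Each element is an ordinary Python str with its full length intact.
lengths = [len(item) for item in my_array[0, :]]
print(lengths)
```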
Answer by Plamen
Another alternative is to initialize as follows:
my_array = numpy.array([["CAT", "APPLE"], ["", ""]], dtype=str)
In other words, first you write a regular Python list with what you want, then you turn it into a numpy array. However, this will fix your max string length to the length of the longest string at initialization. So if you were to assign
my_array[1,0] = 'PINEAPPLE'
then the string stored would be 'PINEA'.
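The truncation described here can be checked directly, and one possible workaround is to copy the array into a wider dtype with astype before assigning. This is a sketch assuming Python 3; `width_before`, `truncated` and `wider` are illustrative names, and the width of 9 is chosen only to fit 'PINEAPPLE'.

```python
import numpy

my_array = numpy.array([["CAT", "APPLE"], ["", ""]], dtype=str)

# Characters per element, inferred from the longest initial string.
width_before = my_array.dtype.itemsize // numpy.dtype("U1").itemsize

# The assignment is silently cut down to that width.
my_array[1, 0] = "PINEAPPLE"
truncated = my_array[1, 0]

# astype returns a copy with a wider dtype; the original array is unchanged.
wider = my_array.astype("U9")
wider[1, 0] = "PINEAPPLE"

print(width_before, truncated, wider[1, 0])
```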
Answer by KanDan
If you are filling the array in a for loop, what works best is to build the values with a list comprehension first, which lets the right amount of memory be allocated.
data = ['CAT', 'APPLE', 'CARROT']
my_array = [name for name in data]
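The snippet above stops at a plain Python list. Assuming the intent is to end up with a numpy array, one possible continuation is to pass the finished list to numpy.array, which infers the string width from the longest element. The `names` variable is a hypothetical stand-in for whatever the loop computes.

```python
import numpy

data = ["CAT", "APPLE", "CARROT"]

# Collect the values first; this stands in for the work done in the loop.
names = [name for name in data]

# numpy sizes the string dtype to fit the longest element in the list.
my_array = numpy.array(names)

print(my_array.dtype)
print(my_array)
```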

