Python: strange behavior when initializing a numpy array of string data

Question

Asked by Jim

I am having some seemingly trivial trouble with numpy when the array contains string data. I have the following code:

import numpy

my_array = numpy.empty([1, 2], dtype=str)
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"

Now, when I print it with print my_array[0, :], I get ['C', 'A'], which is clearly not the expected output of Cat and Apple. Why is that, and how can I get the right output?

Thanks!

Answer 1

Accepted answer by BrenBarn

Numpy requires string arrays to have a fixed maximum length. When you create an empty array with dtype=str, it sets this maximum length to 1 by default. You can see this by checking my_array.dtype; it will show "|S1", meaning "one-character string" (on Python 3, where str is Unicode, it shows "<U1" instead). Subsequent assignments into the array are truncated to fit this length.
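
The truncation can be reproduced with a short sketch. It assumes Python 3, where dtype=str maps to a one-character Unicode dtype rather than the "|S1" byte-string dtype mentioned above; the behavior is otherwise the same.

```python
import numpy

# dtype=str with no explicit length yields a one-character string dtype.
my_array = numpy.empty([1, 2], dtype=str)
print(my_array.dtype)

# Longer values are silently truncated to the dtype's width.
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"
print(my_array[0, :])
```

No error or warning is raised on assignment, which is why the problem is easy to miss.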

You can pass an explicit dtype that includes the maximum length you need, e.g.:

my_array = numpy.empty([1, 2], dtype="S10")

The "S10" will create an array of strings of up to 10 characters each. You have to decide what length is big enough to hold all the data you want to store.
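
As a rough illustration of the fixed width, the sketch below (assuming Python 3, where an "S" dtype stores bytes) shows that values up to 10 characters fit and anything longer is still cut off. The word "Hippopotamus" is only an example value.

```python
import numpy

my_array = numpy.empty([1, 2], dtype="S10")
my_array[0, 0] = "Cat"
my_array[0, 1] = "Apple"
print(my_array[0, :])  # both values fit within 10 bytes

# A 12-character value is truncated to the first 10 bytes.
my_array[0, 0] = "Hippopotamus"
print(my_array[0, 0])
```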

Answer 2

Answer by Johny White

I got a "codec error" when I tried to store a non-ASCII character with dtype="S10".

You also get an array of byte strings rather than text strings, which confused me.

I think it is better to use:

my_array = numpy.empty([1, 2], dtype="<U10")

Here '<U10' means "Unicode string of up to 10 characters, little-endian byte order": the 'U10' part gives the type and length, and the '<' prefix gives the byte order.
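
A minimal sketch of the difference, assuming Python 3. The variable names unicode_array and bytes_array and the sample words are illustrative only. The byte-string assignment is wrapped in try/except because it is expected to fail for non-ASCII text, as described in this answer.

```python
import numpy

# A Unicode dtype stores non-ASCII text directly.
unicode_array = numpy.empty([1, 2], dtype="<U10")
unicode_array[0, 0] = "Café"
unicode_array[0, 1] = "Äpfel"
print(unicode_array[0, :])

# A byte-string dtype has to encode the text first and may reject it.
bytes_array = numpy.empty([1, 2], dtype="S10")
try:
    bytes_array[0, 0] = "Café"
except UnicodeError as exc:
    print("byte-string dtype rejected the value:", exc)
```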

Answer 3

Answer by spinup

A numpy string array is limited by its fixed length (length 1 by default). If you don't know in advance what length your strings will need, you can use dtype=object and store strings of arbitrary length as your data elements:

my_array = numpy.empty([1, 2], dtype=object)

I understand this approach may carry efficiency costs, but I don't have a good reference to support that.
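
A small sketch of the object-dtype approach; the long sample string is just an illustration. Each element holds a reference to an ordinary Python str, so no width applies.

```python
import numpy

my_array = numpy.empty([1, 2], dtype=object)
my_array[0, 0] = "Cat"
my_array[0, 1] = "Pineapple upside-down cake"

# Neither value is truncated, whatever its length.
print(my_array[0, :])
print(len(my_array[0, 1]))
```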

Answer 4

Answer by Plamen

Another alternative is to initialize as follows:

my_array = numpy.array([["CAT", "APPLE"], ['', '']], dtype=str)

In other words, first you write a regular Python list containing what you want, then you turn it into a numpy array. However, this fixes the maximum string length to the length of the longest string present at initialization. So if you then assign

my_array[1,0] = 'PINEAPPLE'

then the stored string would be 'PINEA'.
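
The sketch below reproduces this and adds one possible workaround that is not part of the answer above: converting to a wider string dtype with astype before assigning. The name wider and the target width "U9" are assumptions chosen for this example.

```python
import numpy

my_array = numpy.array([["CAT", "APPLE"], ["", ""]], dtype=str)
print(my_array.dtype)  # width comes from the longest initial string, "APPLE"

my_array[1, 0] = "PINEAPPLE"
print(my_array[1, 0])  # truncated to five characters

# Assumed workaround: copy into a wider dtype, then assign again.
wider = my_array.astype("U9")
wider[1, 0] = "PINEAPPLE"
print(wider[1, 0])
```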

Answer 5

Answer by KanDan

If you are building the values in a for loop, what works best is to collect them in a list first (for example with a list comprehension), so that the right amount of memory can be allocated once all the strings are known.

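data = ['CAT', 'APPLE', 'CARROT']
# Convert the finished list so numpy sizes the string dtype from the longest entry.
my_array = numpy.array([name for name in data])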
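
One way to read this advice, sketched under the assumption that the goal is a numpy array at the end: accumulate the strings in a plain list inside the loop, then convert once. The names list and the call to title() are hypothetical stand-ins for whatever per-item work the loop does.

```python
import numpy

data = ["CAT", "APPLE", "CARROT"]

# Collect results in an ordinary list; Python strings have no fixed width.
names = []
for name in data:
    names.append(name.title())  # hypothetical per-item processing

# Convert once at the end; numpy picks a width that fits the longest string.
my_array = numpy.array(names)
print(my_array.dtype)
print(my_array)
```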
