Python: convert an array of indices to a 1-hot encoded numpy array

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/29831489/

Time: 2020-08-19 05:04:00  Source: igfitidea

Convert array of indices to 1-hot encoded numpy array

Tags: python, numpy, machine-learning, numpy-ndarray, one-hot-encoding

Asked by James Atwood

Let's say I have a 1d numpy array

a = array([1,0,3])

I would like to encode this as a 2d 1-hot array

b = array([[0,1,0,0], [1,0,0,0], [0,0,0,1]])

Is there a quick way to do this? Quicker than just looping over a to set the elements of b, that is.

Accepted answer by YXD

Your array a defines the columns of the nonzero elements in the output array. You also need to define the rows and then use fancy indexing:

>>> a = np.array([1, 0, 3])
>>> b = np.zeros((a.size, a.max()+1))
>>> b[np.arange(a.size),a] = 1
>>> b
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
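
To make the pairing of row and column indices explicit, here is a minimal sketch of the same fancy-indexing idea wrapped in a function. The name one_hot_fancy and its num_classes parameter are illustrative assumptions, not part of the answer above, and the sketch uses an integer dtype instead of the default float.

```python
import numpy as np

def one_hot_fancy(a, num_classes=None):
    """One-hot encode a 1-D array of non-negative integer labels (illustrative helper)."""
    a = np.asarray(a)
    if num_classes is None:
        num_classes = a.max() + 1      # infer the number of columns from the data
    b = np.zeros((a.size, num_classes), dtype=int)
    rows = np.arange(a.size)           # row i ...
    b[rows, a] = 1                     # ... gets a 1 in column a[i]
    return b

b = one_hot_fancy([1, 0, 3])
print(b)
```

Each pair (rows[i], a[i]) addresses exactly one cell, so a single assignment sets one entry per row.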

Answer by K3---rnc

>>> values = [1, 0, 3]
>>> n_values = np.max(values) + 1
>>> np.eye(n_values)[values]
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.]])
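
Why this works: row k of the identity matrix is already the one-hot vector for class k, so indexing np.eye(n) with the labels simply picks one row per label. A small sketch, assuming example data that is 2-D, suggests the same expression also handles index arrays with more than one dimension:

```python
import numpy as np

values = np.array([[1, 0], [3, 2]])      # 2-D array of class indices (example data)
n_values = values.max() + 1

identity = np.eye(n_values, dtype=int)   # row k is the one-hot vector for class k
encoded = identity[values]               # one identity row per index

print(encoded.shape)
print(encoded[1, 0])                     # one-hot vector for the label 3
```

The result gains one trailing axis of length n_values, whatever the shape of the input.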

Answer by stackoverflowuser2010

Here is a function that converts a 1-D vector to a 2-D one-hot array.

#!/usr/bin/env python
import numpy as np

def convertToOneHot(vector, num_classes=None):
    """
    Converts an input 1-D vector of integers into an output
    2-D array of one-hot vectors, where an i'th input value
    of j will set a '1' in the i'th row, j'th column of the
    output array.

    Example:
        v = np.array((1, 0, 4))
        one_hot_v = convertToOneHot(v)
        print(one_hot_v)

        [[0 1 0 0 0]
         [1 0 0 0 0]
         [0 0 0 0 1]]
    """

    assert isinstance(vector, np.ndarray)
    assert len(vector) > 0

    if num_classes is None:
        num_classes = np.max(vector)+1
    else:
        assert num_classes > 0
        assert num_classes > np.max(vector)

    result = np.zeros(shape=(len(vector), num_classes))
    result[np.arange(len(vector)), vector] = 1
    return result.astype(int)

Below is some example usage:

>>> a = np.array([1, 0, 3])

>>> convertToOneHot(a)
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])

>>> convertToOneHot(a, num_classes=10)
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

Answer by David Nemeskey

I think the short answer is no. For a more generic case in n dimensions, I came up with this:

# For 2-dimensional data, 4 values
a = np.array([[0, 1, 2], [3, 2, 1]])
z = np.zeros(list(a.shape) + [4])
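# Index with a tuple of arrays; recent NumPy no longer treats a list as a multi-dimensional index
z[tuple(np.indices(z.shape[:-1])) + (a,)] = 1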

I am wondering if there is a better solution -- I don't like that I have to create those lists in the last two lines. Anyway, I did some measurements with timeit and it seems that the numpy-based (indices/arange) and the iterative versions perform about the same.
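
One possible way to avoid building the index lists by hand is np.put_along_axis, which is available in NumPy 1.15 and later. This is a sketch of an alternative, not the method used in this answer; num_classes is assumed to be known in advance, as the value 4 is in the snippet above.

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 2, 1]])     # 2-D array of class indices
num_classes = 4                          # assumed known, as in the snippet above

z = np.zeros(a.shape + (num_classes,), dtype=int)
# For every position (i, j), write a 1 at index a[i, j] along the last axis.
np.put_along_axis(z, a[..., np.newaxis], 1, axis=-1)

print(z.shape)
print(z[1, 0])                           # one-hot vector for the label 3
```

The extra axis added with np.newaxis is needed because the index array must have the same number of dimensions as the array being written to.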

Answer by Franck Dernoncourt

You can use sklearn.preprocessing.LabelBinarizer:

Example:

import sklearn.preprocessing
a = [1,0,3]
label_binarizer = sklearn.preprocessing.LabelBinarizer()
label_binarizer.fit(range(max(a)+1))
b = label_binarizer.transform(a)
print('{0}'.format(b))

output:

[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]]

Amongst other things, you may initialize sklearn.preprocessing.LabelBinarizer() so that the output of transform is sparse.
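
As a brief sketch of the sparse option mentioned above, the binarizer can be constructed with sparse_output=True, in which case transform returns a SciPy sparse matrix instead of a dense array. The example assumes scikit-learn is installed and reuses the data from the example above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

a = [1, 0, 3]
label_binarizer = LabelBinarizer(sparse_output=True)
label_binarizer.fit(np.arange(max(a) + 1))   # classes 0..3, as in the example above
b_sparse = label_binarizer.transform(a)      # SciPy sparse matrix

print(b_sparse.shape)
print(b_sparse.toarray())                    # densify only for display
```

A sparse result stores only the nonzero entries, which can matter when the number of classes is large.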

Answer by Jodo

In case you are using Keras, there is a built-in utility for that:

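from keras.utils import to_categorical  # older Keras releases: keras.utils.np_utils

# int_labels: a 1-D sequence of integer class labels, e.g. [1, 0, 2]
categorical_labels = to_categorical(int_labels, num_classes=3)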

It does pretty much the same as @YXD's answer (see the source code).

Answer by Aaron Lelevier

Here is an example function that I wrote to do this based upon the answers above and my own use case:

def label_vector_to_one_hot_vector(vector, one_hot_size=10):
    """
    Use to convert a column vector to a 'one-hot' matrix

    Example:
        vector: [[2], [0], [1]]
        one_hot_size: 3
        returns:
            [[ 0.,  0.,  1.],
             [ 1.,  0.,  0.],
             [ 0.,  1.,  0.]]

    Parameters:
        vector (np.array): of size (n, 1) to be converted
        one_hot_size (int) optional: size of 'one-hot' row vector

    Returns:
        np.array size (vector.size, one_hot_size): converted to a 'one-hot' matrix
    """
    squeezed_vector = np.squeeze(vector, axis=-1)

    one_hot = np.zeros((squeezed_vector.size, one_hot_size))

    one_hot[np.arange(squeezed_vector.size), squeezed_vector] = 1

    return one_hot

label_vector_to_one_hot_vector(vector=[[2], [0], [1]], one_hot_size=3)

Answer by Emil Melnikov

Just to elaborate on the excellent answer from K3---rnc, here is a more generic version:

def onehottify(x, n=None, dtype=float):
    """1-hot encode x with the max value n (computed from data if n is None)."""
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]

Also, here is a quick-and-dirty benchmark of this method and a method from the currently accepted answer by YXD (slightly changed, so that they offer the same API, except that the latter works only with 1D ndarrays):

def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b

The latter method is ~35% faster (MacBook Pro 13 2015), but the former is more general:

>>> import numpy as np
>>> np.random.seed(42)
>>> a = np.random.randint(0, 9, size=(10_000,))
>>> a
array([6, 3, 7, ..., 5, 8, 6])
>>> %timeit onehottify(a, 10)
188 μs ± 5.03 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit onehottify_only_1d(a, 10)
139 μs ± 2.78 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
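
The %timeit lines above need an IPython session. For a plain Python script, a rough equivalent can be sketched with the standard timeit module. The repetition count of 200 and the use of np.random.default_rng are assumptions of this sketch, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

def onehottify(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    return np.eye(n, dtype=dtype)[x]

def onehottify_only_1d(x, n=None, dtype=float):
    x = np.asarray(x)
    n = np.max(x) + 1 if n is None else n
    b = np.zeros((len(x), n), dtype=dtype)
    b[np.arange(len(x)), x] = 1
    return b

rng = np.random.default_rng(42)
a = rng.integers(0, 9, size=10_000)

# Both implementations should agree before their timings are compared.
same = np.array_equal(onehottify(a, 10), onehottify_only_1d(a, 10))
print("identical output:", same)

repeats = 200  # arbitrary choice for this sketch
for func in (onehottify, onehottify_only_1d):
    seconds = timeit.timeit(lambda: func(a, 10), number=repeats)
    print(f"{func.__name__}: {seconds / repeats * 1e6:.1f} µs per call")
```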

Answer by Hans T

I recently ran into the same kind of problem and found that the solutions above are only satisfactory when the labels form a contiguous range starting at zero. For example, if you want to one-hot encode the following list:

all_good_list = [0,1,2,3,4]

then go ahead, the solutions posted above work fine. But consider this data instead:

problematic_list = [0,23,12,89,10]

If you do it with the methods mentioned above, you will likely end up with 90 one-hot columns. This is because all answers include something like n = np.max(a)+1. I found a more generic solution that worked for me and wanted to share it with you:

import numpy as np
import sklearn.preprocessing
sklb = sklearn.preprocessing.LabelBinarizer()
a = np.asarray([1,2,44,3,2])
n = np.unique(a)
sklb.fit(n)
b = sklb.transform(a)

I hope this comes in handy for anyone who has run into the same restriction with the solutions above.
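
If scikit-learn is not available, a similar result can be sketched with NumPy alone: np.unique with return_inverse=True maps each label to its position among the sorted distinct labels, and those positions can then be one-hot encoded as in the earlier answers. This is an alternative under that assumption, not the method used in this answer.

```python
import numpy as np

a = np.asarray([1, 2, 44, 3, 2])

# classes: sorted distinct labels; codes: position of each element of a in classes
classes, codes = np.unique(a, return_inverse=True)
codes = codes.reshape(-1)            # make sure the codes are 1-D

b = np.zeros((a.size, classes.size), dtype=int)
b[np.arange(a.size), codes] = 1

print(classes)                       # column j of b corresponds to classes[j]
print(b)
```

The result has one column per distinct label (four here) rather than max(a) + 1 columns.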

Answer by D.Samchuk

Here is what I find useful:

def one_hot(a, num_classes):
  return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

Here num_classes stands for the number of classes you have. So if a is a vector of shape (10000,), this function transforms it to shape (10000, C), where C is num_classes. Note that a is zero-indexed, i.e. one_hot(np.array([0, 1]), 2) will give [[1, 0], [0, 1]].

I believe this is exactly what you wanted.
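
One caveat worth checking: np.squeeze removes every axis of length 1, so an input with a single label comes back as a 1-D vector rather than a (1, C) matrix. The sketch below compares the function above with a hypothetical variant, one_hot_keep_2d, that skips the squeeze; the variant's name is an assumption made for illustration.

```python
import numpy as np

def one_hot(a, num_classes):
    # Function from the answer above, reproduced for comparison.
    return np.squeeze(np.eye(num_classes)[a.reshape(-1)])

def one_hot_keep_2d(a, num_classes):
    # Hypothetical variant: without np.squeeze the result is always 2-D.
    return np.eye(num_classes)[np.asarray(a).reshape(-1)]

single = np.array([2])
print(one_hot(single, 3).shape)           # the row axis is squeezed away
print(one_hot_keep_2d(single, 3).shape)   # the row axis is kept
```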

PS: the source is Sequence models - deeplearning.ai
