Python: a better way to shuffle two numpy arrays in unison

Question

Asked by Josh Bleecher Snyder

I have two numpy arrays of different shapes, but with the same length (leading dimension). I want to shuffle each of them, such that corresponding elements continue to correspond -- i.e. shuffle them in unison with respect to their leading indices.

This code works, and illustrates my goals:

import numpy

def shuffle_in_unison(a, b):
    assert len(a) == len(b)
    shuffled_a = numpy.empty(a.shape, dtype=a.dtype)
    shuffled_b = numpy.empty(b.shape, dtype=b.dtype)
    permutation = numpy.random.permutation(len(a))
    # move each element of a and b to the same randomly chosen new index
    for old_index, new_index in enumerate(permutation):
        shuffled_a[new_index] = a[old_index]
        shuffled_b[new_index] = b[old_index]
    return shuffled_a, shuffled_b

For example:

>>> a = numpy.asarray([[1, 1], [2, 2], [3, 3]])
>>> b = numpy.asarray([1, 2, 3])
>>> shuffle_in_unison(a, b)
(array([[2, 2],
       [1, 1],
       [3, 3]]), array([2, 1, 3]))

However, this feels clunky, inefficient, and slow, and it requires making a copy of the arrays -- I'd rather shuffle them in-place, since they'll be quite large.

Is there a better way to go about this? Faster execution and lower memory usage are my primary goals, but elegant code would be nice, too.

One other thought I had was this:

def shuffle_in_unison_scary(a, b):
    rng_state = numpy.random.get_state()
    numpy.random.shuffle(a)
    numpy.random.set_state(rng_state)
    numpy.random.shuffle(b)

This works... but it's a little scary, as I see little guarantee it'll continue to work -- it doesn't look like the sort of thing that's guaranteed to survive across numpy versions, for example.

Answer 1

Accepted answer by Sven Marnach

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all, and instead create c, a2 and b2 right away.

This solution could be adapted to the case where a and b have different dtypes.
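One possible way to handle differing dtypes, sketched here under assumptions rather than taken from the answer, is a structured array: each record holds one row of a and one row of b, and indexing by field name yields the two views. The field names "a" and "b", the shapes, the fill values and the seeded generator are all illustrative choices.

```python
import numpy as np

n = 4
# Hypothetical record layout: a float block of shape (2, 3) and an int pair.
record = np.dtype([("a", np.float64, (2, 3)), ("b", np.int64, (2,))])
c = np.zeros(n, dtype=record)

# Field access returns views that share memory with c.
a2 = c["a"]  # shape (n, 2, 3), float64
b2 = c["b"]  # shape (n, 2), int64

# Tag every record with its own index so correspondence can be checked later.
for i in range(n):
    a2[i] = i
    b2[i] = i

rng = np.random.default_rng(0)
rng.shuffle(c)  # permutes whole records in place, so a2 and b2 move together
```

After the shuffle, a2[i] and b2[i] still carry the same tag for every i, because only whole records were moved.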

Answer 2

Answer by DaveP

If you want to avoid copying arrays, then I would suggest that instead of generating a permutation list, you go through every element in the array and randomly swap it with another position in the array:

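for old_index in range(len(a)):
    # pick a position in [0, old_index] to swap with
    new_index = numpy.random.randint(old_index+1)
    # index-array assignment copies the right-hand side first, so the swap is
    # also correct when a[i] is a view (a row of a multi-dimensional array)
    a[[old_index, new_index]] = a[[new_index, old_index]]
    b[[old_index, new_index]] = b[[new_index, old_index]]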

This implements the Knuth-Fisher-Yates shuffle algorithm, applying each swap to both arrays.
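As a minimal sketch of the same idea wrapped in a function: the name fisher_yates_in_unison and the optional rng parameter are assumptions for illustration, and this version walks from the last index down (the textbook form) rather than from the first index up. Swaps go through index arrays because rows of a multi-dimensional array are views.

```python
import numpy as np

def fisher_yates_in_unison(a, b, rng=None):
    """Shuffle a and b in place using the same sequence of swaps."""
    assert len(a) == len(b)
    rng = np.random.default_rng() if rng is None else rng
    for i in range(len(a) - 1, 0, -1):
        j = int(rng.integers(i + 1))  # uniform over 0..i inclusive
        # Index-array assignment copies the right-hand side before writing,
        # so the swap is safe for 1-D and multi-dimensional arrays alike.
        a[[i, j]] = a[[j, i]]
        b[[i, j]] = b[[j, i]]
```

No full-size copy of either array is made; each step only touches two rows. The Python-level loop is likely to be slower than a vectorized permutation for large arrays.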

Answer 3

Answer by mtrw

You can use NumPy's array indexing:

def unison_shuffled_copies(a, b):
    assert len(a) == len(b)
    p = numpy.random.permutation(len(a))
    return a[p], b[p]

This creates separate unison-shuffled copies and leaves the original arrays unchanged.
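If the shuffled result should end up in the original arrays instead, one option is to write the permuted copy back through slice assignment. This is a sketch only: the function name and the rng parameter are hypothetical, and a temporary copy of each array is still created along the way.

```python
import numpy as np

def shuffle_in_unison_writeback(a, b, rng=None):
    """Permute a and b identically, storing the result in the original arrays."""
    assert len(a) == len(b)
    rng = np.random.default_rng() if rng is None else rng
    p = rng.permutation(len(a))
    # a[p] builds a temporary shuffled copy; slice assignment then copies it
    # into the existing buffer, so other references to a see the new order.
    a[:] = a[p]
    b[:] = b[p]
```

Because the arrays are processed one after the other, only one temporary copy needs to be alive at a time.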

Answer 4

Answer by James

import numpy as np
from sklearn.utils import shuffle

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
X, y = shuffle(X, y, random_state=0)

To learn more, see http://scikit-learn.org/stable/modules/generated/sklearn.utils.shuffle.html

Answer 5

Answer by ajfbiw.s

Here is what I'm doing, as an example:

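import numpy as np
from random import shuffle  # assumption: the answer does not say which shuffle it uses

# pair each image with its label so they move together
combo = []
for i in range(60000):
    combo.append((images[i], labels[i]))

shuffle(combo)

# split the shuffled pairs back into two arrays
im = []
lab = []
for c in combo:
    im.append(c[0])
    lab.append(c[1])
images = np.asarray(im)
labels = np.asarray(lab)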

Answer 6

Answer by connor

Very simple solution:

randomize = np.arange(len(x))
np.random.shuffle(randomize)
x = x[randomize]
y = y[randomize]

The two arrays x and y are now both randomly shuffled in the same way.

Answer 7

Answer by Ivo

I extended Python's random.shuffle() to take a second argument:

import random

def shuffle_together(x, y):
    assert len(x) == len(y)

    for i in reversed(range(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random.random() * (i+1))
        # tuple swap is fine for lists and 1-D arrays; rows of an n-D array are views
        x[i], x[j] = x[j], x[i]
        y[i], y[j] = y[j], y[i]

That way I can be sure that the shuffling happens in place, and the function is neither too long nor too complicated.
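One caveat when a tuple swap like this is applied to multi-dimensional NumPy arrays: indexing a row returns a view, not a copy, so the second assignment reads data that the first one already overwrote. The small demonstration below uses made-up arrays named rows and fixed to show the effect, along with an index-array swap that avoids it.

```python
import numpy as np

rows = np.array([[1, 1], [2, 2]])
# Both right-hand sides are views into rows, not copies. The first assignment
# overwrites row 0, and the second then reads the already overwritten data.
rows[0], rows[1] = rows[1], rows[0]
print(rows)  # both rows now hold [2, 2]

# Swapping through index arrays copies the right-hand side before writing.
fixed = np.array([[1, 1], [2, 2]])
fixed[[0, 1]] = fixed[[1, 0]]
print(fixed)  # rows exchanged: [[2, 2], [1, 1]]
```

For Python lists and 1-D arrays the tuple swap behaves as expected, since the elements are read out as independent objects before either assignment runs.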

Answer 8

Answer by Adam Snaider

One way to shuffle connected lists in place is to pick a seed (it could be random) and use numpy.random.shuffle to do the shuffling.

import numpy as np

# Set seed to a random number if you want the shuffling to be non-deterministic.
def shuffle(a, b, seed):
    np.random.seed(seed)
    np.random.shuffle(a)
    # re-seed so the second shuffle draws the same random numbers
    np.random.seed(seed)
    np.random.shuffle(b)

That's it. This will shuffle both a and b in exactly the same way. It is also done in place, which is always a plus.

EDIT: don't use np.random.seed(); use np.random.RandomState instead.

def shuffle(a, b, seed):
   rand_state = np.random.RandomState(seed)
   rand_state.shuffle(a)
   rand_state.seed(seed)
   rand_state.shuffle(b)

When calling it, just pass in any seed to feed the random state:

a = [1,2,3,4]
b = [11, 22, 33, 44]
shuffle(a, b, 12345)

Output:

>>> a
[1, 4, 2, 3]
>>> b
[11, 44, 22, 33]

Edit: Fixed code to re-seed the random state

Answer 9

Answer by mohammad hassan bigdeli shamlo

You can make an index array like this:

s = np.arange(0, len(a), 1)

Then shuffle it:

np.random.shuffle(s)

Now use s to index your arrays. Indexing both with the same shuffled indices returns identically shuffled vectors.

x_data = x_data[s]
x_label = x_label[s]
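A side benefit of keeping the index array around is that the shuffle can be undone. The sketch below uses made-up data and a seeded generator purely for illustration; the names shuffled_data, inverse and restored_data are not from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
x_data = np.arange(12).reshape(6, 2)   # row i is [2*i, 2*i + 1]
x_label = np.arange(6)

s = rng.permutation(len(x_data))
shuffled_data = x_data[s]
shuffled_label = x_label[s]

# argsort of a permutation is its inverse: inverse[s[k]] == k.
inverse = np.argsort(s)
restored_data = shuffled_data[inverse]
restored_label = shuffled_label[inverse]
```

Indexing the shuffled arrays with inverse puts every row back where it started, which can help when results need to be reported in the original order.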

Answer 10

Answer by Daniel

James wrote an sklearn solution in 2015, which is helpful. But he added a random_state argument, which is not needed. In the code below, the random state from numpy is assumed automatically.

import numpy as np
from sklearn.utils import shuffle

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])
X, y = shuffle(X, y)
