Python: Shuffle DataFrame rows

Question

Asked by JNevens

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...

How can I achieve this?

Answer 1

Accepted answer by Kris

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).

Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
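As a rough illustration of the two calls above, here is a minimal sketch on a toy frame. The column values and the random_state seed are made up for the example; the seed is only there so the run is repeatable and is not part of the original answer.

```python
import pandas as pd

# Toy frame grouped by Type, mimicking the layout in the question.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples every row exactly once (no replacement), i.e. a shuffle.
# random_state is an optional seed, used here only for repeatability.
shuffled = df.sample(frac=1, random_state=0)

# The old index labels travel with the rows; reset_index(drop=True)
# discards them and renumbers from 0 without adding an "index" column.
renumbered = shuffled.reset_index(drop=True)

print(shuffled)
print(renumbered)
```

Note that neither call modifies df; both return new DataFrames, which is why the answer reassigns the result to df.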

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

Answer 2

Answer by joris

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep="\s+")

In [13]: df
Out[13]: 
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]: 
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered 0, 1, ..., n-1 as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)
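Putting the pieces of this answer together, a self-contained sketch might look like the following. The data, the variable names df_shuffled and df_renumbered, and the use of np.random.default_rng with an arbitrary seed are my assumptions for illustration; the answer itself uses the legacy np.random.permutation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10, 13, 16], "Type": [1, 1, 2, 2, 3, 3]},
    index=[0, 1, 20, 21, 45, 46],
)

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily
positions = rng.permutation(len(df))  # shuffled integer positions 0..n-1

# iloc selects by position, so the original labels move with their rows.
df_shuffled = df.iloc[positions]

# Optional: throw the old labels away and renumber 0..n-1.
df_renumbered = df_shuffled.reset_index(drop=True)

print(df_shuffled)
print(df_renumbered)
```

Because iloc works on positions rather than labels, this approach should also work when the index contains duplicate labels.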

Answer 3

Answer by tj89

You can simply use sklearn for this:

from sklearn.utils import shuffle
df = shuffle(df)

Answer 4

Answer by haku

TL;DR: np.random.shuffle(ndarray) can do the job. So, in your case:

np.random.shuffle(DataFrame.values)

DataFrame, under the hood, uses a NumPy ndarray as its data holder. (You can check this in the DataFrame source code.)

So if you use np.random.shuffle(), it shuffles the array along the first axis of a multi-dimensional array. But the index of the DataFrame remains unshuffled.

There are some points to consider, though.

The function returns None. In case you want to keep a copy of the original object, you have to make one before you pass it to the function.
sklearn.utils.shuffle(), as user tj89 suggested, accepts random_state along with another option to control the output. You may want that for development purposes.
sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (index, column) of the DataFrame along with the ndarray it contains.
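To make the first point concrete, here is a small sketch of the in-place behaviour. It deliberately shuffles an explicit NumPy copy rather than df.values, since whether .values is a view or a copy depends on the dtypes and the pandas version; the toy data and the names arr and shuffled are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

# Explicit copy, so the original df is guaranteed to stay untouched.
arr = df.to_numpy(copy=True)

# Shuffles the rows of arr in place (along the first axis).
result = np.random.shuffle(arr)
print(result)  # None

# Rebuild a frame: the rows are shuffled, but the index is still 0..n-1.
shuffled = pd.DataFrame(arr, columns=df.columns, index=df.index)
print(shuffled)
```

Each row keeps its own values together (for example 1 stays paired with 5); only the order of the rows changes.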

Benchmark result between sklearn.utils.shuffle() and np.random.shuffle():

ndarray

nd = sklearn.utils.shuffle(nd)
0.10793248389381915 sec. 8x faster

np.random.shuffle(nd)
0.8897626010002568 sec

DataFrame

df = sklearn.utils.shuffle(df)
0.3183923360193148 sec. 3x faster

np.random.shuffle(df.values)
0.9357550159329548 sec

Conclusion: If it is okay for the axis info (index, column) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().

Code used:

import timeit

# Shared setup: a 1000x100 float array and a DataFrame built from it.
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''

# Each statement runs 1000 times; timeit returns the total time in seconds.
print(timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000))
print(timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000))
print(timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000))

Answer 5

Answer by Abhilash Reddy Yammanuru

Shuffle the pandas data frame by taking a sample array (in this case the index), randomizing its order, and then setting that array as the index of the data frame. Now sort the data frame according to the index. Here is your shuffled dataframe:

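import random
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
index = list(range(len(df)))  # positions 0..n-1
random.shuffle(index)         # randomize them in place
# Use the shuffled numbers as the index, then sort by it to reorder the rows.
df = df.set_index([index]).sort_index()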

Insert your data frame in place of mine in the above code.

Answer 6

Answer by Ido Cohn

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))
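For reference, a small runnable sketch of this one-liner on made-up data. One caveat I would add as an assumption rather than something the answer states: reindex looks rows up by label, so it relies on the index labels being unique.

```python
import numpy as np
import pandas as pd

# Toy frame with non-contiguous but unique index labels.
df = pd.DataFrame({"Col1": [1, 7, 13]}, index=[0, 20, 45])

# Permute the labels, then look each row up by its label.
df_shuffled = df.reindex(np.random.permutation(df.index))

print(df_shuffled)
```

Every label still points at the same row as before; only the order of the labels has changed.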

Answer 7

Answer by NotANumber

(I don't have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised about whether the first method:

df.sample(frac=1)

made a deep copy or just changed the dataframe. I ran the following code:

print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))

and my results were:

0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70

which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.
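A complementary check, sketched here under the assumption of a small all-float frame, is to ask NumPy whether the two frames' underlying buffers overlap. This is not from the original answer; it is only one way to look at the data rather than at the Python wrapper objects, whose id() values can be reused once temporaries are freed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3))
shuffled = df.sample(frac=1)

# Object identity: sample() hands back a new DataFrame.
print(shuffled is df)  # False

# Buffer overlap: do the two frames share any underlying memory?
shares = np.shares_memory(df.to_numpy(), shuffled.to_numpy())
print(shares)
```

If shares prints False, the shuffled frame holds its own copy of the data, which is consistent with the conclusion above.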

Answer 8

Answer by soulmachine

Here is another way:

df['rnd'] = np.random.rand(len(df))
# sort_values(inplace=True) returns None, so sort without inplace before chaining .drop
df = df.sort_values(by='rnd').drop('rnd', axis=1)

Answer 9

Answer by PV8

What is also useful: if you use it for machine learning and always want to separate the same data, you could use:

df.sample(n=len(df), random_state=42)

This makes sure that your random choice is always reproducible.
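A quick way to convince yourself of this is to shuffle twice with the same seed and compare. The frame below and the names first and second are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Col1": range(10)})

# Same seed, same shuffle: n=len(df) and frac=1 both request every row.
first = df.sample(n=len(df), random_state=42)
second = df.sample(frac=1, random_state=42)

print(first.equals(second))  # True
```

Changing random_state (or leaving it out) gives a different order from run to run.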
