Python: Shuffle DataFrame rows
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same CC BY-SA license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/29576430/
Shuffle DataFrame rows
Asked by JNevens
I have the following DataFrame:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
...
20 7 8 9 2
21 10 11 12 2
...
45 13 14 15 3
46 16 17 18 3
...
The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:
Col1 Col2 Col3 Type
0 7 8 9 2
1 13 14 15 3
...
20 1 2 3 1
21 10 11 12 2
...
45 4 5 6 1
46 16 17 18 3
...
How can I achieve this?
Accepted answer by Kris
The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
Note: If you wish to shuffle your dataframe in-place and reset the index, you could do, e.g.:
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
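A minimal sketch of the pattern above on a small made-up frame. The column values and the random_state seed are assumptions added for illustration; they are not part of the answer:

```python
import pandas as pd

# Hypothetical toy data, grouped by Type like the frame in the question.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples every row without replacement, i.e. a full shuffle.
# random_state is optional; it is set here only to make the run repeatable.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
```

The result holds the same rows as df in a different order, with the index renumbered from 0.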
Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line # Mem usage Increment Line Contents
================================================
5 68.5 MiB 68.5 MiB @profile
6 def shuffle():
7 847.8 MiB 779.3 MiB df = pd.DataFrame(np.random.randn(100, 1000000))
8 847.9 MiB 0.1 MiB df = df.sample(frac=1).reset_index(drop=True)
Answer by joris
You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):
In [12]: df = pd.read_csv(StringIO(s), sep="\s+")
In [13]: df
Out[13]:
Col1 Col2 Col3 Type
0 1 2 3 1
1 4 5 6 1
20 7 8 9 2
21 10 11 12 2
45 13 14 15 3
46 16 17 18 3
In [14]: df.iloc[np.random.permutation(len(df))]
Out[14]:
Col1 Col2 Col3 Type
46 16 17 18 3
45 13 14 15 3
20 7 8 9 2
0 1 2 3 1
1 4 5 6 1
21 10 11 12 2
If you want to keep the index numbered sequentially (0, 1, ..., n-1) as in your example, you can simply reset the index of the shuffled frame: df_shuffled.reset_index(drop=True)
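The two steps above can be sketched together as follows. The toy data, the seeded np.random.default_rng generator, and the variable names are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})  # toy data

rng = np.random.default_rng(seed=0)   # seeded only for repeatability
order = rng.permutation(len(df))      # a random ordering of 0..len(df)-1

df_shuffled = df.iloc[order]                        # keeps the old index labels
df_renumbered = df_shuffled.reset_index(drop=True)  # index becomes 0..n-1

print(df_shuffled)
print(df_renumbered)
```

Because .iloc selects by position, this works regardless of what the index labels are.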
Answer by tj89
You can simply use sklearn for this:
from sklearn.utils import shuffle
df = shuffle(df)
Answer by haku
TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(DataFrame.values)
DataFrame, under the hood, uses NumPy ndarray as data holder. (You can check from DataFrame source code)
So if you use np.random.shuffle(), it shuffles the array along the first axis of a multi-dimensional array. But the index of the DataFrame remains unshuffled.
Though, there are some points to consider.
- The function returns None. In case you want to keep a copy of the original object, you have to do so before you pass it to the function.
- sklearn.utils.shuffle(), as user tj89 suggested, can designate random_state along with another option to control output. You may want that for dev purposes.
- sklearn.utils.shuffle() is faster. But it WILL SHUFFLE the axis info (index, column) of the DataFrame along with the ndarray it contains.
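Shuffling DataFrame.values in place relies on .values handing back a writable view of the underlying data. That is not guaranteed: with mixed column dtypes, for example, it is a copy, and depending on the pandas version it may be read-only. A more conservative sketch, assuming numeric toy data and a seeded generator (both made up for illustration), shuffles an explicit copy and rebuilds the frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})  # toy data

arr = df.to_numpy(copy=True)          # explicit, writable copy of the values
rng = np.random.default_rng(seed=0)   # seeded only for repeatability
rng.shuffle(arr)                      # shuffles rows (axis 0) in place, returns None

# Rebuild a frame: rows are shuffled, the index is a fresh 0..n-1.
shuffled = pd.DataFrame(arr, columns=df.columns)

print(shuffled)
```

This leaves the original df untouched, at the cost of one extra copy of the data.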
Benchmark result between sklearn.utils.shuffle() and np.random.shuffle().
ndarray
nd = sklearn.utils.shuffle(nd)
0.10793248389381915 sec. 8x faster
np.random.shuffle(nd)
0.8897626010002568 sec
DataFrame
df = sklearn.utils.shuffle(df)
0.3183923360193148 sec. 3x faster
np.random.shuffle(df.values)
0.9357550159329548 sec
Conclusion: If it is okay for the axis info (index, column) to be shuffled along with the ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle().
Code used:
import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn.utils
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''
timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
Answer by Abhilash Reddy Yammanuru
Shuffle the pandas data frame by taking a sample array (in this case the index), randomizing its order, and then setting the array as the index of the data frame. Now sort the data frame according to the index. Here is your shuffled dataframe:
import random
import pandas as pd
df = pd.DataFrame({"a":[1,2,3,4],"b":[5,6,7,8]})
index = [i for i in range(df.shape[0])]
random.shuffle(index)
df.set_index([index]).sort_index()
Output:
a b
0 2 6
1 1 5
2 3 7
3 4 8
Insert your data frame in place of mine in the above code.
Answer by Ido Cohn
AFAIK the simplest solution is:
df_shuffled = df.reindex(np.random.permutation(df.index))
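Note that reindex looks rows up by label, so this approach assumes the index labels are unique. A small sketch on made-up data (the frame and its labels are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with unique, non-contiguous labels like those in the question.
df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=[0, 1, 20, 21])

# permutation() returns a shuffled copy of the index labels;
# reindex() then looks each row up by its label.
df_shuffled = df.reindex(np.random.permutation(df.index))

print(df_shuffled)
```

Each row keeps its original label, so the old position of every row can still be read off the index.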
Answer by NotANumber
(I don't have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised about whether the first method:
df.sample(frac=1)
makes a deep copy or just changes the dataframe. I ran the following code:
print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))
and my results were:
0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70
which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.
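The check above compares object identity only. A minimal sketch that also looks at the data buffers, assuming a small single-dtype toy frame and a fixed seed (both made up for illustration); the outcome may depend on the pandas version:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30.0).reshape(10, 3))  # toy single-dtype frame
before = df.copy()

shuffled = df.sample(frac=1, random_state=0)

print(shuffled is df)                                        # identity check
print(np.shares_memory(df.to_numpy(), shuffled.to_numpy()))  # buffer check
print(df.equals(before))                                     # original untouched?
```

np.shares_memory reports whether the two arrays overlap in memory, which is a more direct test than comparing id() values.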
Answer by soulmachine
Here is another way:
df['rnd'] = np.random.rand(len(df))
df = df.sort_values(by='rnd').drop('rnd', axis=1)
Answer by PV8
Also useful: if you use it for machine learning and always want to separate the same data, you could use:
df.sample(n=len(df), random_state=42)
This makes sure that your random choice is always reproducible.
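A short sketch of what that reproducibility looks like in practice, on made-up data (the frame is an assumption for illustration; the seed 42 is the one from the answer):

```python
import pandas as pd

df = pd.DataFrame({"a": range(10), "b": range(10, 20)})  # toy data

# Same seed, same input: the two shuffles come out in the same order.
first = df.sample(frac=1, random_state=42)
second = df.sample(n=len(df), random_state=42)

print(first.equals(second))
```

Here frac=1 and n=len(df) both request every row, so with the same random_state they should produce the same ordering.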