What exactly does the Pandas random_state do?
Source: http://stackoverflow.com/questions/45211624/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by Newskooler
I have the following code, in which I use the Pandas random_state parameter:
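import pandas as pd

randomState = 123  # seed passed to the sampler
sampleSize = 750   # number of rows to draw
df = pd.read_csv(filePath, delim_whitespace=True)  # filePath: path to a whitespace-delimited file
df_s = df.sample(n=sampleSize, random_state=randomState)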
This generates a sample dataframe df_s. Every time I run the code with the same randomState, I get the same sample df_s. When I change the value from 123 to 12, the sample changes as well, so I guess that is what random_state does.
My silly question: how does changing the number affect the sample? I read the Pandas documentation and the NumPy documentation, but could not get a clear picture.
Any straightforward explanation with an example would be much appreciated.
Accepted answer by jotasi
As described in the documentation of pandas.DataFrame.sample, the random_state parameter accepts either an integer (as in your case) or a numpy.random.RandomState, which is a container for a Mersenne Twister pseudo-random number generator.
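To make the two accepted argument types concrete, here is a minimal sketch. The toy dataframe and the variable names are made up for illustration. In the pandas versions I am aware of, an integer is wrapped in a fresh RandomState internally, so the two calls should pick the same rows; treat that as an observation about the implementation rather than a documented guarantee.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the CSV from the question (assumption).
df = pd.DataFrame({"value": np.arange(100)})

# Option 1: pass an integer seed.
by_int = df.sample(n=5, random_state=123)

# Option 2: pass an already seeded RandomState object.
by_rs = df.sample(n=5, random_state=np.random.RandomState(123))

# Print the selected row labels from each call for comparison.
print(by_int.index.tolist())
print(by_rs.index.tolist())
```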
If you pass it an integer, it will use this as a seed for a pseudo-random number generator. As the name already says, the generator does not produce true randomness. Rather, it has an internal state which is initialized based on a seed (for NumPy's global generator, you can inspect this state by calling np.random.get_state()). When initialized with the same seed, it will reproduce the same sequence of "random numbers".
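The behaviour described above can be checked with a short sketch. The dataframe here is a made-up stand-in for the data in the question, and the seeds 123 and 12 are the ones the question mentions.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): 1000 rows so that different seeds have room to differ.
df = pd.DataFrame({"value": np.arange(1000)})

a = df.sample(n=10, random_state=123)
b = df.sample(n=10, random_state=123)  # same seed as a
c = df.sample(n=10, random_state=12)   # different seed

print(a.equals(b))  # same seed, so the same rows are drawn
print(a.equals(c))  # different seed, so almost certainly different rows

# The same idea without pandas: two generators seeded identically
# produce identical sequences.
rs1 = np.random.RandomState(123)
rs2 = np.random.RandomState(123)
seq1 = rs1.randint(0, 100, size=5).tolist()
seq2 = rs2.randint(0, 100, size=5).tolist()
print(seq1 == seq2)

# The internal state is a tuple whose first element names the algorithm.
print(rs1.get_state()[0])
```

The seed itself carries no meaning: 123 is not "more random" than 12. It only selects the starting point of a deterministic sequence.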
If you pass it a RandomState, it will use this (already initialized and seeded) RandomState to generate pseudo-random numbers. This also allows you to get reproducible results: set a fixed seed when initializing the RandomState, and then pass this RandomState around. In fact, you should prefer this over setting the seed of NumPy's internal RandomState. The reasoning is explained in this answer by Robert Kern and the comments on it. The idea is to have an independent stream, which prevents other parts of the program from messing up your reproducibility by changing the seed of NumPy's internal RandomState.
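As a rough sketch of why an independent stream helps: below, noisy_library_call is a hypothetical stand-in for some other part of a program that reseeds NumPy's global generator. The private RandomState is unaffected by it, so the second sample is the same whether or not the interference happens.

```python
import numpy as np
import pandas as pd

# Toy data (assumption).
df = pd.DataFrame({"x": np.arange(100)})

def noisy_library_call():
    # Hypothetical third-party code that reseeds the global generator.
    np.random.seed(0)

def two_samples(interfere):
    # Private, independently seeded stream that is passed around explicitly.
    rs = np.random.RandomState(123)
    first = df.sample(n=5, random_state=rs)
    if interfere:
        noisy_library_call()
    # rs has advanced since the first call, but only through our own use of it.
    second = df.sample(n=5, random_state=rs)
    return first, second

clean = two_samples(interfere=False)
disturbed = two_samples(interfere=True)

# Reseeding the global generator did not change what the private stream produced.
print(clean[0].equals(disturbed[0]))
print(clean[1].equals(disturbed[1]))
```

Had the code relied on np.random.seed(123) followed by df.sample(n=5) with no random_state, the second sample would depend on whatever else touched the global generator in between.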