Note: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/42191717/

Time: 2020-08-19 21:21:51  Source: igfitidea

Python random state in splitting dataset

Tags: python, random, machine-learning, scikit-learn

Asked by Shelly

I'm fairly new to Python. Can anyone tell me why we set random_state to zero when splitting a dataset into train and test sets?

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=0)

I have also seen cases like this, where random_state is set to one:

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)

What effect does this random state have on cross-validation?

Answer by Vivek Kumar

It doesn't matter whether random_state is 0, 1, or any other integer. What matters is that it is set to the same value each time if you want to validate your processing over multiple runs of the code. Incidentally, I have seen random_state=42 used in many official scikit-learn examples as well as elsewhere.

As the name suggests, random_state is used to initialize the internal random number generator, which in your case decides how the data is split into train and test indices. The documentation states:

If random_state is None or np.random, then a randomly-initialized RandomState object is returned.

If random_state is an integer, then it is used to seed a new RandomState object.

If random_state is a RandomState object, then it is passed through.
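The three cases quoted above can be tried out directly. The following is a minimal sketch, assuming the quoted text describes scikit-learn's `sklearn.utils.check_random_state` helper; the variable names are illustrative only.

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seeds a brand-new RandomState, so equal seeds give equal draws.
rng_a = check_random_state(0)
rng_b = check_random_state(0)
draws_a = rng_a.randint(0, 100, size=5)
draws_b = rng_b.randint(0, 100, size=5)
same_draws = np.array_equal(draws_a, draws_b)

# An existing RandomState instance is passed through unchanged.
existing = np.random.RandomState(1)
passed_through = check_random_state(existing) is existing

# None falls back to a RandomState that was not seeded by us.
global_rng = check_random_state(None)
is_random_state = isinstance(global_rng, np.random.RandomState)

print(same_draws, passed_through, is_random_state)
```

All three flags should print as True; only the integer case pins down the actual sequence of numbers.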

This lets you check and validate the data when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code. Unless some other source of randomness is present in the process, the results will be the same every time. This helps in verifying the output.
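The question also asks about cross-validation. As a hedged illustration (the splitter, fold count, and helper name below are assumptions, not part of the original answer), the same idea applies to scikit-learn's `KFold` when `shuffle=True`: a fixed integer seed should reproduce the same folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

def fold_test_indices(seed):
    """Return the test indices of each fold for a shuffled 5-fold split."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(X)]

folds_first = fold_test_indices(0)
folds_second = fold_test_indices(0)

# Same seed, same folds.
print(folds_first == folds_second)  # True
```

Without `shuffle=True`, `KFold` splits the data in its original order, so the seed plays no role there.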

Answer by Ganesh

random_state makes the split random, but with a twist: for a given value of random_state, the order of the data is always the same. Note that it is not a boolean. If you pass any integer, starting from 0, as random_state, the resulting order is fixed for that value. For example, the order you get with random_state=0 stays the same. If you then run with random_state=5 and come back to random_state=0, you get the same order again. The same holds for every other integer. However, random_state=None splits randomly each time.
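The "go away and come back" behaviour described here can be checked with a short sketch. The toy data and the `held_out` helper are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)

def held_out(seed):
    """Return the test portion of a 70/30 split for the given seed."""
    _, X_test = train_test_split(X, test_size=0.3, random_state=seed)
    return X_test.tolist()

first = held_out(0)
other = held_out(5)   # a different seed in between
again = held_out(0)   # back to the original seed

print(first, other, again)
print(first == again)  # True: seed 0 gives the same split as before
```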

If you still have doubts, watch this

Answer by Rishi Bansal

If you don't specify random_state in the code, a new random value is generated every time you execute it, and the train and test datasets will contain different values each time.

However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e., the same values in the train and test datasets.

Answer by Debasish Bhol

We use the random_state parameter to make the shuffling of the training dataset after each epoch reproducible.
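This answer appears to describe estimators that reshuffle the training data on every epoch, rather than train_test_split itself. A minimal sketch of that reading, assuming scikit-learn's `SGDClassifier` (which the answer does not name) and a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_coefficients(seed):
    """Fit an SGD classifier that shuffles the data each epoch."""
    clf = SGDClassifier(shuffle=True, max_iter=20, tol=None, random_state=seed)
    clf.fit(X, y)
    return clf.coef_

coef_a = fit_coefficients(0)
coef_b = fit_coefficients(0)

# With the same seed, the per-epoch shuffles repeat, so the fitted weights match.
print(np.allclose(coef_a, coef_b))  # True
```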

Answer by San

When random_state is set to an integer, train_test_split returns the same results on every execution.

When random_state is set to None, train_test_split returns different results on every execution.

See the example below:

from sklearn.model_selection import train_test_split

X_data = range(10)
y_data = range(10)

# A fixed integer seed (zero or any other integer) gives the same split on every call.
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=0)
    print(y_test)

print("*" * 30)

# random_state=None draws a fresh split on every call.
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)

Output:

[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
******************************
[4, 7, 6]
[4, 3, 7]
[8, 1, 4]
[9, 5, 8]
[6, 4, 5]

Answer by user13140964

random_state is None by default, which means that every time you run your program you get a different output, because the split between train and test varies.

random_state = any int value means that every time you run your program you get the same output, because the split between train and test does not vary.
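A related detail, offered as a sketch rather than a recommendation: when random_state is left as None, scikit-learn falls back to NumPy's global random state, so seeding that global generator with `np.random.seed` should also make the split repeatable. The helper name is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)

def split_with_global_seed(seed):
    """Seed NumPy's global generator, then split with random_state=None."""
    np.random.seed(seed)
    _, X_test = train_test_split(X, test_size=0.3)  # random_state defaults to None
    return X_test.tolist()

run_a = split_with_global_seed(0)
run_b = split_with_global_seed(0)

print(run_a == run_b)  # True: the global seed was reset before each split
```

Passing an integer directly to random_state is usually easier to reason about, since it does not depend on what other code has drawn from the global generator.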

Answer by hari

Across multiple executions of our model, random_state makes sure that the training and testing datasets contain the same data values. It fixes the order of the data for train_test_split.

Answer by Farzana Khan

If you don't specify random_state in your code, then every time you run (execute) your code a new random value is generated, and the train and test datasets will contain different values each time.

However, if a fixed value is assigned, such as random_state = 0, 1, or 42, then no matter how many times you execute your code the result will be the same, i.e., the same values in the train and test datasets.