bash: Shuffling lines of a file with a fixed seed?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/5914513/

Time: 2020-09-17 23:57:10  Source: igfitidea

Shuffling lines of a file with a fixed seed?

Tags: bash, sorting

Asked by Flethuseo

I want to shuffle the lines of a file with a fixed seed so that I always get the same random order. The command I am using is as follows:

sort -R file.txt | head -200 > file.sff

What change could I make so that it sorts with a fixed random seed?

Answered by Charles Duffy

The GNU implementation of sort has a --random-source argument. Passing this argument the name of a file with known contents makes the output order reproducible.

See the "Random sources" documentation in the GNU coreutils manual, which contains the following sample implementation and example:

get_seeded_random()
{
  seed="$1"  # the seed is passed as the first argument
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

shuf -i1-100 --random-source=<(get_seeded_random 42)

Since GNU sort is also part of coreutils, the relevant documentation applies there as well:

sort --random-source=<(get_seeded_random 42) -R file.txt | head -200 > file.sff
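
As a rough sketch of how the pieces above fit together, the seeded byte stream can be wrapped in a small helper so that reproducibility is easy to check. The function name seeded_shuffle and its argument order are made up for this illustration; the sketch assumes bash, GNU shuf, and openssl are installed. It uses shuf rather than sort -R because sort -R orders lines by a hash of their keys, so identical lines end up next to each other.

```bash
#!/usr/bin/env bash
# Sketch: reproducible shuffle with GNU shuf and a seeded byte stream.
# Assumes GNU coreutils (shuf) and openssl are available.

# Emit an endless, deterministic byte stream derived from the seed ($1).
get_seeded_random() {
  local seed="$1"
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

# seeded_shuffle SEED FILE [COUNT]   (hypothetical helper, not a real tool)
# Print FILE's lines in an order determined by SEED; if COUNT is given,
# print only COUNT of them.
seeded_shuffle() {
  local seed="$1" file="$2" count="${3:-}"
  if [[ -n "$count" ]]; then
    shuf -n "$count" --random-source=<(get_seeded_random "$seed") -- "$file"
  else
    shuf --random-source=<(get_seeded_random "$seed") -- "$file"
  fi
}
```

With these definitions, `seeded_shuffle 42 file.txt 200 > file.sff` should write the same 200 lines on every run, as long as the input file and tool versions stay the same.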

Answered by ghoti

You may not need to use external tools like sort, whose options and usage may vary depending on your operating system. Bash has an internal random number generator accessible through the $RANDOM variable. It's common practice to seed the generator by setting the variable, like so:

RANDOM=$$

or

RANDOM=$(date '+%s')

etc. But of course, you can also use a predictable seed in order to get predictable not-so-random results:

$ RANDOM=12345; echo $RANDOM
28207
$ RANDOM=12345; echo $RANDOM
28207

To reorder the lines of a file randomly, you can first read the file into an array using mapfile:

$ mapfile -t a < source.txt

Then simply rewrite the array indices:

$ for i in "${!a[@]}"; do a[$((RANDOM+${#a[@]}))]="${a[$i]}"; unset "a[$i]"; done

When expanding a non-associative (indexed) array, bash returns the elements in ascending order of index value, so the randomized indices determine the output order.

Note that the new index for each line has the number of array elements added to it, to avoid collisions with the original indices. This solution is still fallible: there's no guarantee that $RANDOM will produce unique numbers, and a repeated index silently overwrites an earlier line. You can eliminate that risk with extra code that checks for prior use of each index, or reduce it with bit-shifting:

... a[$(( (RANDOM<<15)+RANDOM+${#a[@]} ))]= ...

This makes your index values 30-bit unsigned integers instead of 15-bit unsigned integers.
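
The collision check mentioned above could look something like the following sketch. The function name shuffle_by_index is invented for illustration, and the sketch makes one change to the approach: it writes into a second array instead of rewriting the source array in place, so no offset is needed. It assumes bash 4 or later (for mapfile); the exact order produced for a given seed may differ between bash versions.

```bash
#!/usr/bin/env bash
# Sketch: random index assignment with an explicit collision check.
# shuffle_by_index is a hypothetical helper name; needs bash 4+ (mapfile).

# shuffle_by_index SEED FILE
shuffle_by_index() {
  local seed="$1" file="$2"
  local -a src=() dst=()
  local i idx

  mapfile -t src < "$file"
  RANDOM=$seed                 # fixed seed => same sequence on every run

  for i in "${!src[@]}"; do
    # Draw a 30-bit index and redraw until the slot is still unused.
    idx=$(( (RANDOM << 15) + RANDOM ))
    while [[ -n "${dst[idx]+set}" ]]; do
      idx=$(( (RANDOM << 15) + RANDOM ))
    done
    dst[idx]="${src[i]}"
  done

  # Indexed arrays expand in ascending index order, which is now random.
  if (( ${#dst[@]} > 0 )); then
    printf '%s\n' "${dst[@]}"
  fi
}
```

For example, `shuffle_by_index 12345 source.txt | head -200` would print the first 200 lines of the shuffled order without losing lines to index collisions.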

Answered by David W.

If you're randomly shuffling lines, you're not sorting. I haven't seen a sort with a --random-source option before. It'd be interesting if it does exist. However, that's not sorting the lines in a fixed order.

I believe you'll have to write a program to do that, and I don't think Bash can quite do it.

Actually, it might. The $RANDOM shell variable yields a random number from 0 to 32767. You can assign a seed to RANDOM, and the same sequence of random numbers will then be produced every time. You can use a card dealing algorithm: read each line into a Bash array, then use the card dealing algorithm to pick each line.

I'm not going to write a test program -- especially in Bash, but you should get the idea.
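
For readers who do want something concrete, here is one possible sketch of the card dealing idea, written as a Fisher-Yates shuffle driven by a seeded $RANDOM. It is not the answer author's code: the function name deal_lines is made up, bash 4 or later is assumed for mapfile, and the order produced for a given seed may differ between bash versions.

```bash
#!/usr/bin/env bash
# Sketch: "card dealing" (Fisher-Yates) shuffle driven by a seeded $RANDOM.
# deal_lines is a hypothetical helper name; needs bash 4+ (mapfile).

# deal_lines SEED FILE
deal_lines() {
  local seed="$1" file="$2"
  local -a lines=()
  local i j tmp

  mapfile -t lines < "$file"
  RANDOM=$seed                 # fixed seed => same sequence on every run

  # Walk from the last line down, swapping each line with a randomly
  # chosen line at or below its position.
  for (( i = ${#lines[@]} - 1; i > 0; i-- )); do
    # Combine two 15-bit draws so files longer than 32768 lines still work.
    # The modulo introduces a slight bias, which a sketch can tolerate.
    j=$(( ((RANDOM << 15) | RANDOM) % (i + 1) ))
    tmp="${lines[i]}"
    lines[i]="${lines[j]}"
    lines[j]="$tmp"
  done

  if (( ${#lines[@]} > 0 )); then
    printf '%s\n' "${lines[@]}"
  fi
}
```

Used in place of the original command, this would read `deal_lines 12345 file.txt | head -200 > file.sff`. Unlike random index assignment, swapping in place cannot lose a line, so no collision check is needed.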
