Bash: select random lines from a file

Notice: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/9245638/

Date: 2020-09-09 21:38:49  Source: igfitidea

Select random lines from a file

bash shell random text-processing

Asked by user121196

In a Bash script, I want to pick N random lines from an input file and write them to another file.

How can this be done?

Answered by dogbane

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output

Answered by user881480

Sort the file randomly and pick the first 100 lines:

$ sort -R input | head -n 100 >output

Answered by Stein van Broekhoven

Well, according to a comment on the shuf answer, the commenter ran shuf over 78 000 000 000 lines in under a minute.

Challenge accepted...

First I needed a file of 78.000.000.000 lines:

seq 1 78 | xargs -n 1 -P 16 -I% seq 1 1000 | xargs -n 1 -P 16 -I% echo "" > lines_78000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000.txt > lines_78000000.txt
seq 1 1000 | xargs -n 1 -P 16 -I% cat lines_78000000.txt > lines_78000000000.txt

This gives me a file with 78 billion newlines ;-)
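
To make the sizes concrete, here is a quick arithmetic check of the three stages. It assumes each line is a single newline byte, which is what echo "" writes; the variable names are only for illustration.

```python
# Each stage concatenates the previous file 1000 times.
stage1 = 78 * 1000        # lines in lines_78000.txt
stage2 = stage1 * 1000    # lines in lines_78000000.txt
stage3 = stage2 * 1000    # lines in lines_78000000000.txt

# Assumption: one byte per line (just "\n"), so bytes == lines.
size_gb = stage3 / 10**9  # decimal gigabytes

print(f"{stage3:,} lines, about {size_gb:.0f} GB on disk")
```

So the final file needs roughly 78 GB of free disk space under that assumption.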

Now for the shuf part:

$ time shuf -n 10 lines_78000000000.txt










shuf -n 10 lines_78000000000.txt  2171.20s user 22.17s system 99% cpu 36:35.80 total

The bottleneck was the CPU: shuf did not use multiple threads, so it pinned 1 core at 100% while the other 15 were not used.

Python is what I regularly use so that's what I'll use to make this faster:

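#!/usr/bin/env python3
import random

# Count newlines by reading the file in 64 KiB chunks.
count = 0
with open("lines_78000000000.txt", "rt") as f:
    while True:
        buffer = f.read(65536)
        if not buffer:
            break
        count += buffer.count('\n')

    # Caveat: the file is at EOF here, and readline(size) only caps the number
    # of characters read, so these calls return '' rather than random lines.
    for i in range(10):
        f.readline(random.randint(1, count))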

This got me to just under a minute:

$ time ./shuf.py         










./shuf.py  42.57s user 16.19s system 98% cpu 59.752 total
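
Note that readline(size) in the script above limits how many characters are read; it does not jump to a line number, and the script never prints anything. If you want Python to actually return N random lines, one option is reservoir sampling (Algorithm R), which makes a single pass and does not need the line count up front. This is a swapped-in technique, not what was timed above, and the name reservoir_sample is made up for this sketch. I have not benchmarked it; iterating line by line in Python will likely be much slower than the chunked counter on a file this size.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(lines: Iterable[str], n: int,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Return up to n lines chosen uniformly at random in one pass."""
    rng = rng or random.Random()
    reservoir: List[str] = []
    for index, line in enumerate(lines):
        if index < n:
            # Fill the reservoir with the first n lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability n / (index + 1),
            # overwriting a uniformly chosen slot.
            slot = rng.randint(0, index)
            if slot < n:
                reservoir[slot] = line
    return reservoir
```

Usage would look something like this: open the input with open("input", "rt") as f, call reservoir_sample(f, 10), and write the returned lines to the output file. The result is a uniform sample, but its order is not shuffled.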

I did this on a Lenovo X1 Extreme 2nd gen with the i9 and a Samsung NVMe drive, which gives me plenty of read and write speed.

I know it can get faster, but I'll leave some room for others to give it a try.
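
One direction that might be worth trying, purely as an assumption on my part and not something measured here: count newlines in binary mode so Python skips text decoding, and use a larger chunk size. The function name count_newlines and the 1 MiB default are hypothetical choices for this sketch.

```python
def count_newlines(path: str, chunk_size: int = 1 << 20) -> int:
    """Count newline bytes in a file by reading fixed-size binary chunks."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # bytes.count scans the chunk without decoding it to str.
            count += chunk.count(b"\n")
    return count
```

A final line without a trailing newline is not counted, which matches what wc -l reports.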

Line counter source: Luther Blissett
