Python: randomly mixing the lines of a 3-million-line file

Question

Asked by Nigu

Everything is in the title. I'm wondering if anyone knows a quick way, with reasonable memory demands, of randomly mixing all the lines of a 3-million-line file. I guess it is not possible with a simple vim command, so any simple script using Python would do. I tried with Python using a random number generator, but did not manage to find a simple way out.

Answer 1

Accepted answer by S.Lott

import random
with open('the_file','r') as source:
    data = [ (random.random(), line) for line in source ]
data.sort()
with open('another_file','w') as target:
    for _, line in data:
        target.write( line )

That should do it. 3 million lines will fit into most machines' memory unless the lines are HUGE (over 512 characters).

Answer 2

Answer by John Kugelman

Takes only a few seconds in Python:

>>> import random
>>> lines = open('3mil.txt').readlines()
>>> random.shuffle(lines)
>>> open('3mil.txt', 'w').writelines(lines)
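
The same idea can be wrapped in a small function that closes its files and writes to a separate output path instead of overwriting the input. This is only a sketch: the name shuffle_file, its parameters and the optional seed are assumptions made for illustration, not part of the answer above. It also appends a newline to a final line that lacks one, so that line cannot get glued to another after shuffling.

```python
import random


def shuffle_file(src_path, dst_path, seed=None):
    """Read all lines of src_path, shuffle them in memory, write them to dst_path."""
    rng = random.Random(seed)  # a seed makes the shuffle reproducible
    with open(src_path, "r") as source:
        lines = source.readlines()
    # Assumption: a last line without a trailing newline should still end
    # up as a line of its own after shuffling.
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    rng.shuffle(lines)
    with open(dst_path, "w") as target:
        target.writelines(lines)
```

Called as shuffle_file('3mil.txt', 'shuffled.txt'), it leaves the original file untouched.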

Answer 3

Answer by fuzzyTew

On many systems the sort shell command takes -R to randomize its input.

Answer 4

Answer by S.Lott

Here's another version

At the shell, use this.

python decorate.py | sort | python undecorate.py

decorate.py

import sys
import random
for line in sys.stdin:
    sys.stdout.write( "{0}|{1}".format( random.random(), line ) )

undecorate.py

import sys
for line in sys.stdin:
    # Drop the random prefix and the "|" separator; keep the original line.
    _, _, data = line.partition("|")
    sys.stdout.write(data)

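The Python scripts use almost no memory, since each one handles a single line at a time; the sorting itself is left to sort.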

Answer 5

Answer by sleepynate

This is the same as Mr. Kugelman's, but using vim's built-in python interface:

:py import vim, random as r; cb = vim.current.buffer ; l = cb[:] ; r.shuffle(l) ; cb[:] = l

Answer 6

Answer by Lennart Regebro

If you do not want to load everything into memory and sort it there, you have to store the lines on disk while doing random sorting. That will be very slow.

Here is a very simple, stupid and slow version. Note that this may take a surprising amount of disk space, and it will be very slow. I ran it with 300,000 lines, and it took several minutes. 3 million lines could very well take an hour. So: do it in memory. Really. It's not that big.

import os
import random
import shutil
import tempfile

# Scratch directory that will hold one small file per input line.
tempdir = tempfile.mkdtemp()
print(tempdir)

files = []
# Split the lines: write each input line to its own numbered file.
with open('/tmp/sorted.txt', 'rt') as infile:
    for counter, line in enumerate(infile):
        outfilename = os.path.join(tempdir, '%09i.txt' % counter)
        with open(outfilename, 'wt') as outfile:
            outfile.write(line)
        files.append(outfilename)

# Pick the per-line files in random order and concatenate them.
with open('/tmp/random.txt', 'wt') as outfile:
    while files:
        index = random.randint(0, len(files) - 1)
        filename = files.pop(index)
        with open(filename, 'rt') as linefile:
            outfile.write(linefile.read())

shutil.rmtree(tempdir)

Another version would be to store the lines in an SQLite database and pull them randomly from that database. That is probably going to be faster than this.
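
A minimal sketch of that SQLite idea, under stated assumptions: the function name shuffle_with_sqlite, the table name lines and the column sort_key are all made up for illustration. Each line is stored together with a random key in a temporary on-disk database, and the lines are then read back ordered by that key. I have not timed it against the version above.

```python
import os
import random
import sqlite3
import tempfile


def shuffle_with_sqlite(src_path, dst_path):
    """Shuffle the lines of src_path into dst_path via a temporary SQLite table."""
    with tempfile.TemporaryDirectory() as tmpdir:
        conn = sqlite3.connect(os.path.join(tmpdir, "lines.sqlite"))
        try:
            # Hypothetical schema: one row per line plus a random sort key.
            conn.execute("CREATE TABLE lines (sort_key REAL, line TEXT)")
            with open(src_path, "r") as source:
                conn.executemany(
                    "INSERT INTO lines VALUES (?, ?)",
                    ((random.random(), line) for line in source),
                )
            conn.commit()
            # Reading back in key order yields the lines in random order.
            with open(dst_path, "w") as target:
                query = "SELECT line FROM lines ORDER BY sort_key"
                for (line,) in conn.execute(query):
                    target.write(line)
        finally:
            conn.close()
```

The Python side only ever holds one line at a time here; how much memory or temporary disk space SQLite uses for the ORDER BY depends on its configuration.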

Answer 7

Answer by Drag0

I just tried this on a file with 4.3M lines, and the fastest option was the 'shuf' command on Linux. Use it like this:

shuf huge_file.txt -o shuffled_lines_huge_file.txt

It took 2-3 seconds to finish.

Answer 8

Answer by Aziz Alto

Here is another way using random.choice. It may also provide some gradual memory relief as the list shrinks, but with a worse Big-O, since every list.remove call is a linear scan :)

from random import choice

with open('data.txt', 'r') as r:
    lines = r.readlines()

with open('shuffled_data.txt', 'w') as w:
    while lines:
        l = choice(lines)
        lines.remove(l)
        w.write(l)
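
If the quadratic cost matters, one possible variation keeps the shrinking-list behaviour but removes each pick in constant time by swapping it with the last element before popping. This is a sketch of a different technique (essentially a Fisher-Yates shuffle that consumes the list), not the code above; the name drain_in_random_order is an assumption for illustration.

```python
import random


def drain_in_random_order(lines):
    """Yield the items of `lines` in random order, emptying the list as it goes."""
    while lines:
        i = random.randrange(len(lines))
        # Move the chosen item to the end so that pop() removes it in O(1).
        lines[i], lines[-1] = lines[-1], lines[i]
        yield lines.pop()
```

It can replace the while loop directly: read the file with readlines() as before, then call w.writelines(drain_in_random_order(lines)).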

Answer 9

Answer by sergio

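The following Vimscript can be used to move randomly chosen lines to random positions. It shells out to shuf; adjust nswaps, firstline and lastline to suit your buffer: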

function! Random()                                                       
  let nswaps = 100                                                       
  let firstline = 1                                                     
  let lastline = 10                                                      
  let i = 0                                                              
  while i <= nswaps                                                      
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe line.'d'                                                         
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe "normal! " . line . 'Gp'                                         
    let i += 1                                                           
  endwhile                                                               
endfunction

Select the function in visual mode and type :@" then execute it with :call Random()

Answer 10

Answer by Kumaresp

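This will do the trick. My solution doesn't even use random, and it will also remove duplicate lines. Note that the resulting order comes from set iteration, which is arbitrary rather than a uniform shuffle.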

import sys
# set() drops duplicate lines; its iteration order is arbitrary.
with open(sys.argv[1]) as source:
    lines = list(set(source.readlines()))
sys.stdout.write(''.join(lines))

In the shell:

python shuffler.py nameoffilestobeshuffled.txt > shuffled.txt
