Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/1456617/

Time: 2020-11-03 22:18:07  Source: igfitidea

Return a random word from a word list in python

python

Asked by kzh

I would like to retrieve a random word from a file using Python, but I do not believe my method below is the best or most efficient. Please assist.

import fileinput
import _random
file = [line for line in fileinput.input("/etc/dictionaries-common/words")]
rand = _random.Random()
print file[int(rand.random() * len(file))],

Answered by dcrosta

The random module defines choice(), which does what you want:

import random

# Read one word per line, stripping the trailing newline; close the file afterwards.
with open('/etc/dictionaries-common/words') as f:
    words = [line.strip() for line in f]
print(random.choice(words))

Note also that this assumes that each word is on its own line in the file. If the file is very big, or if you perform this operation frequently, you may find that constantly rereading the file hurts your application's performance.
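One way to avoid rereading the file on every call is to load the word list once and reuse it. The following is a minimal sketch, not part of the original answer; `load_words` and `random_word` are hypothetical helper names, and the default path is simply the one used in the question.

```python
import random
from functools import lru_cache


@lru_cache(maxsize=None)
def load_words(path):
    """Read the word list once per path; later calls reuse the cached tuple."""
    with open(path) as f:
        # Strip the trailing newline from each word and skip blank lines.
        return tuple(line.strip() for line in f if line.strip())


def random_word(path="/etc/dictionaries-common/words"):
    """Return a random word without rereading the file on every call."""
    return random.choice(load_words(path))
```

The cache trades memory for speed: the whole list stays in memory for the lifetime of the process, which is reasonable for a dictionary file but may not be for very large inputs.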

Answered by Nadia Alramli

Another solution is to use linecache.getline:

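import linecache
import random

# total_num_lines is the number of lines in the file, computed beforehand.
# linecache numbers lines from 1, and randint() includes both endpoints.
line_number = random.randint(1, total_num_lines)
linecache.getline('/etc/dictionaries-common/words', line_number)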

From the documentation:

The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.

EDIT: You can calculate the total number of lines once and store it, since the dictionary file is unlikely to change.
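A minimal sketch of counting the lines once and reusing the count, assuming the file does not change while the program runs. The names `_line_counts`, `count_lines` and `random_line` are hypothetical and not part of the original answer.

```python
import linecache
import random

_line_counts = {}  # hypothetical cache: filename -> number of lines


def count_lines(filename):
    """Count the lines in `filename` once and remember the result."""
    if filename not in _line_counts:
        with open(filename) as f:
            _line_counts[filename] = sum(1 for _ in f)
    return _line_counts[filename]


def random_line(filename):
    """Return a random line (without its newline) using linecache."""
    total = count_lines(filename)
    if total == 0:
        raise ValueError("file is empty: %s" % filename)
    # linecache numbers lines from 1; randint() includes both endpoints.
    line_number = random.randint(1, total)
    return linecache.getline(filename, line_number).rstrip("\n")
```

Only the first call pays for counting; later calls cost one random number plus a cached line lookup.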

Answered by jfs

>>> import random
>>> random.choice(list(open('/etc/dictionaries-common/words')))
'jaundiced\n'

It is efficient in terms of human time.

By the way, your implementation coincides with the one in the stdlib's random.py:

def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    return seq[int(self.random() * len(seq))]

Measure time performance

I was wondering about the relative performance of the presented solutions. The linecache-based one is the obvious favorite. How much slower is the random.choice one-liner compared to the honest algorithm implemented in select_random_line()?

# function                time per call    ratio to previous row
# nadia_known_num_lines   9.6e-06 seconds  1.00
# nadia                   0.056 seconds    5843.51
# jfs                     0.062 seconds    1.10
# dcrosta_no_strip        0.091 seconds    1.48
# dcrosta                 0.13 seconds     1.41
# mark_ransom_no_strip    0.66 seconds     5.10
# mark_ransom_choose_from 0.67 seconds     1.02
# mark_ransom             0.69 seconds     1.04

(Each function is called 10 times, so the timings reflect cached performance.)

These results show that the simple solution (dcrosta) is faster in this case than the more deliberate one (mark_ransom).

Code that was used for comparison (as a gist):

import linecache
import random
from timeit import default_timer


WORDS_FILENAME = "/etc/dictionaries-common/words"


def measure(func):
    measure.func_to_measure.append(func)
    return func
measure.func_to_measure = []


@measure
def dcrosta():
    words = [line.strip() for line in open(WORDS_FILENAME)]
    return random.choice(words)


@measure
def dcrosta_no_strip():
    words = [line for line in open(WORDS_FILENAME)]
    return random.choice(words)


def select_random_line(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line.strip()
        # count every line, not only the selected ones
        count = count + 1
    return selection


@measure
def mark_ransom():
    return select_random_line(WORDS_FILENAME)


def select_random_line_no_strip(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line
        # count every line, not only the selected ones
        count = count + 1
    return selection


@measure
def mark_ransom_no_strip():
    return select_random_line_no_strip(WORDS_FILENAME)


def choose_from(iterable):
    """Choose a random element from a finite `iterable`.

    If `iterable` is a sequence then use `random.choice()` for efficiency.

    Return tuple (random element, total number of elements)
    """
    selection, i = None, None
    for i, item in enumerate(iterable):
        if random.randint(0, i) == 0:
            selection = item

    return selection, (i+1 if i is not None else 0)


@measure
def mark_ransom_choose_from():
    return choose_from(open(WORDS_FILENAME))


@measure
def nadia():
    global total_num_lines
    total_num_lines = sum(1 for _ in open(WORDS_FILENAME))

    # linecache numbers lines from 1
    line_number = random.randint(1, total_num_lines)
    return linecache.getline(WORDS_FILENAME, line_number)


@measure
def nadia_known_num_lines():
    line_number = random.randint(1, total_num_lines)
    return linecache.getline(WORDS_FILENAME, line_number)


@measure
def jfs():
    return random.choice(list(open(WORDS_FILENAME)))


def timef(func, number=1000, timer=default_timer):
    """Return number of seconds it takes to execute `func()`."""
    start = timer()
    for _ in range(number):
        func()
    return (timer() - start) / number


def main():
    # measure time
    times = dict((f.__name__, timef(f, number=10))
                 for f in measure.func_to_measure)

    # print from fastest to slowest
    maxname_len = max(map(len, times))
    last = None
    for name in sorted(times, key=times.__getitem__):
        print("%s %4.2g seconds %.2f" % (name.ljust(maxname_len), times[name],
                                          last and times[name] / last or 1))
        last = times[name]


if __name__ == "__main__":
    main()

Answered by Mark Ransom

Pythonizing my answer from "What's the best way to return a random line in a text file using C?":

import random

def select_random_line(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        # Keep the current line with probability 1/(count + 1).
        if random.randint(0, count) == 0:
            selection = line.strip()
        count = count + 1
    return selection

print(select_random_line("/etc/dictionaries-common/words"))

Edit: the original version of my answer used readlines, which didn't work as I thought and was totally unnecessary. This version iterates through the file instead of reading it all into memory, and does so in a single pass, which should make it much more efficient than any answer I've seen thus far.
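The single-pass selection above is an instance of what is usually called reservoir sampling with a reservoir of size one: line number i (counting from 1) replaces the current selection with probability 1/i, which leaves every one of the n lines with the same overall chance 1/n. A minimal sketch to check this empirically on a small in-memory sequence; `pick_one` is a hypothetical stand-in that takes any iterable instead of a filename, and the seed and trial count are arbitrary choices.

```python
import random
from collections import Counter


def pick_one(iterable):
    """Single-pass uniform selection (reservoir sampling, reservoir size 1)."""
    selection = None
    for count, item in enumerate(iterable):
        # Item number count + 1 replaces the selection with probability 1/(count + 1).
        if random.randint(0, count) == 0:
            selection = item
    return selection


random.seed(12345)  # fixed seed so the run is repeatable
trials = 30000
counts = Counter(pick_one("abc") for _ in range(trials))
for letter in "abc":
    # Each share should come out close to 1/3.
    print(letter, round(counts[letter] / trials, 3))
```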

Generalized version

import random

def choose_from(iterable):
    """Choose a random element from a finite `iterable`.

    If `iterable` is a sequence then use `random.choice()` for efficiency.

    Return tuple (random element, total number of elements)
    """
    selection, i = None, None
    for i, item in enumerate(iterable):
        if random.randint(0, i) == 0:
            selection = item

    return selection, (i+1 if i is not None else 0)

Examples

try:
    from urllib2 import urlopen  # Python 2
except ImportError:
    from urllib.request import urlopen  # Python 3 (chunks are bytes there)

print(choose_from(open("/etc/dictionaries-common/words")))
print(choose_from(dict(a=1, b=2)))
print(choose_from(i for i in range(10) if i % 3 == 0))
print(choose_from(i for i in range(10) if i % 11 == 0 and i)) # empty
print(choose_from([0])) # one element
chunk, n = choose_from(urlopen("http://google.com"))
print((chunk[:20], n))

Output

('yeps\n', 98569)
('a', 2)
(6, 4)
(None, 0)
(0, 1)
('window._gjp && _gjp(', 10)

Answered by Jason Christa

I don't have code for you, but as far as an algorithm goes:

  1. Find the file's size
  2. Do a random seek with the seek() function
  3. Find the next (or previous) whitespace character
  4. Return the word that starts after that whitespace character
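The steps above can be sketched as follows. This is only an illustration under stated assumptions: the file holds one word per line with no blank lines, the newline is the whitespace separator, and `random_word_by_seek` is a hypothetical name. Note that the result is not uniformly random, because a word that follows a long line is more likely to be hit.

```python
import os
import random


def random_word_by_seek(path):
    """Jump to a random byte offset and return the word on the next line.

    Not uniform: a word that follows a long line is picked more often.
    """
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("file is empty: %s" % path)
    with open(path, "rb") as f:  # binary mode so byte offsets are exact
        f.seek(random.randrange(size))
        f.readline()  # discard the (probably partial) line we landed in
        line = f.readline()
        if not line:  # we landed in the last line: wrap around to the first
            f.seek(0)
            line = f.readline()
    return line.decode("utf-8", errors="replace").strip()
```

The appeal of this approach is that it reads only a couple of lines no matter how large the file is; the price is the bias toward words that follow long lines.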

Answered by Greg Hewgill

You could do this without using fileinput:

import random
data = open("/etc/dictionaries-common/words").readlines()
print(random.choice(data))

I have also used data instead of file because file is a predefined type in Python.

Answered by Oli

Efficiency and verbosity aren't the same thing in this case. It's tempting to go for the most beautiful, Pythonic approach that does everything in one or two lines, but for file I/O, stick with classic fopen-style, low-level interaction, even if it does take up a few more lines of code.

I could copy and paste some code and claim it to be my own (others can if they want) but have a look at this: http://mail.python.org/pipermail/tutor/2007-July/055635.html

Answered by pafcu

There are a few different ways to optimize this problem. You can optimize for speed, or for space.

If you want a quick but memory-hungry solution, read in the entire file using file.readlines() and then use random.choice().

If you want a memory-efficient solution, first check the number of lines in the file by calling somefile.readline() repeatedly until it returns "", then generate a random number smaller than the number of lines (say, n), seek back to the beginning of the file, and finally call somefile.readline() n times. The next call to somefile.readline() will return the desired random line. This approach wastes no memory holding "unnecessary" lines. Of course, if you plan on getting lots of random lines from the file, this will be horribly inefficient, and it's better to just keep the entire file in memory, as in the first approach.
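The memory-efficient variant described above might look like the following sketch. The function name `random_line_two_pass` is hypothetical; the steps follow the description: count with readline(), pick n, seek back, skip n lines, and read one more.

```python
import random


def random_line_two_pass(path):
    """Pick a random line while holding only one line in memory at a time."""
    with open(path) as f:
        # First pass: count lines by calling readline() until it returns "".
        total = 0
        while f.readline():
            total += 1
        if total == 0:
            raise ValueError("file is empty: %s" % path)

        n = random.randrange(total)  # 0 <= n < total
        f.seek(0)  # back to the beginning for the second pass
        for _ in range(n):  # skip the first n lines
            f.readline()
        # The next readline() returns the chosen line.
        return f.readline().rstrip("\n")
```

Every call reads the file up to twice, so this only pays off when random lines are needed rarely and memory matters more than speed.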