Python: Creating a list of every word from a text file without spaces, punctuation

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/18135967/

Time: 2020-08-19 09:58:26  Source: igfitidea

Creating a list of every word from a text file without spaces, punctuation

python

Asked by Tom F

I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.

The code I have at the moment is:

file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words

I think this works to split up all the words into a list, however I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lower case (because I want to be able to search in lower case and have both capitalized and lower case words show up). Any help would be fantastic :)

Answered by Brian H

Use the replace method.

mystring = mystring.replace(",", "")

If you want a more elegant solution that you can reuse many times over, read up on regular expressions (regex). Most languages support them, and they are extremely useful for more complicated replacements and the like.
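As a rough illustration of the regular-expression route mentioned above, the sketch below removes several punctuation marks in one pass with re.sub. It assumes Python 3; the helper name strip_marks and the particular set of characters are made up for this example.

```python
import re

def strip_marks(text):
    """Remove a handful of common punctuation marks (hypothetical helper)."""
    # The character class lists the marks to delete; extend it as needed.
    return re.sub(r"[,.!?;:]", "", text)

print(strip_marks("Hello, world! Ready?"))
```

A single pattern like this replaces a chain of separate replace calls, one per character.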

Answered by unutbu

A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:

import string

with open('screenplay.txt', 'rb') as f:
    content = f.read()
    content = content.translate(None, string.punctuation).lower()
    words = content.split()

print words
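The two-argument call translate(None, ...) and the print statement above are Python 2 forms. On Python 3, a comparable sketch could build a deletion table with str.maketrans; the function name words_without_punctuation and the sample sentence are assumptions made for this example, and an in-memory string stands in for the file contents.

```python
import string

def words_without_punctuation(content):
    """Delete ASCII punctuation, lowercase the text, and split on whitespace."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans("", "", string.punctuation)
    return content.translate(table).lower().split()

print(words_without_punctuation("Hello, Mr.Smith! How are you?"))
```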

Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], then you could replace all punctuation with spaces and then use str.split:

def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
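string.maketrans is the Python 2 spelling. Assuming Python 3, the same idea might be written with str.maketrans, as in this sketch; the name using_translate_py3 is invented here to keep it distinct from the function above.

```python
import string

def using_translate_py3(content):
    """Replace each ASCII punctuation character with a space, then split."""
    # Both arguments must have the same length: character i maps to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return content.translate(table).lower().split()

print(using_translate_py3("Hello, Mr.Smith!"))
```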


One problem you might encounter when using a positive regex pattern such as [a-z]+ is that it will only match ASCII characters. If the file has accented characters, the words would get split apart: Gruyère would become ['Gruy', 're'].

You could fix that by using re.split to split on punctuation. For example:

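import re

def using_re(content):
    # Escape the punctuation so that characters such as '\' and ']' are literal inside the class.
    pattern = r"[ \t\n%s]+" % re.escape(string.punctuation)
    return re.split(pattern, content.lower())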
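To make the accented-character point concrete, here is a small Python 3 sketch that compares the two approaches on a sample string. The sample text and variable names are assumptions for illustration; empty strings are filtered out because re.split can return them at the edges of the input.

```python
import re
import string

text = "Gruyère, please!"

# Positive ASCII-only pattern: the accented letter breaks the word apart.
ascii_only = re.findall(r"[a-z]+", text.lower())

# Splitting on whitespace and punctuation leaves accented letters alone.
pattern = r"[\s%s]+" % re.escape(string.punctuation)
split_words = [w for w in re.split(pattern, text.lower()) if w]

print(ascii_only)
print(split_words)
```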

However, using str.translate is faster:

In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop

In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
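The timings above come from IPython's %timeit. Outside IPython, one way to run a similar comparison is the standard timeit module. The sketch below assumes Python 3, uses a made-up sample string, and precompiles the table and pattern, so its numbers will not match the figures quoted above.

```python
import re
import string
import timeit

# Hypothetical sample input; a real screenplay would be much longer.
SAMPLE = "Hello, Mr.Smith! How are you today? " * 100
TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))
PATTERN = re.compile(r"[\s%s]+" % re.escape(string.punctuation))

def using_translate(content):
    return content.translate(TABLE).lower().split()

def using_re(content):
    # Drop the empty strings re.split can leave at the edges.
    return [w for w in PATTERN.split(content.lower()) if w]

t_translate = timeit.timeit(lambda: using_translate(SAMPLE), number=200)
t_re = timeit.timeit(lambda: using_re(SAMPLE), number=200)
print("translate: %.4f s, re.split: %.4f s" % (t_translate, t_re))
```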

Answered by 6502

You can use a simple regexp to create a set of all the words (sequences of one or more alphabetic characters):

import re
words = set(re.findall("[a-z]+", f.read().lower()))

With a set, each word will be included just once.

Using just findall will instead give you all the words, in order.
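The difference between the two results can be seen on a short sample. This is a minimal Python 3 sketch; the sample sentence and variable names are assumptions, and the set is sorted only so that the printed output is stable.

```python
import re

text = "The cat saw the other cat."

# findall keeps every occurrence, in the order it appears.
in_order = re.findall(r"[a-z]+", text.lower())

# A set keeps each distinct word once and has no defined order.
unique = set(in_order)

print(in_order)
print(sorted(unique))
```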

Answered by Brionius

This is a job for regular expressions!

For example:

import re
file = open('screenplay.txt', 'r')
# .lower() returns a version with all upper case characters replaced with lower case characters.
text = file.read().lower()
file.close()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())
print words
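For Python 3, the same substitution could be wrapped in a function and tried on an in-memory string, as sketched below. The function name and sample sentence are invented for this example. Note that the pattern also replaces accented letters with spaces, since they fall outside a-z.

```python
import re

def words_keeping_apostrophes(text):
    """Lowercase the text and keep only letters, spaces, and apostrophes."""
    # Anything that is not a-z, a space, or an apostrophe becomes a space.
    cleaned = re.sub(r"[^a-z ']+", " ", text.lower())
    return cleaned.split()

print(words_keeping_apostrophes("We're ALL mad here -- aren't we?"))
```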

Answered by Tiago Martins

You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.

replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())

Output:

abc321
cda123
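In Python 3, reduce lives in functools and print is a function, so the one-liner above needs small changes. A possible adaptation is sketched here; the helper name apply_replacements is an assumption made for this example.

```python
from functools import reduce

replace_chars = {".": "", ",": "", " ": ""}

def apply_replacements(text, mapping):
    """Apply every old -> new replacement in mapping to text, one after another."""
    # Iterating over a dict yields its keys, so each step replaces one character.
    return reduce(lambda s, old: s.replace(old, mapping[old]), mapping, text)

print(apply_replacements("ABC3.2,1,\nCda1,2,3....".lower(), replace_chars))
```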

Answered by Colonel Panic

Try the algorithm from https://stackoverflow.com/a/17951315/284795, i.e. split the text on whitespace, then trim punctuation. This carefully removes punctuation from the edges of words without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might want to add a .lower() call as well.
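Putting the split, strip, and lowercase steps together might look like the following Python 3 sketch. The function name trimmed_words is hypothetical, and the extra filter for empty strings is an addition here: it drops tokens that consist only of punctuation.

```python
import string

def trimmed_words(text):
    """Split on whitespace, trim punctuation from word edges, and lowercase."""
    stripped = (word.strip(string.punctuation).lower() for word in text.split())
    # Tokens made only of punctuation become empty strings; drop them.
    return [w for w in stripped if w]

text = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(trimmed_words(text))
```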

Answered by MatLecu

You can try something like this. It probably needs some work on the regexp, though.

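import re
with open('screenplay.txt') as f:
    text = f.read()
# Strip a few common punctuation marks from each whitespace-separated token and lowercase it.
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())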