Python: Creating a list of every word from a text file without spaces, punctuation
Notice: this page is a translated copy of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
Original URL: http://stackoverflow.com/questions/18135967/
Creating a list of every word from a text file without spaces, punctuation
Asked by Tom F
I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.
The code I have at the moment is:
file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words
I think this works to split up all the words into a list; however, I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lower case (because I want to be able to search in lower case and have both capitalized and lower case words show up). Any help would be fantastic :)
Answered by Brian H
Use the replace method.
mystring = mystring.replace(",", "")
If you want a more elegant solution that you can reuse many times over, read up on regular expressions (regex). Most languages support them, and they are extremely useful for more complicated replacements and the like.
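To make the comparison concrete, here is a minimal Python 3 sketch of both ideas: chained str.replace calls and a single re.sub with a character class. The function names and the exact set of characters removed are assumptions chosen for illustration, not part of the original answer.

```python
import re


def remove_chars_with_replace(text, chars=",."):
    # Call str.replace once per unwanted character (hypothetical helper).
    for ch in chars:
        text = text.replace(ch, "")
    return text


def remove_chars_with_regex(text):
    # One pass: delete any character listed in the character class.
    # The class below is an illustrative choice, not a complete punctuation list.
    return re.sub(r"[,.!?;:]", "", text)


print(remove_chars_with_replace("Hello, world."))
print(remove_chars_with_regex("Wait, what?! Yes; no: ok."))
```

The regex version scales better as the list of unwanted characters grows, since it needs only one call regardless of how many characters are in the class.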
Answered by unutbu
A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:
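import string
with open('screenplay.txt', 'rb') as f:
    content = f.read()
# Python 2 form of translate: None means "no character mapping",
# and the second argument lists the characters to delete.
content = content.translate(None, string.punctuation).lower()
words = content.split()
print words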
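The snippet above relies on the Python 2 form of translate. As a rough Python 3 equivalent, assuming the text has already been read as a str, the deletion table can be built with str.maketrans. The function name here is hypothetical.

```python
import string


def words_without_punctuation(content):
    # str.maketrans("", "", chars) builds a table that deletes every
    # character in chars when passed to str.translate.
    table = str.maketrans("", "", string.punctuation)
    return content.translate(table).lower().split()


print(words_without_punctuation("Mr.Smith said: Hello, World!"))
```

To apply it to a file, one could pass in the result of reading the file in text mode, for example the contents of screenplay.txt from the question.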
Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], then you could replace all punctuation with spaces, and then use str.split:
def using_translate(content):
    # Python 2: map every punctuation character to a space
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
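string.maketrans is a Python 2 function. A possible Python 3 counterpart, sketched under the assumption that content is a str, uses str.maketrans with two equal-length strings. The name using_translate_py3 is made up for this example.

```python
import string


def using_translate_py3(content):
    # Map each punctuation character to a space, so "Mr.Smith"
    # becomes "Mr Smith" before splitting on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return content.translate(table).lower().split()


print(using_translate_py3("Mr.Smith went home."))
```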
One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ASCII characters. If the file has accented characters, the words would get split apart. Gruyère, once lowercased, would become ['gruy', 're'].
You could fix that by using re.split to split on punctuation. For example,
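import re
def using_re(content):
    # re.escape keeps characters such as ], \ and - literal inside the character class
    words = re.split(r"[ %s\t\n]+" % re.escape(string.punctuation), content.lower())
    # re.split leaves empty strings at leading or trailing separators; drop them
    return [w for w in words if w]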
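The difference between the two approaches can be seen in a small Python 3 sketch. The sample text and variable names are assumptions for illustration; the accented letter is written as an escape so the example does not depend on file encoding.

```python
import re
import string

text = "Gruy\u00e8re, please!"

# Positive pattern: only ASCII letters count as word characters,
# so the accented letter acts as a separator.
ascii_words = re.findall(r"[a-z]+", text.lower())

# Negative pattern: split on whitespace and ASCII punctuation instead,
# which leaves accented letters inside the words.
separators = r"[\s%s]+" % re.escape(string.punctuation)
split_words = [w for w in re.split(separators, text.lower()) if w]

print(ascii_words)
print(split_words)
```

Note that string.punctuation only covers ASCII punctuation, so typographic quotes or dashes in a real screenplay would still stay attached to words with this approach.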
However, using str.translate is faster:
In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop
In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
Answered by 6502
You can use a simple regexp for creating a set with all words (sequences of one or more alphabetic characters):
import re
words = set(re.findall("[a-z]+", f.read().lower()))
Using a set, each word will be included just once. Just using findall will instead give you all the words in order.
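A short Python 3 sketch of the difference, using a made-up sentence in place of the file contents:

```python
import re

text = "The cat saw the other cat."

# findall alone keeps every occurrence, in document order.
all_words = re.findall(r"[a-z]+", text.lower())

# Wrapping the result in a set drops duplicates (and the ordering).
unique_words = set(all_words)

print(all_words)
print("cat" in unique_words)
```

A set suits the asker's stated goal of searching later, since membership tests on a set do not need to scan the whole collection.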
Answered by Brionius
This is a job for regular expressions!
For example:
import re
# the with block closes the file automatically
with open('screenplay.txt', 'r') as f:
    # .lower() returns a version with all upper case characters replaced with lower case characters.
    text = f.read().lower()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub(r"[^a-z ']+", " ", text)
words = text.split()
print words
Answered by Tiago Martins
You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.
replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())
Output:
abc321
cda123
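In Python 3, reduce is no longer a builtin and print is a function. A minimal sketch of the same dictionary-driven approach, assuming Python 3 and using hypothetical names, might look like this:

```python
from functools import reduce

# Characters to remove, each mapped to its replacement (here: nothing).
replace_chars = {".": "", ",": "", " ": ""}


def apply_replacements(text, replacements):
    # Fold over the dictionary keys, applying one str.replace per key.
    return reduce(
        lambda acc, ch: acc.replace(ch, replacements[ch]),
        replacements,
        text,
    )


result = apply_replacements("ABC3.2,1,\nCda1,2,3....".lower(), replace_chars)
print(result)
```

Because the newline is not in the dictionary, the two lines of input stay on separate lines, matching the output shown above.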
Answered by Colonel Panic
Try the algorithm from https://stackoverflow.com/a/17951315/284795, i.e. split text on whitespace, then trim punctuation. This carefully removes punctuation from the edge of words, without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
You might want to add a .lower() call as well.
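Putting the split, strip, and lowercase steps together, a possible Python 3 sketch is shown below. The function name is hypothetical, and dropping tokens that consist only of punctuation is an added assumption, not something the answer specifies.

```python
import string


def edge_stripped_words(text):
    words = []
    for token in text.split():
        # Trim punctuation from both ends only; apostrophes inside
        # a word such as "we're" are left alone.
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words


print(edge_stripped_words("'Oh, you can't help that,' said the Cat: 'we're all mad here."))
```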
Answered by MatLecu
You can try something like this. Probably need some work on the regexp though.
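import re
with open('screenplay.txt', 'r') as f:
    text = f.read()
# strip common punctuation from each whitespace-separated token, then lowercase it
words = [re.sub("[,.!?]", "", x).lower() for x in text.split()]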