如何在python中将文本文件拆分为其单词?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/19720311/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me):
StackOverFlow
How to split a text file to its words in python?
提问by MACEE
I am very new to python and also didn't work with text before...I have 100 text files, each has around 100 to 150 lines of unstructured text describing patient's condition. I read one file in python using:
我对 python 非常陌生,之前也没有使用过文本……我有 100 个文本文件,每个文件都有大约 100 到 150 行描述患者病情的非结构化文本。我使用以下命令在 python 中读取了一个文件:
with open("C:\...\...\...\record-13.txt") as f:
    content = f.readlines()
print(content)
Now I can split each line of this file into its words using, for example:
现在我可以把这个文件的每一行拆分成单词,例如使用:
a = content[0].split()
print (a)
but I don't know how to split the whole file into words. Do loops (while or for) help with that?
但我不知道如何把整个文件拆分成单词。循环(while 或 for)能帮上忙吗?
Thank you for your help, guys. Your answers helped me write this (in my file, words are separated by spaces, so I think that's the delimiter!):
谢谢大家的帮助。你们的回答帮助我写出了下面的代码(在我的文件中,单词以空格分隔,所以我认为分隔符就是空格!):
with open("C:\...\...\...\record-13.txt") as f:
    lines = f.readlines()
    for line in lines:
        words = line.split()
        for word in words:
            print(word)
that simply splits the file into words and prints them one per line.
这样就把文件拆分成了单词并逐个打印(一行一个单词)。
采纳答案by Travis Griggs
Nobody has suggested a generator, I'm surprised. Here's how I would do it:
居然没有人建议使用生成器,我很惊讶。下面是我的做法:
def words(stringIterable):
    # upcast the argument to an iterator; if it's an iterator already, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream:  # enumerate the lines
        for word in line.split():  # further break them down
            yield word
Now this can be used both on simple lists of sentences that you might have in memory already:
现在,它既可以用于你可能已经存放在内存中的简单句子列表:
listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)
But it will work just as well on a file, without needing to read the whole file in memory:
但它同样适用于文件,而无需把整个文件读入内存:
with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)
回答by Paul Draper
with open("C:\...\...\...\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print(word)
Or, this gives you a list of words
或者,这会给你一个单词列表
with open("C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]
Or, this gives you a list of lines, but with each line as a list of words.
或者,这会给你一个行列表,但每行都是一个单词列表。
with open("C:\...\...\...\record-13.txt") as f:
    words = [line.split() for line in f]
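To illustrate the difference between the two comprehensions, here is a sketch that substitutes a hypothetical in-memory list of lines for the file object (both comprehensions accept any iterable of strings):

```python
lines = ["the patient is stable", "no change today"]

# flat list of all words across all lines
flat = [word for line in lines for word in line.split()]
# one sub-list of words per line
nested = [line.split() for line in lines]

print(flat)    # ['the', 'patient', 'is', 'stable', 'no', 'change', 'today']
print(nested)  # [['the', 'patient', 'is', 'stable'], ['no', 'change', 'today']]
```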
回答by tobyodavies
The most flexible approach is to use list comprehension to generate a list of words:
最灵活的方法是使用列表推导式来生成单词列表:
with open("C:\...\...\...\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]
# Do what you want with the words list
Which you can then iterate over, add to a collections.Counter, or anything else you please.
然后你可以迭代它、把它添加到 collections.Counter,或者做任何你想做的事情。
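For instance, feeding the comprehension's output to collections.Counter gives word frequencies directly. A sketch with a hypothetical in-memory list standing in for the open file:

```python
from collections import Counter

lines = ["pain in left arm", "pain resolved"]
# count every word produced by the comprehension
word_counts = Counter(word for line in lines for word in line.split())
print(word_counts.most_common(1))  # [('pain', 2)]
```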
回答by starrify
It depends on how you define words, or what you regard as the delimiters. Notice that string.split in Python receives an optional delimiter parameter, so you could pass it like this:
这取决于你如何定义 words,或者你把什么当作分隔符(delimiters)。注意 Python 中的 string.split 接收一个可选的分隔符参数,因此你可以这样传入:
for lines in content[0].split():
    for word in lines.split(','):
        print(word)
Unfortunately, string.split receives a single delimiter only, so you may need multi-level splitting like this:
不幸的是,string.split 一次只能接收一个分隔符,因此你可能需要像这样的多级拆分:
for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'):
                            if word != "":
                                print(word)
Looks ugly, right? Luckily we can use iteration instead:
看起来很丑对吧?幸运的是,我们可以使用迭代来代替:
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
EDITED: Or we could simply use the regular expression package:
编辑:或者我们可以直接使用正则表达式包:
import re

delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
# escape each delimiter so regex metacharacters like '.' and '?' are matched literally
pattern = '|'.join(map(re.escape, delimiters))
words = re.split(pattern, ''.join(content))
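An alternative worth noting (not in the original answer) is to match the words themselves instead of splitting on delimiters; re.findall with a word-character pattern avoids the empty strings that re.split can leave between adjacent delimiters. A sketch on a hypothetical line of text:

```python
import re

text = "Patient stable. No pain? No! Discharge: tomorrow."
# match runs of letters, digits, underscores, or apostrophes
words = re.findall(r"[A-Za-z0-9_']+", text)
print(words)  # ['Patient', 'stable', 'No', 'pain', 'No', 'Discharge', 'tomorrow']
```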
回答by Bruno
I would use the Natural Language Toolkit, as the split() way does not deal well with punctuation.
我会使用自然语言工具包(NLTK),因为 split() 的方式不能很好地处理标点符号。
import nltk

for line in file:  # 'file' here stands for an open file object
    words = nltk.word_tokenize(line)
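Note that nltk.word_tokenize typically requires the NLTK tokenizer models to be downloaded first (e.g. nltk.download('punkt')). For a dependency-free rough approximation, a regex that separates words from punctuation can serve as a stand-in; the helper below is a hypothetical sketch, not equivalent to NLTK's tokenizer:

```python
import re

def rough_tokenize(text):
    # runs of letters/digits/apostrophes as words, or any other
    # non-space character as a single punctuation token
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

print(rough_tokenize("The patient's BP is 120/80, stable."))
# ['The', "patient's", 'BP', 'is', '120', '/', '80', ',', 'stable', '.']
```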