
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/19720311/

Date: 2020-08-19 14:25:55  Source: igfitidea

How to split a text file to its words in python?

python

Asked by MACEE

I am very new to Python and haven't worked with text before... I have 100 text files, each with around 100 to 150 lines of unstructured text describing a patient's condition. I read one file in Python using:


with open("C:\...\...\...\record-13.txt") as f:
    content = f.readlines()
    print (content) 

Now I can split each line of this file into its words using, for example:


a = content[0].split()
print (a)

But I don't know how to split the whole file into words. Do loops (while or for) help with that?




Thank you for your help, guys. Your answers helped me write this (in my file, words are separated by spaces, so I think that's the delimiter!):


with open ("C:\...\...\...\record-13.txt") as f:
  lines = f.readlines()
  for line in lines:
      words = line.split()
      for word in words:
          print (word)

which simply prints the words one per line.


Accepted answer by Travis Griggs

I'm surprised nobody has suggested a generator. Here's how I would do it:


def words(stringIterable):
    #upcast the argument to an iterator, if it's an iterator already, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream: #enumerate the lines
        for word in line.split(): #further break them down
            yield word

Now this can be used both on simple lists of sentences that you might have in memory already:


listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)

But it will work just as well on a file, without needing to read the whole file in memory:


with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)

Answered by Paul Draper

with open("C:\...\...\...\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print(word)

Or, this gives you a list of words:


with open("C:\...\...\...\record-13.txt") as f:
    words = [word for line in f for word in line.split()]

Or, this gives you a list of lines, but with each line as a list of words.


with open("C:\...\...\...\record-13.txt") as f:
    words = [line.split() for line in f]
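
To make the difference between these two shapes concrete, here is a small sketch using an in-memory list of sample lines (the sample text is invented for illustration; the real file path above is elided):

```python
# A small in-memory sample standing in for the file's lines.
lines = ["patient admitted with fever", "blood pressure stable"]

# Flat list: one word after another, across all lines.
flat = [word for line in lines for word in line.split()]
print(flat)    # ['patient', 'admitted', 'with', 'fever', 'blood', 'pressure', 'stable']

# Nested list: one sub-list of words per line.
nested = [line.split() for line in lines]
print(nested)  # [['patient', 'admitted', 'with', 'fever'], ['blood', 'pressure', 'stable']]
```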

Answered by tobyodavies

The most flexible approach is to use a list comprehension to generate a list of words:


with open("C:\...\...\...\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]

# Do what you want with the words list

Which you can then iterate over, add to a `collections.Counter`, or anything else you please.

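As a sketch of the `collections.Counter` idea mentioned above (the sample lines are invented for illustration):

```python
from collections import Counter

# Invented sample lines standing in for the file's contents.
lines = ["the patient is stable", "the patient is improving"]

# Counter accepts any iterable, so we can feed it the word generator directly.
counts = Counter(word for line in lines for word in line.split())
print(counts.most_common(2))  # [('the', 2), ('patient', 2)]
```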

Answered by starrify

It depends on how you define words, or what you regard as the delimiters.
Notice that `str.split` in Python accepts an optional separator argument, so you could pass it like this:


for lines in content[0].split():
    for word in lines.split(','):
        print(word)

Unfortunately, `str.split` accepts only a single separator, so you may need multi-level splitting like this:


for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'): 
                            if word != "":
                                print(word)

Looks ugly, right? Luckily we can use iteration instead:


delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words
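
For example, running the loop above on an invented sample line (splitting on adjacent delimiters leaves empty strings behind, so we filter them out at the end):

```python
# Invented sample input standing in for `content` (a list of lines).
content = ["Hello, world! How are you?"]

delimiters = ['\n', ' ', ',', '.', '?', '!', ':']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words

# Drop the empty strings produced by adjacent delimiters.
words = [w for w in words if w]
print(words)  # ['Hello', 'world', 'How', 'are', 'you']
```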

EDITED: Or we could simply use the regular-expression package:


import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
# Escape each delimiter, since '.', '?' and '!' are regex metacharacters,
# and join the lines first, because `content` is a list of strings.
words = re.split('|'.join(map(re.escape, delimiters)), ''.join(content))
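
A runnable sketch of this regular-expression route on an invented sample line; note that '.', '?' and '!' are regex metacharacters, so each delimiter is passed through `re.escape` before joining them into an alternation:

```python
import re

delimiters = ['\n', ' ', ',', '.', '?', '!', ':']
# Build a pattern like '\n| |,|\.|\?|!|:' with metacharacters escaped.
pattern = '|'.join(map(re.escape, delimiters))

line = "Hello, world! How are you?"  # invented sample line
words = [w for w in re.split(pattern, line) if w]  # drop empty strings
print(words)  # ['Hello', 'world', 'How', 'are', 'you']
```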

Answered by Bruno

I would use the Natural Language Toolkit (NLTK), as the `split()` way does not deal well with punctuation.


import nltk

with open("C:\...\...\...\record-13.txt") as f:
    for line in f:
        # word_tokenize splits punctuation into its own tokens, e.g. "you?" -> ["you", "?"]
        words = nltk.word_tokenize(line)