
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/16922214/

Date: 2020-08-19 00:00:20  Source: igfitidea

Reading a text file and splitting it into single words in python

Tags: python, string, split

Asked by Johnnerz

I have this text file made up of numbers and words, for example like this: 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician — and I want to split it so that each word or number comes up as a new line.


A whitespace separator would be ideal as I would like the words with the dashes to stay connected.


This is what I have so far:


f = open('words.txt', 'r')
for word in f:
    print(word)

Not really sure how to go on from here; I would like this to be the output:


09807754
18
n
3
aristocrat
...

Accepted answer by dawg

Given this file:


$ cat words.txt
line1 word1 word2
line2 word3 word4
line3 word5 word6

If you just want one word at a time (ignoring the meaning of spaces vs line breaks in the file):


with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

Prints:


line1
word1
word2
line2
...
word6 

Similarly, if you want to flatten the file into a single flat list of words, you might do something like this:


with open('words.txt') as f:
    flat_list=[word for line in f for word in line.split()]

>>> flat_list
['line1', 'word1', 'word2', 'line2', 'word3', 'word4', 'line3', 'word5', 'word6']

This can produce the same output as the first example with print('\n'.join(flat_list)).

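For example, a minimal Python 3 sketch of the flatten-and-join approach (the sample file contents here are an illustrative assumption, not the asker's actual data):

```python
# Sketch: flatten a small sample file into a list of words,
# then print one word per line. The file written here is
# illustrative only.
with open("words.txt", "w") as f:
    f.write("line1 word1 word2\nline2 word3 word4\n")

with open("words.txt") as f:
    flat_list = [word for line in f for word in line.split()]

print("\n".join(flat_list))
```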

Or, if you want a nested list of the words in each line of the file (for example, to create a matrix of rows and columns from a file):


with open('words.txt') as f:
    matrix=[line.split() for line in f]

>>> matrix
[['line1', 'word1', 'word2'], ['line2', 'word3', 'word4'], ['line3', 'word5', 'word6']]
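One thing the nested list buys you is row/column indexing; a quick sketch (the sample file written here is an assumption for illustration):

```python
# Sketch: build a rows-by-columns matrix from a whitespace-delimited
# file, then index into it. The sample file is illustrative only.
with open("words.txt", "w") as f:
    f.write("line1 word1 word2\nline2 word3 word4\nline3 word5 word6\n")

with open("words.txt") as f:
    matrix = [line.split() for line in f]

print(matrix[1][2])  # third word of the second line: 'word4'
```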

If you want a regex solution, which would allow you to filter wordN vs lineN type words in the example file:


import re
with open("words.txt") as f:
    for line in f:
        for word in re.findall(r'\bword\d+', line):
            print(word)  # prints each wordN, with no lineN

Or, if you want that to be a line by line generator with a regex:


with open("words.txt") as f:
    words = (word for line in f for word in re.findall(r'\w+', line))
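To actually consume that generator you could, for instance, take the first few words lazily; a sketch (the sample file written here is an illustrative assumption):

```python
import re

# Sketch: lazily pull words out of a file with a regex generator,
# so the whole file never sits in memory at once.
with open("words.txt", "w") as f:
    f.write("09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n")

with open("words.txt") as f:
    words = (word for line in f for word in re.findall(r"\w+", line))
    first_three = [next(words) for _ in range(3)]

print(first_three)  # ['09807754', '18', 'n']
```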

Answer by dugres

f = open('words.txt')
for word in f.read().split():
    print(word)

Answer by smac89

Here is my totally functional approach, which avoids having to read and split lines. It makes use of the itertools module:


Note: for Python 3, replace itertools.imap with map.


import itertools

def readwords(mfile):
    byte_stream = itertools.groupby(
        itertools.takewhile(lambda c: bool(c),
            itertools.imap(mfile.read,
                itertools.repeat(1))), str.isspace)

    return ("".join(group) for pred, group in byte_stream if not pred)

Sample usage:


>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python

It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:


with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)

Answer by pambda

As a supplement: if you are reading a very large file and don't want to read all of its content into memory at once, you might consider using a buffer and returning each word via yield:


def read_words(inputfile):
    with open(inputfile, 'r') as f:
        while True:
            buf = f.read(10240)
            if not buf:
                break

            # make sure we end on a space (word boundary)
            while not str.isspace(buf[-1]):
                ch = f.read(1)
                if not ch:
                    break
                buf += ch

            words = buf.split()
            for word in words:
                yield word
        yield ''  # handle the case where the file is empty

if __name__ == "__main__":
    for word in read_words('./very_large_file.txt'):
        process(word)
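The same chunk-and-carry idea can be sketched more compactly: read fixed-size chunks and carry any trailing partial word into the next chunk. The chunk_size and the sample file below are illustrative assumptions, not part of the original answer:

```python
def chunked_words(path, chunk_size=16):
    """Yield whitespace-separated words without loading the whole file."""
    carry = ""
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = carry + chunk
            words = chunk.split()
            if not chunk[-1].isspace() and words:
                # The chunk may have ended mid-word; save the partial
                # last token and prepend it to the next chunk.
                carry = words.pop()
            else:
                carry = ""
            for word in words:
                yield word
    if carry:
        yield carry

with open("big.txt", "w") as f:
    f.write("09807754 18 n 03 aristocrat 0 blue_blood 0 patrician")

print(list(chunked_words("big.txt", chunk_size=8)))
```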

Answer by Gaurav

What you can do is use nltk to tokenize the words and then store them all in a list; here's what I did. If you don't know nltk, it stands for Natural Language Toolkit and is used to process natural language. Here is a resource if you want to get started: [http://www.nltk.org/book/]


import nltk
from nltk.tokenize import word_tokenize

file = open("abc.txt", newline='')
result = file.read()
words = word_tokenize(result)
for i in words:
    print(i)

The output will be this:


09807754
18
n
03
aristocrat
0
blue_blood
0
patrician

Answer by mujad

with open(filename) as file:
    words = file.read().split()

It's a list of all the words in your file.


import re
with open(filename) as file:
    words = re.findall(r"([a-zA-Z\-]+)", file.read())
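Note that the character class [a-zA-Z\-] keeps hyphenated words together but matches neither digits nor underscores, so a token like blue_blood comes out as two pieces. A quick sketch (the sample text is an illustrative assumption):

```python
import re

# Sketch: [a-zA-Z\-]+ keeps hyphens inside words but drops digits,
# and it splits tokens at underscores (unlike \w).
text = "09807754 18 n 03 aristocrat 0 blue-blood 0 patrician"
print(re.findall(r"([a-zA-Z\-]+)", text))
# ['n', 'aristocrat', 'blue-blood', 'patrician']
```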