How to tokenize natural English text in an input file in Python?

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/12703842/

Posted: 2020-08-18 11:39:16  Source: igfitidea

How to tokenize natural English text in an input file in python?

python nltk

Asked by Target

I want to tokenize an input file in Python. Please suggest how to do this; I am a new user of Python.

I have read a little about regular expressions, but I am still somewhat confused, so please suggest a link or a code overview for this.
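
For the regular-expression route mentioned in the question, here is a minimal sketch using Python's built-in re module; the pattern is only an illustrative choice that keeps words and punctuation marks as separate tokens:

import re

text = "Hello, world! This is a test."
# \w+ matches runs of word characters, [^\w\s] matches single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']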

Answered by del

Try something like this:


import nltk

# read the whole file and split it into a flat list of word tokens
file_content = open("myfile.txt").read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
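
Depending on your NLTK installation, word_tokenize may first require the Punkt tokenizer models; a one-time setup sketch:

import nltk
nltk.download('punkt')  # downloads the Punkt tokenizer models used by word_tokenize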

The NLTK tutorial is also full of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html


Answered by alvas

Using NLTK


If your file is small:


  • Open the file with the context manager with open(...) as x,
  • then do a .read() and tokenize it with word_tokenize().

[code]:


from nltk.tokenize import word_tokenize

# read the whole file at once and tokenize it in a single call
with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())

If your file is larger:


  • Open the file with the context manager with open(...) as x,
  • read the file line by line with a for-loop,
  • tokenize each line with word_tokenize(),
  • output to your desired format (with the output file opened in write mode).

[code]:


from __future__ import print_function
from nltk.tokenize import word_tokenize

# stream the input line by line and write one line of space-separated tokens per input line
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)


Using SpaCy


from __future__ import print_function
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
tokenizer = Tokenizer(nlp.vocab)

# open the output file in write mode; calling the tokenizer on a string returns a Doc of Token objects
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = [token.text for token in tokenizer(line)]
        print(' '.join(tokens), end='\n', file=fout)
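
A shorter alternative sketch, assuming the same myfile.txt and tokens.txt files: the blank English() pipeline already bundles spaCy's rule-based tokenizer, so each line can be passed to nlp directly:

from spacy.lang.en import English

# a blank English pipeline performs tokenization only
nlp = English()

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        # calling nlp on a string returns a Doc; each item is a Token
        print(' '.join(token.text for token in nlp(line)), file=fout)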

Answered by svp

with open ("file.txt", "r") as f1:
         data=str(f1.readlines())
         sent_tokenize(data)
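
sent_tokenize only splits the text into sentences; if word-level tokens are needed as well, each sentence can then be passed to word_tokenize. A minimal sketch, assuming the same file.txt:

from nltk.tokenize import sent_tokenize, word_tokenize

with open("file.txt", "r") as f1:
    data = f1.read()

# tokenize each sentence into words, yielding a list of token lists
tokens_per_sentence = [word_tokenize(sentence) for sentence in sent_tokenize(data)]
print(tokens_per_sentence)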