Python: Creating a new corpus with NLTK

Question

Asked by alvas

I reckon the usual answer to my title is to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm kind of new to Python.

I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpora in nltk_data.

I've tried PlaintextCorpusReader, but I couldn't get further than:

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()

How do I segment the sentences of newcorpus using punkt? I tried using the punkt functions, but they couldn't read the PlaintextCorpusReader class.

Can you also show me how to write the segmented data into text files?

Answer 1

Accepted answer by Reiner Gerecke

I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.

PlaintextCorpusReader's constructor:

def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):

You can pass the reader a word tokenizer and a sentence tokenizer, but for the latter the default is already nltk.data.LazyLoader('tokenizers/punkt/english.pickle').

For a single string, a tokenizer would be used as follows (explained here; see section 5 for the punkt tokenizer).

>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
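
The question also asks how to write the segmented data into text files. Here is a minimal sketch, assuming `sentences` is a list of strings shaped like the one returned by `tokenizer.tokenize(...)` above. The helper name `write_sentences` and the output file name are made up for illustration, and a hard-coded list stands in for the tokenizer output so that the snippet runs without the punkt model:

```python
import os
import tempfile


def write_sentences(sentences, path):
    """Write one sentence per line to `path` (hypothetical helper)."""
    with open(path, 'w', encoding='utf8') as fout:
        for sent in sentences:
            # Collapse internal newlines and runs of spaces so that
            # each sentence ends up on exactly one line.
            fout.write(' '.join(sent.split()) + '\n')


# Stand-in for the output of tokenizer.tokenize(text.strip()).
sentences = [
    'Punkt knows that the periods in Mr. Smith and Johann S. Bach\n'
    'do not mark sentence boundaries.',
    'And sometimes sentences\ncan start with non-capitalized words.',
    'i is a good variable\nname.',
]

outdir = tempfile.mkdtemp()  # illustrative output directory
outfile = os.path.join(outdir, 'segmented.txt')
write_sentences(sentences, outfile)

# Read the file back to check that there is one sentence per line.
with open(outfile, encoding='utf8') as fin:
    lines = fin.read().splitlines()
print(lines)
```

For sentences that are already lists of tokens, such as those a corpus reader returns from `sents()`, join each one with `' '.join(tokens)` before passing it to the helper.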

Answer 2

Answer by Krolique

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> # If the ./ dir contains the file my_corpus.txt, then you
>>> # can view, say, all the words in it by doing this:
>>> newcorpus.words('my_corpus.txt')

Answer 3

Answer by alvas

After some years of figuring out how it works, here's an updated tutorial on:

How to create an NLTK corpus with a directory of text files?

The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader.

If you have a directory that looks like this:

newcorpus/
         file1.txt
         file2.txt
         ...

Simply use these lines of code to get a corpus:

import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpusdir = 'newcorpus/' # Directory of corpus.

newcorpus = PlaintextCorpusReader(corpusdir, '.*')

NOTE: By default, the PlaintextCorpusReader splits your texts into sentences with the English punkt model (the one behind nltk.tokenize.sent_tokenize()) and into words with WordPunctTokenizer. These defaults are built for English and may NOT work for all languages.

Here's the full code, which creates test text files, builds a corpus from them with NLTK, and accesses the corpus at different levels:

import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different text files.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1, txt2]

# Make a new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory (1.txt, 2.txt, ...).
for filename, text in enumerate(corpus, start=1):
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        print(text, file=fout)

# Check that our corpus does exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    with open(corpusdir + infile, 'r') as fin:
        assert fin.read().strip() == text.strip()


# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile)  # The fileid of each file.
    with newcorpus.open(infile) as fin:  # Opens the file.
        print(fin.read().strip())  # Prints the content of the file.
print()

# Access the plaintext; returns a plain string.
print(newcorpus.raw().strip())
print()

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: The reader applies its sentence tokenizer and word tokenizer
#       automatically.
#
# Each element in the outermost list is a paragraph, and
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())
print()

# To access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

# Access sentences in the corpus. (list of list of strings)
# NOTE: The texts are flattened into sentences that contain tokens.
print(newcorpus.sents())
print()

# To access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

# Access just tokens/words in the corpus. (list of strings)
print(newcorpus.words())

# To access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))

Finally, to read a directory of texts and create an NLTK corpus in another language, you must first make sure that you have Python-callable word tokenization and sentence tokenization modules that take string input and produce output like this:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
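
As far as I can tell, PlaintextCorpusReader only calls a `tokenize(text)` method on whatever it is given as `word_tokenizer` and `sent_tokenizer`, so a tokenizer for another language can be a small class of your own. A rough sketch under that assumption follows. The class names, the regular expressions, and the `build_corpus` helper are illustrative and not part of NLTK, and the splitting rules are deliberately naive:

```python
import re


class SimpleSentTokenizer:
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""

    def tokenize(self, text):
        parts = re.split(r'(?<=[.!?])\s+', text.strip())
        return [p for p in parts if p]


class SimpleWordTokenizer:
    """Naive word splitter: runs of word characters, or single punctuation marks."""

    def tokenize(self, text):
        return re.findall(r'\w+|[^\w\s]', text)


def build_corpus(corpusdir):
    """Create a reader over the .txt files in `corpusdir` using the tokenizers above."""
    # Imported here so the tokenizer classes can be tried without NLTK installed.
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader
    return PlaintextCorpusReader(
        corpusdir, r'.*\.txt',
        word_tokenizer=SimpleWordTokenizer(),
        sent_tokenizer=SimpleSentTokenizer(),
    )


# Try the tokenizers on their own, on the same text as above.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
sents = SimpleSentTokenizer().tokenize(txt1)
print(sents)
print(SimpleWordTokenizer().tokenize(sents[0]))
```

With NLTK installed, `build_corpus('newcorpus/')` should give a reader whose `sents()` and `words()` go through these two classes instead of the English defaults.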
