Python Gensim Word2Vec

Date: 2020-02-23 14:42:45  Source: igfitidea

Gensim is an open-source vector space and topic modeling toolkit.
It is implemented in Python and uses NumPy and SciPy.
It also uses Cython for performance.

1. Python Gensim Module

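Gensim is designed for data streaming, handling large text collections, and efficient incremental algorithms. Put simply, Gensim is designed to extract semantic topics from documents automatically, as efficiently and painlessly as possible.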
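To make the streaming idea concrete, here is a minimal sketch of a corpus that yields one tokenized document at a time instead of loading everything into memory. The class name `StreamingCorpus`, the file name `corpus.txt`, and the one-document-per-line layout are assumptions made for illustration; they are not part of Gensim.

```python
import os
import tempfile


class StreamingCorpus:
    """Yield one tokenized line at a time (hypothetical helper, not a Gensim class)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Build a tiny throwaway corpus file to demonstrate the idea.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "corpus.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("This is the first document\n\nAnd this is the second\n")

corpus = StreamingCorpus(demo_path)
for tokens in corpus:
    print(tokens)
```

Because `__iter__` starts again from the top of the file, an object like this can be passed to code that needs several passes over the data while only one line is held in memory at a time.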

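This sets it apart from most other packages, which target only in-memory, batch processing.
Gensim's core unsupervised algorithms, such as Latent Semantic Analysis and Latent Dirichlet Allocation, examine statistical co-occurrence patterns of words within a corpus of training documents to discover the semantic structure of the documents.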
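As a rough illustration of what "word co-occurrence patterns" means, the sketch below counts how often two words appear in the same document. The toy documents are made up, and this is only the raw statistic; Latent Semantic Analysis and Latent Dirichlet Allocation are considerably more involved than a pair count.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical data): each document is a list of tokens.
documents = [
    ["human", "computer", "interface"],
    ["computer", "system", "interface"],
    ["graph", "trees"],
]

cooccurrence = Counter()
for tokens in documents:
    # Count each unordered pair of distinct words once per document.
    for left, right in combinations(sorted(set(tokens)), 2):
        cooccurrence[(left, right)] += 1

for pair, count in cooccurrence.most_common(3):
    print(pair, count)
```

Words that keep showing up together, such as "computer" and "interface" here, are the kind of signal a topic model can group into one theme.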

2. Why Use Gensim?

Gensim has several features that give it an edge over other scientific packages, such as:

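Memory independence: the whole training corpus never needs to reside in memory at one time, so Gensim can easily process large, web-scale corpora.
It provides I/O wrappers and converters for several popular data formats.
It has efficient implementations of various vector space algorithms, including Tf-Idf, distributed incremental Latent Semantic Analysis, distributed incremental Latent Dirichlet Allocation (LDA), and Random Projection, and adding new ones is easy.
It also provides similarity queries for documents in their semantic representation.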
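To give a feel for what a document similarity query does, here is a minimal sketch that ranks documents against a query by cosine similarity. It uses plain word counts and only the standard library; the documents, the query, and the helper name `cosine_similarity` are assumptions for illustration, and Gensim's own queries work on richer representations such as Tf-Idf or LSA vectors.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents (hypothetical data), represented as bag-of-words counts.
documents = {
    "doc_a": "the cat sat on the mat",
    "doc_b": "the dog chased the cat",
    "doc_c": "stock prices fell sharply",
}
vectors = {name: Counter(text.split()) for name, text in documents.items()}

query = Counter("cat on the mat".split())
ranked = sorted(
    ((name, cosine_similarity(query, vec)) for name, vec in vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranked:
    print(name, round(score, 3))
```

The document sharing the most words with the query ranks first, and a document with no words in common scores zero.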

3. Getting Started with Gensim

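Before you start with Gensim, check that your machine is ready for it.
Gensim assumes the following are working on your machine (these are the minimum versions listed when this article was written; current Gensim releases require Python 3 and newer NumPy and SciPy):

Python 2.6 or later
NumPy 1.3 or later
SciPy 0.7 or later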
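One way to check whether the machine is ready is to ask Python which of these packages it can find. The sketch below uses only the standard library; the helper name `check_packages` is hypothetical.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


print("Python version:", sys.version.split()[0])
for name, available in check_packages(["numpy", "scipy", "gensim"]).items():
    print(f"{name}: {'found' if available else 'missing'}")
```

Anything reported as missing needs to be installed before the examples in this article will run.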

3.1) Installing the Gensim Library

Once the requirements above are met, you are ready to install Gensim.
You can get it with pip.
Open a terminal and run the following command:

sudo pip install --upgrade gensim

3.2) Using Gensim

You can import gensim into any Python script just like any other package.
Use the following import:

import gensim

3.3) Developing a Gensim Word2Vec Embedding

We have talked a lot about text, words, and vectors while introducing Gensim, so let's start by developing a word2vec embedding:

from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
			['this', 'is', 'the', 'second', 'sentence'],
			['yet', 'another', 'sentence'],
			['one', 'more', 'sentence'],
			['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
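# summarize vocabulary (Gensim 4.x; on Gensim 3.x use list(model.wv.vocab))
words = list(model.wv.index_to_key)
print(words)
# access vector for one word (index the KeyedVectors in model.wv, not the model)
print(model.wv['sentence'])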
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

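Running the code prints the model summary, the vocabulary, and the vector for the word 'sentence'; the model holds one such vector for every word in the vocabulary.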
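The training data in the example is already tokenized, that is, each sentence is a list of words. As a minimal sketch of how raw text could be turned into that list-of-lists shape, the code below assumes simple lowercase, regex-based tokenization; the helper name `tokenize` and the sample sentences are hypothetical, and real projects often need more careful preprocessing.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Raw text (hypothetical data), one sentence per entry.
raw_sentences = [
    "This is the first sentence for word2vec.",
    "Yet another sentence!",
]
sentences = [tokenize(sentence) for sentence in raw_sentences]
print(sentences)
```

The resulting `sentences` value has the same structure as the hand-written training data and could be passed to `Word2Vec` in the same way.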

3.4) Visualizing Word Embeddings

Each word in the training data is represented by a vector with many dimensions, which is hard to interpret directly.
Visualization can help here:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
			['this', 'is', 'the', 'second', 'sentence'],
			['yet', 'another', 'sentence'],
			['one', 'more', 'sentence'],
			['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# fit a 2d PCA model to the vectors (Gensim 4.x API; rows follow the order of words)
words = list(model.wv.index_to_key)
X = model.wv[words]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
# label each point with its word
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

Running the program shows each word as a labeled point in a 2-D scatter plot, which is much easier to read than the raw vectors.

3.5) Loading Google's Word2Vec Embedding

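Using existing pre-trained data may not be the best approach for an NLP application, but training your own vectors can be a time-consuming and difficult task, since it requires a lot of memory and compute time.
So for this example we use Google's data.
You will need Google's pre-trained vector file for this example.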

Download the file and unzip it; we will use the binary file it contains.

Here is a sample program:

from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

The example above loads Google's word2vec data and then computes "king - man + woman = ?".
We should expect the following:

[('queen', 0.7118192315101624)]
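Conceptually, the query adds and subtracts word vectors and then looks for the nearest remaining word by cosine similarity. The sketch below shows that arithmetic on made-up 2-dimensional vectors using only the standard library. The vector values and this `most_similar` function are hypothetical stand-ins; Gensim's own implementation also normalizes the input vectors and works on the full 300-dimensional data.

```python
import math

# Toy 2-d vectors (hypothetical values): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    scores = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in excluded  # the query words themselves are skipped
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


result = most_similar(positive=["woman", "king"], negative=["man"])
print(result)
```

With these toy values, "king - man + woman" lands exactly on the vector for "queen", so it is the top match. Real embeddings are only approximately aligned, which is why the actual score above is below 1.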

3.6) Loading Stanford's GloVe Embedding

There is another algorithm for converting words to vectors, known as Global Vectors for Word Representation, or GloVe.
We use it in the next example.

Since we are again using pre-trained data, we need a file; this one is relatively small.

First, we need to convert the file to word2vec format, which can be done as follows:

from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
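For context, the two text formats differ in one respect: a word2vec text file starts with a header line giving the number of vectors and their dimensionality, while a GloVe file has no header. The sketch below shows that difference on a tiny made-up file using only the standard library. The helper name `add_word2vec_header`, the demo file names, and the values are hypothetical, and the sketch reads the whole file into memory, so for real files the `glove2word2vec` call above is the better choice.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending the 'vocab_size dimensions' header line."""
    with open(glove_path, encoding="utf-8") as handle:
        lines = handle.readlines()
    num_vectors = len(lines)
    # Each line is "<word> <v1> <v2> ...", so dimensions = fields - 1.
    num_dims = len(lines[0].split()) - 1
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(f"{num_vectors} {num_dims}\n")
        handle.writelines(lines)
    return num_vectors, num_dims


# Tiny stand-in for a GloVe file (hypothetical values, 3 dimensions).
demo_dir = tempfile.mkdtemp()
glove_demo = os.path.join(demo_dir, "glove.demo.txt")
word2vec_demo = os.path.join(demo_dir, "glove.demo.txt.word2vec")
with open(glove_demo, "w", encoding="utf-8") as handle:
    handle.write("king 0.1 0.2 0.3\nqueen 0.1 0.2 0.4\n")

print(add_word2vec_header(glove_demo, word2vec_demo))
with open(word2vec_demo, encoding="utf-8") as handle:
    print(handle.read())
```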

Once that is done, we can move on to the following example:

from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
