Python: Convert word2vec bin file to text
Original question: http://stackoverflow.com/questions/27324292/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Convert word2vec bin file to text
Asked by Glenn
From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.
Supposedly gensim can do this also, but all the tutorials I've found seem to be about converting from text, not the other way.
Can someone suggest modifications to the C code or instructions for gensim to emit text?
Accepted answer by Glenn
On the word2vec-toolkit mailing list Thomas Mensink has provided an answer in the form of a small C program that will convert a .bin file to text. This is a modification of the distance.c file. I replaced the original distance.c with Thomas's code below, rebuilt word2vec (make clean; make), and renamed the compiled distance to readbin. Then ./readbin vector.bin prints a text version of vector.bin to standard output.
// Copyright 2013 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <stdlib.h> // malloc; replaces the non-standard <malloc.h>
#include <string.h>
#include <math.h>

const long long max_size = 2000; // max length of strings
const long long N = 40; // number of closest words that will be shown
const long long max_w = 50; // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  // The header is plain text: vocabulary size, then vector dimensionality
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (vocab == NULL || M == NULL) {
    printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    // Read the word and the single space after it, then the raw floats
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    // Normalize the vector to unit length (inherited from distance.c),
    // so the printed values are the normalized vectors
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n", words, size, file_name);
  for (a = 0; a < words; a++) {
    printf("%s ", &vocab[a * max_w]);
    for (b = 0; b < size; b++) { printf("%f ", M[a * size + b]); }
    // "\b\b" are backspace characters; when the output is redirected
    // to a file they are written as literal bytes
    printf("\b\b\n");
  }
  return 0;
}
I removed the "\b\b" from the printf.
By the way, the resulting text file still contained the text word and some unnecessary whitespace which I did not want for some numerical calculations. I removed the initial text column and the trailing blank from each line with bash commands.
cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt
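The same cleanup can be sketched in Python. This is only an illustration, not part of the original answer: it assumes each data line has the form "word v1 v2 ... " (a word, then space-separated numbers, possibly followed by a trailing space), and the function name strip_word_column is hypothetical.

```python
def strip_word_column(lines):
    """Yield each line without its first space-separated field.

    Trailing spaces and the newline are dropped as well, which roughly
    mirrors the cut and sed commands above.
    """
    for line in lines:
        fields = line.rstrip("\n").rstrip(" ").split(" ")
        yield " ".join(fields[1:])


def to_floats(line):
    """Parse one cleaned line into a list of floats."""
    return [float(x) for x in line.split(" ") if x]
```

Passing an open text file as lines processes the file one line at a time, so the whole file never has to be held in memory.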
Answer by zaytsev
I am using gensim to work with GoogleNews-vectors-negative300.bin, and I include the binary=True flag when loading the model.
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
Seems to be working fine.
Answer by David Przybilla
I had a similar issue: I wanted to export bin/non-bin (gensim) models as CSV.
Here is the code that does that in Python; it assumes you have gensim installed:
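A minimal sketch of such a CSV export follows. It is not the original author's code. It uses only the standard csv module and assumes the vectors are already available as (word, vector) pairs, which with gensim would come from a loaded model. The function name write_vectors_csv and the one-row-per-word layout are assumptions made for illustration.

```python
import csv


def write_vectors_csv(pairs, path):
    """Write one CSV row per word: the word, then its vector components."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for word, vector in pairs:
            # repr(float(x)) keeps enough digits to round-trip each value
            writer.writerow([word] + [repr(float(x)) for x in vector])
```

Because the csv module quotes fields when needed, words that contain commas or quotes still produce a valid file.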
Answer by dbao50
The format is IEEE 754 single-precision binary floating-point format (binary32), see http://en.wikipedia.org/wiki/Single-precision_floating-point_format
They use little-endian.
Let's do an example:
- The first line is a text string: "3000000 300\n" (vocabSize and vecSize; read bytes until byte == '\n').
- Each following entry starts with the vocab word, followed by 300*4 bytes of float values (4 bytes for each dimension):
  read bytes until byte == 32 (space): 60 47 115 62 32 => </s>[space]
  then each subsequent group of 4 bytes represents one float number
  next 4 bytes: 0 0 -108 58 => 0.001129150390625 (bytes shown as signed values; -108 is the byte 0x94)
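The layout described above can be read with a few lines of Python using only the standard library. This is a sketch under stated assumptions, not the word2vec reference reader: the function name read_word2vec_bin is hypothetical, and the code assumes that a newline byte may follow each vector, which it skips before the next word.

```python
import struct


def read_word2vec_bin(stream):
    """Yield (word, vector) pairs from a binary stream opened in 'rb' mode."""
    # Header line is text, e.g. b"3000000 300\n"
    header = stream.readline().decode("ascii")
    vocab_size, vec_size = (int(x) for x in header.split())
    fmt = "<%df" % vec_size          # little-endian float32 values
    n_bytes = struct.calcsize(fmt)   # 4 bytes per dimension
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of file inside a word")
            if ch == b" ":           # byte 32 terminates the word
                break
            if ch != b"\n":          # skip a newline left after the previous vector
                word.extend(ch)
        data = stream.read(n_bytes)
        if len(data) != n_bytes:
            raise EOFError("unexpected end of file inside a vector")
        yield word.decode("utf-8", errors="replace"), struct.unpack(fmt, data)
```

Reading one byte at a time keeps the sketch close to the description; for the full 3.4GB file a buffered reader would likely be much faster.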
You can check the Wikipedia link to see how this works; let me do this one as an example:
(little-endian -> reverse the byte order) 00111010 10010100 00000000 00000000
- the first bit is the sign bit: 0 => sign = 1 (a sign bit of 1 would give sign = -1)
- the next 8 bits: 01110101 => 117 => exp = 2^(117-127) = 2^(-10)
- the next 23 bits: 0010100 00000000 00000000 => pre = 1 + 0*2^(-1) + 0*2^(-2) + 1*2^(-3) + 0*2^(-4) + 1*2^(-5) = 1.15625 (the leading 1 is implicit)
value = sign * exp * pre = 1 * 2^(-10) * 1.15625 = 0.001129150390625
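The worked example can be checked in Python. The sketch below decodes the same four bytes twice: once with struct, and once by hand following the sign/exponent/fraction steps. The variable names are illustrative, and the manual formula covers only normal numbers (not zeros, subnormals, infinities or NaN).

```python
import struct

# 0 0 -108 58, with -108 written as the unsigned byte 148 (0x94)
raw = bytes([0, 0, 148, 58])

# Let struct decode it: "<f" means little-endian float32
(value,) = struct.unpack("<f", raw)

# Decode by hand
bits = int.from_bytes(raw, "little")   # 0x3A940000
sign = -1 if bits >> 31 else 1         # top bit
exponent = (bits >> 23) & 0xFF         # next 8 bits, 117 here
fraction = bits & 0x7FFFFF             # low 23 bits
manual = sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)

print(value, manual)
```

Both values come out as 0.001129150390625, matching the calculation above.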
Answer by batgirl
You can load the binary file with gensim's word2vec module and then save a text version like this:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format("file.txt", binary=False)  # model.save() would write gensim's own pickle format, not text
Answer by Franck Dernoncourt
Here is the code I use:
import codecs
from gensim.models import Word2Vec


def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300_test.txt'
    export_to_file(path_to_model, output_file)


def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w', 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    print('done loading Word2Vec')
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        #print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector }
        vector_str = ",".join(vector)
        line = mid + "\t" + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()


if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling
Answer by silo
I use this code to load the binary model and then save it to a text file:
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
Note:
The code above is for newer versions of gensim. For older versions, I used this code:
from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
Answer by Hamed Ganji
convertvec is a small tool to convert vectors between different formats for the word2vec library.
Convert vectors from binary to plain text:
./convertvec bin2txt input.bin output.txt
Convert vectors from plain text to binary:
./convertvec txt2bin input.txt output.bin
Answer by Yohanes Gultom
Just a quick update, as there is now an easier way.
If you are using word2vec from https://github.com/dav/word2vec there is an additional option called -binary, which accepts 1 to generate a binary file or 0 to generate a text file. This example comes from demo-word.sh in the repo:
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15
Answer by Raphael Schumann
If you get the error:
ImportError: No module named models.word2vec
then it is because there was an API update. This will work:
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('./GoogleNews-vectors-negative300.txt', binary=False)