Python: Convert word2vec bin file to text

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original: http://stackoverflow.com/questions/27324292/

Time: 2020-08-19 01:39:47  Source: igfitidea

Convert word2vec bin file to text

Tags: python, c, gensim, word2vec

Asked by Glenn

From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.

Supposedly gensim can do this also, but all the tutorials I've found seem to be about converting from text, not the other way.

Can someone suggest modifications to the C code or instructions for gensim to emit text?

Accepted answer by Glenn

On the word2vec-toolkit mailing list Thomas Mensink has provided an answer in the form of a small C program that will convert a .bin file to text. This is a modification of the distance.c file. I replaced the original distance.c with Thomas's code below and rebuilt word2vec (make clean; make), and renamed the compiled distance to readbin. Then ./readbin vector.bin will create a text version of vector.bin.

//  Copyright 2013 Google Inc. All Rights Reserved.
//
//  Licensed under the Apache License, Version 2.0 (the "License");
//  you may not use this file except in compliance with the License.
//  You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
//  Unless required by applicable law or agreed to in writing, software
//  distributed under the License is distributed on an "AS IS" BASIS,
//  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
//  See the License for the specific language governing permissions and
//  limitations under the License.

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>  // malloc (portable replacement for <malloc.h>)

const long long max_size = 2000;         // max length of strings
const long long N = 40;                  // number of closest words that will be shown
const long long max_w = 50;              // max length of vocabulary entries

int main(int argc, char **argv) {
  FILE *f;
  char file_name[max_size];
  float len;
  long long words, size, a, b;
  char ch;
  float *M;
  char *vocab;
  if (argc < 2) {
    printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
    return 0;
  }
  strcpy(file_name, argv[1]);
  f = fopen(file_name, "rb");
  if (f == NULL) {
    printf("Input file not found\n");
    return -1;
  }
  fscanf(f, "%lld", &words);
  fscanf(f, "%lld", &size);
  vocab = (char *)malloc((long long)words * max_w * sizeof(char));
  M = (float *)malloc((long long)words * (long long)size * sizeof(float));
  if (M == NULL) {
    printf("Cannot allocate memory: %lld MB    %lld  %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
    return -1;
  }
  for (b = 0; b < words; b++) {
    fscanf(f, "%s%c", &vocab[b * max_w], &ch);
    for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
    len = 0;
    for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
    len = sqrt(len);
    for (a = 0; a < size; a++) M[a + b * size] /= len;
  }
  fclose(f);
  //Code added by Thomas Mensink
  //output the vectors of the binary format in text
  printf("%lld %lld #File: %s\n",words,size,file_name);
  for (a = 0; a < words; a++){
    printf("%s ",&vocab[a * max_w]);
    for (b = 0; b< size; b++){ printf("%f ",M[a*size + b]); }
    printf("\b\b\n");
  }  

  return 0;
}

I removed the "\b\b" from the printf.

By the way, the resulting text file still contained the text word and some unnecessary whitespace which I did not want for some numerical calculations. I removed the initial text column and the trailing blank from each line with bash commands.

cut --complement -d ' ' -f 1 GoogleNews-vectors-negative300.txt > GoogleNews-vectors-negative300_tuples-only.txt
sed 's/ $//' GoogleNews-vectors-negative300_tuples-only.txt
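
For readers who would rather stay in Python, the same clean-up can be sketched with the standard library alone. This is only an illustration of the cut/sed step above; the function names numbers_only and strip_word_column are hypothetical. Like cut, it also drops the first field of the header line.

```python
def numbers_only(line):
    """Drop the first space-separated field (the word) and any trailing blanks."""
    fields = line.rstrip("\n").rstrip(" ").split(" ")
    return " ".join(fields[1:])


def strip_word_column(src, dst):
    """Copy text vectors from src to dst, keeping only the numeric columns."""
    for line in src:
        dst.write(numbers_only(line) + "\n")
```

It could be called with two open text files, for example strip_word_column(open('GoogleNews-vectors-negative300.txt', encoding='utf-8'), open('tuples-only.txt', 'w')), where the second file name is just a placeholder.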

Answer by zaytsev

I am using gensim to work with the GoogleNews-vectors-negative300.bin and I am including a binary = True flag while loading the model.

from gensim import word2vec

model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True) 

Seems to be working fine.

Answer by David Przybilla

I had a similar issue: I wanted to get bin/non-bin (gensim) models output as CSV.

Here is code that does that in Python; it assumes you have gensim installed:

https://gist.github.com/dav009/10a742de43246210f3ba
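
The gist itself is not reproduced here. As a rough, independent sketch (not the gist's code), a vectors file that is already in word2vec text format can be turned into CSV with the standard csv module. The function name text_vectors_to_csv is hypothetical, and the sketch assumes a "<vocab_size> <vector_size>" header line followed by one "word v1 v2 ..." line per entry.

```python
import csv


def text_vectors_to_csv(src, dst):
    """Write one CSV row per word: the word, then its vector components."""
    writer = csv.writer(dst, lineterminator="\n")
    src.readline()  # skip the "<vocab_size> <vector_size>" header line
    for line in src:
        fields = line.rstrip("\n").rstrip(" ").split(" ")
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        writer.writerow(fields)  # csv quotes words that contain commas
```

Using csv.writer rather than a plain ",".join keeps the output parseable when a vocabulary entry itself contains a comma or a quote.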

Answer by dbao50

The format is IEEE 754 single-precision binary floating-point format: binary32 (http://en.wikipedia.org/wiki/Single-precision_floating-point_format). They use little-endian.

Let's do an example:

  • The first line is a string: "3000000 300\n" (vocabSize & vecSize; getByte until byte == '\n')
  • Each following entry starts with the vocab word, followed by the vector (300*4 bytes of float values, 4 bytes per dimension):

    getByte until byte == 32 (space). (60 47 115 62 32 => </s>[space])
    
  • Each subsequent group of 4 bytes then represents one float:

    next 4 bytes: 0 0 -108 58 => 0.001129150390625 (bytes shown as signed values; -108 is 148 unsigned)
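
The layout described in this list can be read with a few lines of standard-library Python. The following is a minimal sketch under stated assumptions, not a replacement for gensim: the function name bin_to_text is hypothetical, the floats are printed with six decimals like C's %f, and newline bytes before a word are skipped on the assumption that the writer emitted one after each vector.

```python
import struct


def bin_to_text(src, dst):
    """Convert a word2vec binary stream `src` into text lines written to `dst`."""
    # Header: ASCII "<vocab_size> <vector_size>\n"
    vocab_size, vector_size = (int(x) for x in src.readline().split())
    dst.write(f"{vocab_size} {vector_size}\n")
    for _ in range(vocab_size):
        # Read the word byte by byte until the separating space (byte 32).
        word = bytearray()
        while True:
            ch = src.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # assumption: drop the newline left after the previous vector
                word.extend(ch)
        # Then vector_size little-endian 32-bit floats.
        raw = src.read(4 * vector_size)
        if len(raw) != 4 * vector_size:
            raise EOFError("file ended in the middle of a vector")
        values = struct.unpack(f"<{vector_size}f", raw)
        text = " ".join("%.6f" % v for v in values)
        dst.write(word.decode("utf-8", errors="replace") + " " + text + "\n")
```

It expects src opened in binary mode ('rb') and dst opened in text mode with UTF-8 encoding. Because it streams one entry at a time, it does not need to hold the whole 3.4GB model in memory.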

You can check the Wikipedia link to see how the decoding works. Let me do this one as an example:

(little-endian -> reverse order) 00111010 10010100 00000000 00000000

  • the first bit is the sign bit: 0 => sign = 1 (a 1 bit would give sign = -1)
  • next 8 bits => 117 => exp = 2^(117-127)
  • next 23 bits => pre = 1 + 0*2^(-1) + 0*2^(-2) + 1*2^(-3) + 0*2^(-4) + 1*2^(-5) (the leading 1 is implicit in the format)

value = sign * exp * pre
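
The arithmetic can be checked in Python. This sketch decodes the four example bytes by hand and compares the result with struct.unpack; it assumes a normal (non-zero, non-infinite) value, for which the fraction carries an implicit leading 1.

```python
import struct

raw = bytes([0, 0, 148, 58])  # -108 as a signed byte is 148 unsigned

# Decode by hand, following sign * exp * pre.
bits = int.from_bytes(raw, "little")   # 0x3A940000
sign = -1.0 if bits >> 31 else 1.0     # sign bit is 0 here
exponent = (bits >> 23) & 0xFF         # 117
fraction = bits & 0x7FFFFF             # the 23 stored fraction bits
pre = 1 + fraction / 2**23             # 1.15625, with the implicit leading 1
manual = sign * 2.0 ** (exponent - 127) * pre

# Let the standard library do the same job ("<f" = little-endian float32).
library = struct.unpack("<f", raw)[0]

print(manual, library)
```

Both routes give the same number, which is why readers of the binary format normally just call struct.unpack (or fread in C) instead of decoding bits.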

Answer by batgirl

You can load the binary file in word2vec, and then save the text version like this:

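from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('Path/to/GoogleNews-vectors-negative300.bin', binary=True)
# model.save() would write gensim's native format, not text; use save_word2vec_format with binary=False
model.save_word2vec_format("file.txt", binary=False)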

Answer by Franck Dernoncourt

Here is the code I use:

import codecs
from gensim.models import Word2Vec

def main():
    path_to_model = 'GoogleNews-vectors-negative300.bin'
    output_file = 'GoogleNews-vectors-negative300_test.txt'
    export_to_file(path_to_model, output_file)


def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w' , 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    print('done loading Word2Vec')
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        #print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector  }
        vector_str = ",".join(vector)
        line = mid + "\t"  + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()

if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling

Answer by silo

I use this code to load the binary model and then save it to a text file:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

References: API and nullege.

Note:

The code above is for the new version of gensim. For the previous version, I used this code:

from gensim.models import word2vec

model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

Answer by Hamed Ganji

convertvec is a small tool to convert vectors between different formats for the word2vec library.

Convert vectors from binary to plain text:

./convertvec bin2txt input.bin output.txt

Convert vectors from plain text to binary:

./convertvec txt2bin input.txt output.bin

Answer by Yohanes Gultom

Just a quick update, as there is now an easier way.

If you are using word2vec from https://github.com/dav/word2vec there is an additional option called -binary which accepts 1 to generate a binary file or 0 to generate a text file. This example comes from demo-word.sh in the repo:

time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 0 -iter 15

Answer by Raphael Schumann

If you get the error:

ImportError: No module named models.word2vec

then it is because there was an API update. This will work:

from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
model.save_word2vec_format('./GoogleNews-vectors-negative300.txt', binary=False)