Notice: this page is derived from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2108293/


Saving huge bigram dictionary to file using pickle

Tags: python, file, dictionary, pickle

Asked by João Portela

A friend of mine wrote this little program. The textFile is 1.2 GB in size (7 years' worth of newspapers). He successfully manages to create the dictionary, but he cannot write it to a file using pickle (the program hangs).

import sys
import string
import cPickle as pickle

biGramDict = {}

textFile = open(str(sys.argv[1]), 'r')
biGramDictFile = open(str(sys.argv[2]), 'w')


for line in textFile:
   if (line.find('<s>')!=-1):
      old = None
      for line2 in textFile:
         if (line2.find('</s>')!=-1):
            break
         else:
            line2=line2.strip()
            if line2 not in string.punctuation:
               if old != None:
                  if old not in biGramDict:
                     biGramDict[old] = {}
                  if line2 not in biGramDict[old]:
                     biGramDict[old][line2] = 0
                  biGramDict[old][line2]+=1
               old=line2

textFile.close()

print "going to pickle..."    
pickle.dump(biGramDict, biGramDictFile,2)

print "pickle done. now load it..."

biGramDictFile.close()
biGramDictFile = open(str(sys.argv[2]), 'r')

newBiGramDict = pickle.load(biGramDictFile)

Thanks in advance.

EDIT
For anyone interested, I will briefly explain what this program does. Assume you have a file formatted roughly like this:

<s>
Hello
,
World
!
</s>
<s>
Hello
,
munde
!
</s>
<s>
World
domination
.
</s>
<s>
Total
World
domination
!
</s>
  • <s> and </s> are sentence separators.
  • One word per line.

A biGramDictionary is generated for later use, something like this:

{
 "Hello": {"World": 1, "munde": 1}, 
 "World": {"domination": 2},
 "Total": {"World": 1},
}
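
For readers who want to reproduce the dictionary above, here is a minimal Python 3 sketch of the counting step. It is not the asker's program: the function name count_bigrams is hypothetical, and it assumes the separators sit alone on their own lines, so it compares with == where the original uses find(). Like the original, it skips punctuation tokens without resetting the previous word.

```python
import string
from collections import defaultdict


def count_bigrams(lines):
    """Count word pairs inside <s>...</s> blocks, one token per line."""
    bigrams = defaultdict(lambda: defaultdict(int))
    previous = None
    inside = False
    for raw in lines:
        token = raw.strip()
        if token == "<s>":
            # A new sentence starts: forget the last word of the previous one.
            inside, previous = True, None
        elif token == "</s>":
            inside = False
        elif inside and token and token not in string.punctuation:
            if previous is not None:
                bigrams[previous][token] += 1
            previous = token
    # Convert to plain dicts so the result prints and pickles like the example.
    return {first: dict(followers) for first, followers in bigrams.items()}
```

Passing an open text file (or any iterable of lines) works, because the function only iterates once and never holds the input in memory.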

Hope this helps. Right now the strategy has changed to using MySQL, because SQLite just wasn't working (probably because of the size).

Answer by Wim

Pickle is only meant to write complete (small) objects. Your dictionary is rather large to even hold in memory, so you would be better off using a database instead; that way you can store and retrieve entries one by one rather than all at once.
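
To illustrate what "one by one" could look like, here is a rough sketch using the standard-library sqlite3 module. The table name bigrams, its columns, and the helper functions are all assumptions made for this example; nothing in the question prescribes this schema.

```python
import sqlite3


def open_bigram_db(path):
    """Open (or create) a database holding one row per word pair."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bigrams ("
        " first TEXT NOT NULL,"
        " second TEXT NOT NULL,"
        " n INTEGER NOT NULL,"
        " PRIMARY KEY (first, second))"
    )
    return conn


def add_bigram(conn, first, second):
    """Record one more occurrence of the pair (first, second)."""
    # Make sure the row exists, then bump its counter.
    conn.execute(
        "INSERT OR IGNORE INTO bigrams (first, second, n) VALUES (?, ?, 0)",
        (first, second),
    )
    conn.execute(
        "UPDATE bigrams SET n = n + 1 WHERE first = ? AND second = ?",
        (first, second),
    )


def followers(conn, first):
    """Return {second_word: count} for a single first word."""
    rows = conn.execute(
        "SELECT second, n FROM bigrams WHERE first = ?", (first,)
    )
    return dict(rows)


conn = open_bigram_db(":memory:")  # use a file path for real data
add_bigram(conn, "World", "domination")
add_bigram(conn, "World", "domination")
conn.commit()
print(followers(conn, "World"))
```

For a 1.2 GB corpus it would probably pay off to commit in large batches rather than after every pair, but that tuning is outside the scope of this sketch.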

Some good, easily integrated single-file database formats you can use from Python are SQLite or one of the DBM variants. The latter acts just like a dictionary (i.e. you can read and write key/value pairs) but uses the disk as storage rather than 1.2 GB of memory.
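
As a sketch of the DBM route in Python 3, the standard dbm module can hold the counts on disk. DBM stores only flat bytes keys and values, so this example assumes a made-up key layout (both words joined by a tab); the helpers make_key, bump and count_of are hypothetical names, not part of the dbm API.

```python
import dbm
import os
import tempfile


def make_key(first, second):
    # Hypothetical key layout: both words joined by a tab, encoded as bytes.
    return f"{first}\t{second}".encode("utf-8")


def bump(db, first, second):
    """Increase the stored count for the pair (first, second) by one."""
    key = make_key(first, second)
    current = int(db[key].decode("ascii")) if key in db else 0
    db[key] = str(current + 1).encode("ascii")


def count_of(db, first, second):
    """Return the stored count, or 0 if the pair was never seen."""
    key = make_key(first, second)
    return int(db[key].decode("ascii")) if key in db else 0


with tempfile.TemporaryDirectory() as folder:
    # "c" opens the database for reading and writing, creating it if needed.
    with dbm.open(os.path.join(folder, "bigrams"), "c") as db:
        bump(db, "World", "domination")
        bump(db, "World", "domination")
        print(count_of(db, "World", "domination"))
```

The temporary directory is only there to keep the example self-cleaning; a real run would use a permanent path.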

Answer by Ryan Ginstrom

One solution is to use buzhug instead of pickle. It's a pure-Python solution and retains very Pythonic syntax. I think of it as the next step up from shelve and its ilk. It will handle the data sizes you're talking about. Its size limit is 2 GB per field (each field is stored in a separate file).
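
For comparison, this is roughly what the shelve baseline mentioned above looks like in Python 3. Note that it shows shelve from the standard library, not buzhug, whose API is not reproduced here; the helper names save_bigrams and followers_of are invented for the example.

```python
import os
import shelve
import tempfile


def save_bigrams(path, bigrams):
    """Write each first word and its follower counts as a separate entry."""
    # flag="n" always starts from a new, empty shelf.
    with shelve.open(path, flag="n") as shelf:
        for first, followers in bigrams.items():
            shelf[first] = followers  # each inner dict is pickled on its own


def followers_of(path, first):
    """Read back the followers of one word without loading the rest."""
    with shelve.open(path, flag="r") as shelf:
        return shelf.get(first, {})


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "bigrams")
    save_bigrams(path, {"Hello": {"World": 1, "munde": 1}})
    print(followers_of(path, "Hello"))
```

Because every value is pickled separately, no single pickle call has to serialize the whole dictionary, which is the part that hung in the question.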

Answer by Khelben

Do you really need the whole data set in memory? If you want to keep the dictionary/pickle approach, you could split it in naive ways, such as one file per year or per month.
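
A minimal sketch of that splitting idea in Python 3 follows. It assumes the corpus can be partitioned by some label such as "2003-01" (the question does not say whether the text carries dates), and the file naming scheme and helper names are made up for illustration.

```python
import os
import pickle
import tempfile


def partition_path(folder, label):
    # Hypothetical naming scheme: one pickle file per year or month label.
    return os.path.join(folder, f"bigrams-{label}.pickle")


def save_partition(folder, label, bigrams):
    """Pickle the bigram counts of a single partition."""
    with open(partition_path(folder, label), "wb") as handle:  # binary mode
        pickle.dump(bigrams, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_partition(folder, label):
    """Load only the partition that is actually needed."""
    with open(partition_path(folder, label), "rb") as handle:
        return pickle.load(handle)


def merge_counts(total, part):
    """Add the counts of one partition into a running total, in place."""
    for first, followers in part.items():
        target = total.setdefault(first, {})
        for second, n in followers.items():
            target[second] = target.get(second, 0) + n
    return total


with tempfile.TemporaryDirectory() as folder:
    save_partition(folder, "2003-01", {"World": {"domination": 1}})
    save_partition(folder, "2003-02", {"World": {"domination": 1}})
    total = {}
    for label in ("2003-01", "2003-02"):
        merge_counts(total, load_partition(folder, label))
    print(total)
```

Merging every partition brings the memory problem back, so this only helps if most queries touch a few partitions at a time.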

Also, remember that dictionaries are not sorted, so you may run into problems if you have to sort that amount of data. That only matters if you want to search or sort the data, of course...
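
If the sorting is only needed to find the most frequent followers of a word, one possible way around a full sort is heapq.nlargest, sketched below in Python 3. The function name top_followers and its limit parameter are assumptions for this example.

```python
import heapq


def top_followers(bigrams, first, limit=3):
    """Return up to `limit` (word, count) pairs with the highest counts."""
    followers = bigrams.get(first, {})
    # nlargest keeps only `limit` candidates instead of sorting everything.
    return heapq.nlargest(limit, followers.items(), key=lambda item: item[1])


counts = {"Hello": {"World": 5, "munde": 1, "there": 3}}
print(top_followers(counts, "Hello", limit=2))
```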

Anyway, I think the database approach mentioned earlier is the most flexible one, especially in the long run...

Answer by pi.

If you really, really want to use dictionary-like semantics, try SQLAlchemy's associationproxy. The following (rather long) piece of code translates your dictionary into key/value pairs in the entries table. I do not know how SQLAlchemy copes with your big dictionary, but SQLite should be able to handle it nicely.

from sqlalchemy import create_engine, MetaData
from sqlalchemy import Table, Column, Integer, ForeignKey, Unicode, UnicodeText
from sqlalchemy.orm import mapper, sessionmaker, scoped_session, Query, relation
from sqlalchemy.orm.collections import column_mapped_collection
from sqlalchemy.ext.associationproxy import association_proxy
from sqlalchemy.schema import UniqueConstraint

engine = create_engine('sqlite:///newspapers.db')

metadata = MetaData()
metadata.bind = engine

Session = scoped_session(sessionmaker(engine))
session = Session()

newspapers = Table('newspapers', metadata,
    Column('newspaper_id', Integer, primary_key=True),
    Column('newspaper_name', Unicode(128)),
)

entries = Table('entries', metadata,
    Column('entry_id', Integer, primary_key=True),
    Column('newspaper_id', Integer, ForeignKey('newspapers.newspaper_id')),
    Column('entry_key', Unicode(255)),
    Column('entry_value', UnicodeText),
    UniqueConstraint('entry_key', 'entry_value', name="pair"),
)

class Base(object):

    def __init__(self, **kw):
        for key, value in kw.items():
            setattr(self, key, value)

    query = Session.query_property(Query)

def create_entry(key, value):
    return Entry(entry_key=key, entry_value=value)

class Newspaper(Base):

    entries = association_proxy('entry_dict', 'entry_value',
        creator=create_entry)

class Entry(Base):
    pass

mapper(Newspaper, newspapers, properties={
    'entry_dict': relation(Entry,
        collection_class=column_mapped_collection(entries.c.entry_key)),
})
mapper(Entry, entries)

metadata.create_all()

dictionary = {
    u'foo': u'bar',
    u'baz': u'quux'
}

roll = Newspaper(newspaper_name=u"The Toilet Roll")
session.add(roll)
session.flush()

roll.entries = dictionary
session.flush()

for entry in Entry.query.all():
    print entry.entry_key, entry.entry_value
session.commit()

session.expire_all()

print Newspaper.query.filter_by(newspaper_id=1).one().entries

gives

foo bar
baz quux
{u'foo': u'bar', u'baz': u'quux'}