A better way to load MongoDB data to a DataFrame using Pandas and PyMongo?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/24963062/

Tags: python, pandas, pymongo

Asked by blue_chip

I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame. However, I get an error:

MemoryError:    

My code looks like this:

from pandas import DataFrame

cursor = tweets.find()  # tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

I've tried the methods from other answers, which at some point build a list of every element in the database before loading it into the DataFrame.

However, another answer that discusses list() points out that it's only suitable for small data sets, because everything gets loaded into memory.

In my case, I think that's the source of the error: there is too much data to load into memory. What other method can I use?

Accepted answer by blue_chip

I've modified my code to the following:

cursor = tweets.find(fields=['id'])  # PyMongo 3+ renamed this parameter to projection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

By adding the fields parameter to the find() call, I restricted the output. This means I'm not loading every field into the DataFrame, only the selected ones. Everything works fine now.
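
For newer PyMongo releases (3.0 and later), the fields parameter was renamed to projection. A minimal sketch of the same query under that API, where the client, database, and collection names are assumptions:

import pymongo
from pandas import DataFrame

client = pymongo.MongoClient()        # assumes a local MongoDB instance
tweets = client.twitter_db.tweets    # hypothetical database/collection names

# projection=['id'] is the modern spelling of fields=['id']
cursor = tweets.find({}, projection=['id'])
result = DataFrame(list(cursor), columns=['id'])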

Answer by shx2

The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query like yours would be to use Monary.

This post has a nice and concise explanation.
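
Monary is no longer maintained, but for context, a rough sketch of its historical API; the server address, database name, and field type below are assumptions:

import numpy as np
import pandas as pd
from monary import Monary  # historical library, unmaintained today

m = Monary('127.0.0.1')  # assumes a local MongoDB server
# query(db, collection, filter, field names, Monary type names)
arrays = m.query('twitter_db', 'tweets', {}, ['id'], ['int64'])
df = pd.DataFrame({'id': np.asarray(arrays[0])})

Monary returned NumPy arrays directly, skipping Python dicts entirely, which is where its speed and memory savings came from.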

Answer by Yayati Sule

An elegant way of doing it is as follows:

import pandas as pd

def my_transform_logic(x):
    if x:
        result = do_something(x)  # placeholder for your per-value transform
        return result

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # make a list of dictionaries and insert them back
    db.collection_name.insert_many(df.to_dict('records'))

    # or upsert record by record; update_many expects a filter/update pair,
    # so a list of records cannot be passed to it directly
    for record in df.to_dict('records'):
        db.collection_name.replace_one({'_id': record['_id']}, record, upsert=True)

# make a list of cursors; see the parallel_scan API of PyMongo
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I tried the above process on a MongoDB collection with 2.6 million records, using Joblib to parallelize the code above. It didn't throw any memory errors, and the processing finished in 2 hours.
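
Note that parallel_scan has since been deprecated and removed (the underlying server command was dropped in MongoDB 4.2, and PyMongo 4 removed the method). A portable alternative that still keeps memory bounded is to build a DataFrame from fixed-size slices of a single cursor. A minimal sketch, where the collection handle and the per-chunk handler process_df are hypothetical:

import pandas as pd

def iter_chunks(cursor, size=100_000):
    # yield a DataFrame for every `size` documents pulled from the cursor
    chunk = []
    for doc in cursor:
        chunk.append(doc)
        if len(chunk) >= size:
            yield pd.DataFrame(chunk)
            chunk = []
    if chunk:
        yield pd.DataFrame(chunk)

for df in iter_chunks(mongo_collection.find(projection=['id'])):
    process_df(df)  # hypothetical per-chunk processing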

Answer by Edgar Ramírez Mondragón

The from_records classmethod is probably the best way to do it:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)
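
Because find() returns an iterator, from_records consumes it without first materializing a list. If memory is still tight, pandas' nrows argument caps how many rows are read from an iterator, and a projection trims fields server-side; the field list and row cap below are assumptions:

# restrict fields server-side, cap rows client-side
data = client.mydb.mycollection.find({}, projection=['id'])
df = pd.DataFrame.from_records(data, nrows=1_000_000)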