A better way to load MongoDB data to a DataFrame using Pandas and PyMongo?

Disclaimer: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/24963062/

Tags: python, pandas, pymongo

Asked by blue_chip

I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a DataFrame. However, I get an error:

MemoryError:    

My code looks like this:

from pandas import DataFrame

cursor = tweets.find()  # tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

I've tried the methods from other answers, which at some point build a list of every element in the database before loading it into the DataFrame.

However, another answer that discusses list() points out that it's only suitable for small data sets, because everything gets loaded into memory.

In my case, I think that's the source of the error: there is too much data to load into memory. What other method can I use?

Accepted answer by blue_chip

I've modified my code to the following:

cursor = tweets.find(fields=['id'])  # PyMongo 3+ renamed this parameter to projection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

By adding the fields parameter to the find() call, I restricted the output. This means I'm not loading every field into the DataFrame, only the selected ones. Everything works fine now.
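
For newer PyMongo releases (3.0 and later), the fields parameter was renamed to projection. A minimal sketch of the same query under that API, where the client, database, and collection names are assumptions:

import pymongo
from pandas import DataFrame

client = pymongo.MongoClient()        # assumes a local MongoDB instance
tweets = client.twitter_db.tweets    # hypothetical database/collection names

# projection=['id'] is the modern spelling of fields=['id']
cursor = tweets.find({}, projection=['id'])
result = DataFrame(list(cursor), columns=['id'])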

Answer by shx2

The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query like yours would be to use Monary.

This post has a nice and concise explanation.
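
Monary is no longer maintained, but for context, a rough sketch of its historical API; the server address, database name, and field type below are assumptions:

import numpy as np
import pandas as pd
from monary import Monary  # historical library, unmaintained today

m = Monary('127.0.0.1')  # assumes a local MongoDB server
# query(db, collection, filter, field names, Monary type names)
arrays = m.query('twitter_db', 'tweets', {}, ['id'], ['int64'])
df = pd.DataFrame({'id': np.asarray(arrays[0])})

Monary returned NumPy arrays directly, skipping Python dicts entirely, which is where its speed and memory savings came from.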

Answer by Yayati Sule

An elegant way of doing it is as follows:

import pandas as pd

def my_transform_logic(x):
    if x:
        result = do_something(x)  # placeholder for your per-value transform
        return result

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # make a list of dictionaries and insert them back
    db.collection_name.insert_many(df.to_dict('records'))

    # or upsert record by record; update_many expects a filter/update pair,
    # so a list of records cannot be passed to it directly
    for record in df.to_dict('records'):
        db.collection_name.replace_one({'_id': record['_id']}, record, upsert=True)

# make a list of cursors; see the parallel_scan API of PyMongo
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I tried the above process on a MongoDB collection with 2.6 million records, using Joblib to parallelize the code above. It didn't throw any memory errors, and the processing finished in 2 hours.
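
Note that parallel_scan has since been deprecated and removed (the underlying server command was dropped in MongoDB 4.2, and PyMongo 4 removed the method). A portable alternative that still keeps memory bounded is to build a DataFrame from fixed-size slices of a single cursor. A minimal sketch, where the collection handle and the per-chunk handler process_df are hypothetical:

import pandas as pd

def iter_chunks(cursor, size=100_000):
    # yield a DataFrame for every `size` documents pulled from the cursor
    chunk = []
    for doc in cursor:
        chunk.append(doc)
        if len(chunk) >= size:
            yield pd.DataFrame(chunk)
            chunk = []
    if chunk:
        yield pd.DataFrame(chunk)

for df in iter_chunks(mongo_collection.find(projection=['id'])):
    process_df(df)  # hypothetical per-chunk processing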

Answer by Edgar Ramírez Mondragón

The from_records classmethod is probably the best way to do it:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)
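
Because find() returns an iterator, from_records consumes it without first materializing a list. If memory is still tight, pandas' nrows argument caps how many rows are read from an iterator, and a projection trims fields server-side; the field list and row cap below are assumptions:

# restrict fields server-side, cap rows client-side
data = client.mydb.mycollection.find({}, projection=['id'])
df = pd.DataFrame.from_records(data, nrows=1_000_000)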