Insert a Pandas DataFrame into mongodb using PyMongo
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/20167194/
Asked by Nyxynyx
What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?
Attempts
db.myCollection.insert(df.to_dict())
gave an error
InvalidDocument: documents must have only string keys, the key was Timestamp('2013-11-23 13:31:00', tz=None)
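The error arises because df.to_dict() defaults to orient='dict', producing {column: {index: value}}, so the Timestamps of the DatetimeIndex end up as document keys, and BSON allows only string keys. One hypothetical workaround is to stringify the index first:

db.myCollection.insert(df.rename(index=str).to_dict())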
db.myCollection.insert(df.to_json())
gave an error
TypeError: 'str' object does not support item assignment
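This fails because to_json() returns a plain str, and insert() tries to set an _id field on the document it receives, which a string does not support.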
db.myCollection.insert({id: df.to_json()})
gave an error
InvalidDocument: documents must have only string keys, key was <built-in function id>
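Here id without quotes is Python's built-in id function, not the string 'id', so the key is not a string; it would need to be quoted as 'id'.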
df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount 150 non-null values
price 150 non-null values
tid 150 non-null values
dtypes: float64(2), int64(1)
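For reference, a DataFrame with this structure can be built from made-up data (the values below are hypothetical; only the shape and dtypes match the question):

import numpy as np
import pandas as pd

index = pd.date_range('2013-11-23 13:24:07', periods=150, freq='s')
df = pd.DataFrame({'amount': np.random.rand(150),
                   'price': 100 * np.random.rand(150),
                   'tid': np.arange(150)}, index=index)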
Accepted answer by alko
I doubt there is a method that is both quickest and simple. If you don't worry about data conversion, you can do
>>> import json
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
A
1 2013-11-23 21:14:34.118531
>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)
But in case you try to load data back, you'll get:
>>> df = read_mongo(db, 'myCollection')
>>> df
A
0 1385241274118531000
>>> df.dtypes
A int64
dtype: object
so you'll have to convert the 'A' column back to datetime, as well as all fields that are not int, float or str in your DataFrame. For this example:
>>> df['A'] = pd.to_datetime(df['A'])
>>> df
A
0 2013-11-23 21:14:34.118531
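The read_mongo helper used above is not defined in the answer; a minimal sketch of what it could look like (the name and behavior are assumptions, not part of the original answer):

import pandas as pd

def read_mongo(db, collection_name, query=None):
    # Load a MongoDB collection into a DataFrame, dropping Mongo's own _id key
    cursor = db[collection_name].find(query or {})
    df = pd.DataFrame(list(cursor))
    return df.drop(columns=['_id'], errors='ignore')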
Answer by PasteBT
how about this:
db.myCollection.insert({'id': df.to_json()})  # note: the key must be the string 'id', not the id builtin
id will be a unique string for that df
Answer by dieguico
Here you have the very quickest way: use the insert_many method from PyMongo 3 and the 'records' parameter of the to_dict method.
db.collection.insert_many(df.to_dict('records'))
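A minimal end-to-end sketch of this approach, assuming a local mongod on the default port (the database and collection names are placeholders):

import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
collection = client['mydb']['mycollection']

df = pd.DataFrame({'amount': [1.5, 2.0], 'price': [100.0, 101.5], 'tid': [1, 2]})
# orient='records' produces one dict per row, keyed by column name
collection.insert_many(df.to_dict('records'))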
Answer by Radical Edward
If your dataframe has missing data (i.e. None, nan) and you don't want null key values in your documents:
db.insert_many(df.to_dict("records")) will insert keys with null values. If you don't want the empty key values in your documents, you can use a modified version of the pandas .to_dict("records") code below:
from pandas.core.common import _maybe_box_datetimelike  # private pandas internal
# keep a (column, value) pair only when the value is neither None nor NaN (NaN != NaN)
my_list = [dict((k, _maybe_box_datetimelike(v)) for k, v in zip(df.columns, row) if v != None and v == v) for row in df.values]
db.insert_many(my_list)
where I've added the if v != None and v == v checks to make sure the value is not None or nan before putting it in the row's dictionary. Now your .insert_many will only include keys with values in the documents (and no null data types).
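Note that _maybe_box_datetimelike is a private pandas internal and may not be importable from pandas.core.common in later versions; the same idea can be sketched without pandas internals (assuming None and NaN are the only missing-value markers):

records = [{k: v for k, v in row.items() if v is not None and v == v}  # v == v filters out NaN
           for row in df.to_dict('records')]
db.insert_many(records)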
Answer by Rafael Valero
I think there are cool ideas in this question. In my case I have spent more time taking care of the movement of large dataframes. In those cases pandas tends to give you a chunksize option (see for example pandas.DataFrame.to_sql), so I think I can contribute here by adding the function I am using in this direction.
def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    This function takes a DataFrame and creates a collection in MongoDB
    (you should provide the database name, the collection name, the server
    of the remote database and the port to connect to it).

    ---------------------------------------------------------------------------
    Parameters / Input
        my_df:           the DataFrame to send to MongoDB
        database_name:   database name
        collection_name: collection name (to create)
        server:          the server where the MongoDB database is hosted
                         Example: server = '132.434.63.86'
        mongodb_port:    the port where the database is operating
                         Example: mongodb_port = 27017
        chunk_size:      the number of records that will be sent to the
                         database at a time. Default is 100.

    Output
        Prints "Done" when finished.
    ----------------------------------------------------------------------------
    FUTURE modifications:
    1. Write to SQL
    2. Write to csv
    ----------------------------------------------------------------------------
    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # To connect
    # import pandas as pd
    # import pymongo
    # from pymongo import MongoClient
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]
    # To write
    collection.delete_many({})  # destroy the collection first
    # aux_df = aux_df.drop_duplicates(subset=None, keep='last')  # to avoid repetitions
    my_list = my_df.to_dict('records')
    l = len(my_list)
    # Chunk boundaries: chunk_size, 2*chunk_size, ..., then l itself
    steps = list(range(chunk_size, l, chunk_size))
    steps.append(l)
    # Insert chunks of the dataframe
    i = 0
    for j in steps:
        print(j)
        collection.insert_many(my_list[i:j])  # fill the collection
        i = j
    print('Done')
    return
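A hypothetical call to the function above, assuming a local mongod on the default port:

import pandas as pd
from pymongo import MongoClient  # required by write_df_to_mongoDB

df = pd.DataFrame({'amount': [1.0, 2.5, 3.0],
                   'price': [10.0, 20.0, 30.0],
                   'tid': [1, 2, 3]})
write_df_to_mongoDB(df,
                    database_name='testdb',
                    collection_name='testcollection',
                    server='localhost',
                    mongodb_port=27017,
                    chunk_size=2)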
Answer by Jon Pan
Just make string keys!
import json
dfData = json.dumps(df.to_dict('records'))  # serialize the rows to one JSON string
saveData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = client.insert_one(saveData)  # 'client' here must be a pymongo Collection
##### load dfData
data = client.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
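Note that this stores the whole DataFrame as a single JSON string inside one document, so individual rows cannot be queried in MongoDB, and because to_dict('records') is used the original index is not preserved on the round-trip.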

