如何将 mongodb 集合中的数据加载到 Pandas 的 DataFrame 中?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/17805304/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How can I load data from mongodb collection into pandas' DataFrame?
提问by user2161725
I am new to pandas (well, to all things "programming"...), but have been encouraged to give it a try. I have a mongodb database - "test" - with a collection called "tweets". I access the database in ipython:
我是 Pandas 的新手(嗯,对所有“编程”......),但一直被鼓励尝试一下。我有一个 mongodb 数据库 - “test” - 有一个名为“tweets”的集合。我在 ipython 中访问数据库:
import sys
import pymongo
from pymongo import Connection
connection = Connection()
db = connection.test
tweets = db.tweets
the document structure of documents in tweets is as follows:
推文中文档的文档结构如下:
entities': {u'hashtags': [],
u'symbols': [],
u'urls': [],
u'user_mentions': []},
u'favorite_count': 0,
u'favorited': False,
u'filter_level': u'medium',
u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'},
u'id': 349223842700472320L,
u'id_str': u'349223842700472320',
u'in_reply_to_screen_name': None,
u'in_reply_to_status_id': None,
u'in_reply_to_status_id_str': None,
u'in_reply_to_user_id': None,
u'in_reply_to_user_id_str': None,
u'lang': u'en',
u'place': {u'attributes': {},
u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate],
[-placeholder coordinate, placeholder coordinate]]],
u'type': u'Polygon'},
u'country': u'placeholder country',
u'country_code': u'example',
u'full_name': u'name, xx',
u'id': u'user id',
u'name': u'name',
u'place_type': u'city',
u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
u'retweet_count': 0,
u'retweeted': False,
u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
u'text': u'example text',
u'truncated': False,
u'user': {u'contributors_enabled': False,
u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
u'default_profile': False,
u'default_profile_image': False,
u'description': u'example description',
u'favourites_count': 100,
u'follow_request_sent': None,
u'followers_count': 100,
u'following': None,
u'friends_count': 100,
u'geo_enabled': True,
u'id': placeholder_id,
u'id_str': u'placeholder_id',
u'is_translator': False,
u'lang': u'en',
u'listed_count': 0,
u'location': u'example place',
u'name': u'example name',
u'notifications': None,
u'profile_background_color': u'000000',
u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
u'profile_background_tile': False,
u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
u'profile_link_color': u'000000',
u'profile_sidebar_border_color': u'FFFFFF',
u'profile_sidebar_fill_color': u'000000',
u'profile_text_color': u'000000',
u'profile_use_background_image': False,
u'protected': False,
u'screen_name': placeholder screen_name',
u'statuses_count': xxxx,
u'time_zone': u'placeholder time_zone',
u'url': None,
u'utc_offset': -21600,
u'verified': False}}
Now, as far as I understand, pandas' main data structure - a spreadsheet-like table - is called DataFrame. How can I load the data from my "tweets" collection into pandas' DataFrame? And how can I query for a subdocument within the database?
现在,据我所知,pandas 的主要数据结构——一个类似电子表格的表格——被称为 DataFrame。如何将“推文”集合中的数据加载到 Pandas 的 DataFrame 中?以及如何查询数据库中的子文档?
回答by waitingkuo
Comprehend the cursor you got from the MongoDB before passing it to DataFrame
在将游标传递给 DataFrame 之前理解从 MongoDB 获得的游标
import pandas as pd
df = pd.DataFrame(list(tweets.find()))
回答by Mark Unsworth
If you have data in MongoDb like this:
如果您在 MongoDb 中有这样的数据:
[
{
"name": "Adam",
"age": 27,
"address":{
"number": 4,
"street": "Main Road",
"city": "Oxford"
}
},
{
"name": "Steve",
"age": 32,
"address":{
"number": 78,
"street": "High Street",
"city": "Cambridge"
}
}
]
You can put the data straight into a dataframe like this:
您可以将数据直接放入数据框中,如下所示:
from pandas import DataFrame
df = DataFrame(list(db.collection_name.find({}))
And you will get this output:
你会得到这个输出:
df.head()
| | name | age | address |
|----|---------|------|-----------------------------------------------------------|
| 1 | "Steve" | 27 | {"number": 4, "street": "Main Road", "city": "Oxford"} |
| 2 | "Adam" | 32 | {"number": 78, "street": "High St", "city": "Cambridge"} |
However the subdocuments will just appear as JSON inside the subdocument cell. If you want to flatten objects so that subdocument properties are shown as individual cells you can use json_normalizewithout any parameters.
但是,子文档将在子文档单元格内仅显示为 JSON。如果您想展平对象以便子文档属性显示为单个单元格,您可以使用不带任何参数的json_normalize。
from pandas.io.json import json_normalize
datapoints = list(db.collection_name.find({})
df = json_normalize(datapoints)
df.head()
This will give the dataframe in this format:
这将以这种格式提供数据框:
| | name | age | address.number | address.street | address.city |
|----|--------|------|----------------|----------------|--------------|
| 1 | Thomas | 27 | 4 | "Main Road" | "Oxford" |
| 2 | Mary | 32 | 78 | "High St" | "Cambridge" |
回答by saimadhu.polamuri
You can load your MongoDB data to pandas DataFame using this code. It works for me. Hope for you too.
您可以使用此代码将 MongoDB 数据加载到 Pandas DataFame。这个对我有用。也希望你。
import pymongo
import pandas as pd
from pymongo import Connection
connection = Connection()
db = connection.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))
回答by user3575499
Use: df=pd.DataFrame.from_dict(collection)
使用:df=pd.DataFrame.from_dict(collection)