Disclaimer: this page is a Chinese-English side-by-side translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/22811791/

Time: 2020-09-13 21:53:12  Source: igfitidea

Using pandas to read JSON file for Python analysis

Tags: python, json, pandas, analysis

Asked by user1745447

I'm running into some issues trying to load a JSON file in my Python editor so that I can run some analysis on the data in it.

The JSON file is at the following path: 'C:\Users\Admin\JSON files\file1.JSON'

It contains the following tweet data (this is just one record, there are hundreds in there):

{
    "created": "Fri Mar 13 18:09:33 GMT 2014",
    "description": "Tweeting the latest Playstation news!",
    "favourites_count": 4514,
    "followers": 235,
    "following": 1345,
    "geo_lat": null,
    "geo_long": null,
    "hashtags": "",
    "id": 2144411414,
    "is_retweet": false,
    "is_truncated": false,
    "lang": "en",
    "location": "",
    "media_urls": "",
    "mentions": "",
    "name": "Playstation News",
    "original_text": null,
    "reply_status_id": 0,
    "reply_user_id": 0,
    "retweet_count": 4514,
    "retweet_id": 0,
    "score": 0.0,
    "screen_name": "SevenPS4",
    "source": "<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
    "text": "tweetinfohere",
    "timezone": "Amsterdam",
    "url": null,
    "urls": "http://bit.ly/1lcbBW6",
    "user_created": "2013-05-19",
    "user_id": 13313,
    "utc_offset": 3600
}

I am using the following code to try and test this data:

import json
import pandas as pa
z = pa.read_json('C:\Users\Admin\JSON files\file1.JSON')
d = pa.DataFrame.from_dict([{k:v} for k,v in z.iteritems() if k in ["retweet_count", "user_id", "is_retweet"]])
print d.retweet_count.sum()

When I run this, it successfully reads the JSON file, then prints out a list of the retweet_count values like this:

0, 4514
1, 300
2, 450
3, 139
etc.

My questions: How do I actually sum up all of the retweet_count/user_id values rather than just listing them as shown above?

How do I then divide this sum by the number of entries to get an average?

How can I choose a sample size of the JSON data rather than use it all? (I thought it was d.iloc[:10] but that doesn't work)

With the 'is_retweet' field in the JSON file, is it possible to count how many values are true and how many are false? That is, within the JSON file, I want the number of tweets that were retweeted and the number that weren't.

Thanks in advance. Yes, I'm pretty new to this.

z.info() gives:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 0 to 505
Data columns (total 31 columns):
created             506 non-null object
description         506 non-null object
favourites_count    506 non-null int64
followers           506 non-null int64
following           506 non-null int64
geo_lat             10 non-null float64
geo_long            10 non-null float64
hashtags            506 non-null object
id                  506 non-null int64
is_retweet          506 non-null bool
is_truncated        506 non-null bool
lang                506 non-null object
location            506 non-null object
media_urls          506 non-null object
mentions            506 non-null object
name                506 non-null object
original_text       172 non-null object
reply_status_id     506 non-null int64
reply_user_id       506 non-null int64
retweet_id          506 non-null int64
retweet_count       506 non-null int64
score               506 non-null int64
screen_name         506 non-null object
source              506 non-null object
status_count        506 non-null int64
text                506 non-null object
timezone            415 non-null object
url                 273 non-null object
urls                506 non-null object
user_created        506 non-null object
user_id             506 non-null int64
utc_offset          506 non-null int64
dtypes: bool(2), float64(2), int64(11), object(16)

How come it is showing retweet_count and user_id as objects when I run d.info()?

Answered by myacobucci

d.retweet_count is a list of dictionaries holding your retweet counts, correct?
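As a side note on the d.info() question: the object dtype likely comes from how d was built, since each dictionary in the list maps a column name to an entire column rather than to a single number. A minimal sketch of an alternative, assuming a recent pandas and a small made-up file standing in for file1.JSON (the three records and the temporary path are illustrative, not the asker's data), selects the columns directly so that they keep their original dtypes:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for file1.JSON: three made-up records.
records = [
    {"retweet_count": 4514, "user_id": 13313, "is_retweet": False},
    {"retweet_count": 300, "user_id": 13314, "is_retweet": True},
    {"retweet_count": 450, "user_id": 13315, "is_retweet": False},
]
path = os.path.join(tempfile.mkdtemp(), "file1.json")
with open(path, "w") as fh:
    json.dump(records, fh)

# For a real Windows path, a raw string keeps the backslashes literal:
# r'C:\Users\Admin\JSON files\file1.JSON'
z = pd.read_json(path)

# Select the columns of interest directly instead of rebuilding the frame.
d = z[["retweet_count", "user_id", "is_retweet"]]
print(d.dtypes)
```

With this selection, d is an ordinary DataFrame with one row per tweet, so column methods such as sum() operate on numbers.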

So to get the sum:

keys = d.retweet_count.keys()
sum = 0  # running total (note: this name shadows the built-in sum())
for items in keys:
    sum += d.retweet_count[items]

To get the average:

avg = sum / float(len(keys))  # float() avoids integer division on Python 2
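If the column holds plain numbers, pandas can also compute both values without an explicit loop. The sketch below is only an illustration: it assumes a DataFrame named z with a numeric retweet_count column and uses the four counts quoted in the question as sample data.

```python
import pandas as pd

# The four retweet counts shown in the question, used here as sample data.
z = pd.DataFrame({"retweet_count": [4514, 300, 450, 139]})

total = z["retweet_count"].sum()     # sum of the column
average = z["retweet_count"].mean()  # sum divided by the number of rows

print(total, average)
```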

Now, to take a sample, just slice keys:

sample_keys = keys[0:10]
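On an ordinary DataFrame the same idea can be expressed with row selection. This is a sketch under assumptions: the 50-row frame is made up, and random_state=0 is chosen only to make the random draw repeatable.

```python
import pandas as pd

# Made-up data: 50 rows standing in for the tweets.
z = pd.DataFrame({"retweet_count": range(50)})

first_ten = z.head(10)                       # first 10 rows, same as z.iloc[:10]
random_ten = z.sample(n=10, random_state=0)  # 10 random rows, repeatable

print(first_ten["retweet_count"].mean())
print(random_ten["retweet_count"].mean())
```

head() is the simplest choice when any 10 rows will do; sample() may be preferable when the file is ordered in a way that could bias the result.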

To get the mean of the sample:

sum = 0  # reset the running total before summing the sample
for items in sample_keys:
    sum += d.retweet_count[items]
avg = sum / float(len(sample_keys))
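The question about counting true and false values in is_retweet is not covered above. One possible approach, sketched here with made-up flags and assuming the column has a boolean dtype as z.info() reports, is value_counts():

```python
import pandas as pd

# Made-up flags standing in for the is_retweet column.
z = pd.DataFrame({"is_retweet": [False, True, False, False, True]})

counts = z["is_retweet"].value_counts()      # one count per distinct value
n_retweets = int(z["is_retweet"].sum())      # True counts as 1
n_original = int((~z["is_retweet"]).sum())   # invert the flags, then count

print(counts)
print(n_retweets, n_original)
```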