Create pandas dataframe from json objects
Disclaimer: this page is an English rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/20643437/
Asked by horatio1701d
I finally have output of data I need from a file with many json objects but I need some help with converting the below output into a single dataframe as it loops through the data. Here is the code to produce the output including a sample of what the output looks like:
original data:
{
"zipcode":"08989",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37,"wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}
{
"zipcode":"08990",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37, "wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}
import glob
import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    # group the file into chunks of n lines
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 5):
            try:
                jfile = json.loads(chunk)
                zipcode = jfile['zipcode']
                datetime = jfile['current']['proc_time']
                triggers = jfile['triggers']
                print pd.Series(jfile['zipcode']),\
                      pd.Series(jfile['current']['proc_time']),\
                      jfile['triggers']
            except ValueError, e:
                pass
            else:
                pass
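As an aside, the lines_per_n helper can be exercised on its own; a minimal, self-contained (Python 3) demonstration using an in-memory file for illustration:

```python
import io
import itertools
from itertools import chain

def lines_per_n(f, n):
    # Yield successive chunks of n lines from the file object f.
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

f = io.StringIO("a\nb\nc\nd\ne\n")
chunks = list(lines_per_n(f, 2))
print(chunks)  # ['a\nb\n', 'c\nd\n', 'e\n']
```

Note that a final, shorter chunk is yielded as-is, which is why malformed trailing chunks need the try/except above.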
Here is sample output from running the above, which I would like to store in a pandas dataframe as 3 columns.
08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
00544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48]
So the below code seems a lot closer, in that it gives me a funky df if I pass in the list and transpose the df. Any idea on how I can get this reshaped properly?
def series_chunk(chunk):
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return jfile['zipcode'],\
           jfile['current']['proc_time'],\
           jfile['triggers']

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 7):
            df1 = pd.DataFrame(list(series_chunk(chunk)))
            print df1.T
[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]
[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]
Dataframe:
0 1 2
0 08988 20131126102946 []
0 1 2
0 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
0 1 2
0 08988 20131126102946 []
0 1 2
0 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
Here is my final code and output. How do I capture each dataframe it creates through the loop and concatenate them on the fly as one dataframe object?
for fin in glob.glob('*.txt'):
    with open(fin) as f:
        print pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 7)], axis=1).T
0 1 2
0 08988 20131126102946 []
1 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
0 1 2
0 08988 20131126102946 []
1 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
Accepted answer by Andy Hayden
Note: For those of you arriving at this question looking to parse json into pandas, if you do have valid json (this question doesn't) then you should use the pandas read_json function:
# can either pass string of the json, or a filepath to a file with valid json
In [99]: pd.read_json('[{"A": 1, "B": 2}, {"A": 3, "B": 4}]')
Out[99]:
A B
0 1 2
1 3 4
Check out the IO part of the docs for several examples, the arguments you can pass to this function, and ways to normalize less structured json.
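For the "less structured" case, one of the normalization tools the docs describe is pd.json_normalize (a top-level function since pandas 1.0; in older versions it lived under pandas.io.json). A sketch on a trimmed-down record shaped like the question's data (the field subset is illustrative only):

```python
import pandas as pd

# A trimmed-down record shaped like the question's data (illustrative only)
record = {
    "zipcode": "08989",
    "current": {"dewpt": 19, "feels_like": 34},
    "triggers": [53, 31],
}

# Nested dicts are flattened into dot-separated column names;
# lists are left as-is in a single column.
df = pd.json_normalize(record)
print(sorted(df.columns))
# ['current.dewpt', 'current.feels_like', 'triggers', 'zipcode']
```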
If you don't have valid json, it's often efficient to munge the string before reading it in as json; for example, see this answer.
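One generic way to munge input like the question's (back-to-back JSON objects with no enclosing array) is to decode the objects one at a time with json.JSONDecoder.raw_decode; this is a sketch, not the approach from the linked answer:

```python
import json

def iter_json_objects(text):
    # Stream concatenated JSON objects from one string using raw_decode,
    # which returns the decoded object plus the index where it ended.
    decoder = json.JSONDecoder()
    idx, n = 0, len(text)
    while idx < n:
        # Skip whitespace between objects
        while idx < n and text[idx].isspace():
            idx += 1
        if idx >= n:
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

text = '{"zipcode": "08989"}\n{"zipcode": "08990"}\n'
objs = list(iter_json_objects(text))
print([o["zipcode"] for o in objs])  # ['08989', '08990']
```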
If you have several json files you should concat the DataFrames together (similar to this answer):
pd.concat([pd.read_json(file) for file in ...], ignore_index=True)
Original answer for this example:
Use a lookbehind in the regex for the separator passed to read_csv:
In [11]: df = pd.read_csv('foo.csv', sep=r'(?<!,)\s', header=None)
In [12]: df
Out[12]:
0 1 2
0 8988 20131126102946 []
1 8989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
2 8988 20131126102946 []
3 8989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
4 544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
5 601 20131126094911 []
6 602 20131126101056 []
7 603 20131126101056 []
8 604 20131126101056 []
9 544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
10 601 20131126094911 []
11 602 20131126101056 []
12 603 20131126101056 []
13 604 20131126101056 []
[14 rows x 3 columns]
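The same lookbehind separator can be tried on an in-memory copy of the sample output (engine='python' is required for a regex separator, and the raw string avoids escape warnings in modern Python):

```python
import io

import pandas as pd

data = "08988 20131126102946 []\n08989 20131126102946 [53, 31, 9]\n"

# Split on whitespace only when it is NOT preceded by a comma,
# so the spaces inside the trigger lists are preserved.
df = pd.read_csv(io.StringIO(data), sep=r'(?<!,)\s', header=None, engine='python')
print(df.shape)  # (2, 3)
```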
As mentioned in the comments, you may be able to do this more directly by concatenating several Series together... It's also going to be a little easier to follow:
def series_chunk(chunk):
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return pd.Series([jfile['zipcode'], jfile['current']['proc_time'], jfile['triggers']])

dfs = []
for fin in glob.glob('*.txt'):
    with open(fin) as f:
        df = pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 5)], axis=1)
        dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
Note: You can also move the try/except into series_chunk.
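One way that note could look, sketched here in Python 3 (returning None for bad chunks and the KeyError handling are my additions, not part of the original answer):

```python
import json

import pandas as pd

def series_chunk(chunk):
    # Return None for chunks that fail to parse (or lack the expected
    # fields), so the caller can filter them out before concatenating.
    try:
        jfile = json.loads(chunk)
        return pd.Series([jfile['zipcode'],
                          jfile['current']['proc_time'],
                          jfile['triggers']])
    except (ValueError, KeyError):
        return None

good = '{"zipcode": "08989", "current": {"proc_time": "20131126102946"}, "triggers": [53, 31]}'
bad = '{this is not json'

print(series_chunk(good).tolist())  # ['08989', '20131126102946', [53, 31]]
print(series_chunk(bad))            # None
```

With this version, chunks that come back as None can simply be filtered out of the list before the pd.concat call.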

