Original source: http://stackoverflow.com/questions/49081097/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Slice pandas dataframe json column into columns
Asked by Moses
I have a data frame with two columns in JSON format, like this:
author  biblio  series
Mehrdad Vahabi  {'volume': 68, 'month': 'January', 'name': 'János Kornai', 'issue': 's', 'handle': 'n:v:68:y:2018:i', 'year': '2018', 'pages': '27-52', 'doi': ''}  {'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'}
Michael Bailey  {'c_date': '2017', 'number': '23608', 'handle': 'RePEc:nbr:nberwo:23608', 'name': 'Measuring'}  {'handle': '', 'name': ''}
I want my data frame to look like this:
author          biblio.volume  biblio.month  biblio.name   biblio.issue  biblio.handle           biblio.year  biblio.pages  biblio.doi  biblio.c_date  biblio.number  series.handle     series.name
Mehrdad Vahabi  68             January       János Kornai  s             n:v:68:y:2018:i         2018         27-52         NA          NA             NA             RePEc:aka:aoecon  Oeconomica
Michael Bailey  NA             NA            Measuring     NA            RePEc:nbr:nberwo:23608  NA           NA            NA          2017           23608
I tried to do it using the answers in this question, but none of them works for me.
How can I do it?
[EDIT] Here is a sample of the data
[EDIT]
Following @jezrael's solution, I get this:
df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns
df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns
col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)
Traceback (most recent call last):
File "dfs.py", line 8, in <module>
df1.columns = 'bibliographic.'+ df1.columns
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 583, in _evaluate_numeric_binop
other = self._validate_for_numeric_binop(other, op, opstr)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3961, in _validate_for_numeric_binop
raise TypeError("can only perform ops with scalar values")
TypeError: can only perform ops with scalar values
And with json_normalize:
Traceback (most recent call last):
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "dfs.py", line 7, in <module>
df = json_normalize(d)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/io/json/normalize.py", line 192, in json_normalize
if any([isinstance(x, dict) for x in compat.itervalues(data[0])]):
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in __getitem__
return self._getitem_column(key)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
values = self._data.get(item)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3843, in get
loc = self.items.get_loc(item)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2527, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
Following @Jhon H's solution, I get this:
Traceback (most recent call last):
File "dfs.py", line 7, in <module>
jsonSeries = df[['bibliographic']].tolist()
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'tolist'
Answered by jezrael
For each dict column, create a new DataFrame with the constructor, then concat them all together:
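import pandas as pd

# Expand each column of dicts into its own DataFrame (one column per key)
df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.' + df1.columns
df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.' + df2.columns
# Keep the remaining columns and join everything side by side
col = df.columns.difference(['biblio', 'series'])
df = pd.concat([df[col], df1, df2], axis=1)
print(df)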
           author  biblio.c_date  biblio.doi           biblio.handle  \
0  Mehrdad Vahabi            NaN                     n:v:68:y:2018:i
1  Michael Bailey           2017         NaN  RePEc:nbr:nberwo:23608

   biblio.issue  biblio.month   biblio.name  biblio.number  biblio.pages  \
0             s       January  Janos Kornai            NaN         27-52
1           NaN           NaN     Measuring          23608           NaN

   biblio.volume  biblio.year     series.handle  series.name
0           68.0         2018  RePEc:aka:aoecon   Oeconomica
1            NaN          NaN
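To check the behaviour without the original file, here is a minimal self-contained sketch of the same approach. The data is a shortened, made-up copy of the question's rows. Passing `index=df.index` to the constructor is an addition that is not in the answer above; it is there on the assumption that the frame may not have the default 0..n-1 index, in which case `pd.concat` would otherwise misalign the rows.

```python
import pandas as pd

# Shortened stand-in for the question's data, with a non-default index
df = pd.DataFrame(
    {
        "author": ["Mehrdad Vahabi", "Michael Bailey"],
        "biblio": [
            {"volume": 68, "name": "Janos Kornai"},
            {"c_date": "2017", "name": "Measuring"},
        ],
        "series": [
            {"handle": "RePEc:aka:aoecon", "name": "Oeconomica"},
            {"handle": "", "name": ""},
        ],
    },
    index=[10, 20],
)

# Build one frame per dict column; reuse df's index so concat aligns the rows
df1 = pd.DataFrame(df["biblio"].tolist(), index=df.index).add_prefix("biblio.")
df2 = pd.DataFrame(df["series"].tolist(), index=df.index).add_prefix("series.")

# Keys missing from a row's dict become NaN in that row
col = df.columns.difference(["biblio", "series"])
flat = pd.concat([df[col], df1, df2], axis=1)
print(flat)
```

Without `index=df.index`, the expanded frames would be labelled 0 and 1, and the concatenation would produce four half-empty rows instead of two.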
EDIT:
If the input is JSON, you can use json_normalize:
# In pandas 1.0 and later, use pd.json_normalize instead of this import
from pandas.io.json import json_normalize
d = [{"author":"Mehrdad Vahabi","biblio":{"volume":68,"month":"January","name":"Janos Kornai","issue":"s","handle":"n:v:68:y:2018:i","year":"2018","pages":"27-52","doi":""},"series":{"handle":"RePEc:aka:aoecon","name":"Oeconomica"}},{"author":"Michael Bailey","biblio":{"c_date":"2017","number":"23608","handle":"RePEc:nbr:nberwo:23608","name":"Measuring"},"series":{"handle":"","name":""}}]
df = json_normalize(d)
print (df)
           author  biblio.c_date  biblio.doi           biblio.handle  \
0  Mehrdad Vahabi            NaN                     n:v:68:y:2018:i
1  Michael Bailey           2017         NaN  RePEc:nbr:nberwo:23608

   biblio.issue  biblio.month   biblio.name  biblio.number  biblio.pages  \
0             s       January  Janos Kornai            NaN         27-52
1           NaN           NaN     Measuring          23608           NaN

   biblio.volume  biblio.year     series.handle  series.name
0           68.0         2018  RePEc:aka:aoecon   Oeconomica
1            NaN          NaN
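The `KeyError: 0` traceback in the question passes through `DataFrame.__getitem__`, which suggests that a DataFrame, rather than a list of dicts, was handed to json_normalize. If the data is already in a frame whose cells are real dicts, one possible workaround is to turn the rows back into records first. This is a sketch on shortened, made-up data; it assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Shortened stand-in for a frame whose cells hold real dicts
df = pd.DataFrame(
    {
        "author": ["Mehrdad Vahabi", "Michael Bailey"],
        "biblio": [
            {"volume": 68, "name": "Janos Kornai"},
            {"c_date": "2017", "name": "Measuring"},
        ],
        "series": [
            {"handle": "RePEc:aka:aoecon", "name": "Oeconomica"},
            {"handle": "", "name": ""},
        ],
    }
)

# json_normalize expects dicts, so turn each row into a {column: value} dict
records = df.to_dict(orient="records")

# Nested keys are flattened into dot-separated column names
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```

The round trip through `to_dict` copies the data, so for very large frames the constructor approach may be the cheaper option.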
EDIT: The problem is that your dictionaries are stored as strings, so you first need to convert them with ast.literal_eval:
import ast
import pandas as pd

df = pd.read_csv('probe.csv')
# The dict columns are read as strings; parse each cell back into a dict
df1 = pd.DataFrame(df['bibliographic'].apply(ast.literal_eval).values.tolist())
df1.columns = 'bibliographic.' + df1.columns
df2 = pd.DataFrame(df['series'].apply(ast.literal_eval).values.tolist())
df2.columns = 'series.' + df2.columns
# Join the untouched columns with the expanded ones
col = df.columns.difference(['bibliographic', 'series'])
df = pd.concat([df[col], df1, df2], axis=1)
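The string problem can be reproduced without probe.csv. The sketch below uses an in-memory CSV whose contents are invented for illustration (only the column names follow the answer) and shows that each cell comes back from `read_csv` as a `str`, which is why the plain constructor approach produced a frame with numeric column labels in the question.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for probe.csv: the dict columns arrive as quoted text
csv_text = """author,bibliographic,series
Mehrdad Vahabi,"{'volume': 68, 'name': 'Janos Kornai'}","{'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'}"
Michael Bailey,"{'c_date': '2017', 'name': 'Measuring'}","{'handle': '', 'name': ''}"
"""
df = pd.read_csv(io.StringIO(csv_text))

# After read_csv every cell is a str, not a dict
print(type(df.loc[0, "bibliographic"]))

# literal_eval parses Python literals (dicts, lists, numbers, strings) without executing code
parsed = df["bibliographic"].apply(ast.literal_eval)
df1 = pd.DataFrame(parsed.tolist(), index=df.index).add_prefix("bibliographic.")
print(df1)
```

`ast.literal_eval` raises an error on text that is not a valid Python literal, so malformed cells would need to be cleaned or skipped before this step.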
Answered by vumaasha
You need to process the columns individually and then join them all together to get the format you need. Here is a simple example that you could follow:
import pandas as pd
records = [{'col1':'v1','col2':{'a1':1,'b1':1},'col3':{'c1':1,'d1':1}},
{'col1':'v2','col2':{'a1':2,'b1':2},'col3':{'c1':2,'d1':2}}]
sample_df = pd.DataFrame(records)
sample_df
col1 col2 col3
0 v1 {'a1': 1, 'b1': 1} {'c1': 1, 'd1': 1}
1 v2 {'a1': 2, 'b1': 2} {'c1': 2, 'd1': 2}
col2_expanded = sample_df.col2.apply(lambda x:pd.Series(x))
col2_expanded.columns = ['{}.{}'.format('col2',i) for i in col2_expanded]
col2_expanded
col2.a1 col2.b1
0 1 1
1 2 2
col3_expanded = sample_df.col3.apply(lambda x:pd.Series(x))
col3_expanded.columns = ['{}.{}'.format('col3',i) for i in col3_expanded]
col3_expanded
col3.c1 col3.d1
0 1 1
1 2 2
final = pd.concat([sample_df[['col1']],col2_expanded,col3_expanded],axis=1)
final
col1 col2.a1 col2.b1 col3.c1 col3.d1
0 v1 1 1 1 1
1 v2 2 2 2 2
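The per-column steps above can be wrapped in a small helper so that any number of dict columns is handled the same way. This is only a sketch: the name `expand_dict_columns` and its `sep` parameter are made up for illustration, and it builds each expanded frame with the DataFrame constructor rather than `apply(pd.Series)`, on the assumption that the result should be the same for cells that are plain dicts.

```python
import pandas as pd


def expand_dict_columns(frame, columns, sep="."):
    """Return a new frame with each dict column replaced by one column per key."""
    parts = [frame.drop(columns=columns)]
    for name in columns:
        # Treat non-dict cells (for example NaN) as empty dicts
        cells = [c if isinstance(c, dict) else {} for c in frame[name]]
        expanded = pd.DataFrame(cells, index=frame.index)
        parts.append(expanded.add_prefix(name + sep))
    return pd.concat(parts, axis=1)


records = [{'col1': 'v1', 'col2': {'a1': 1, 'b1': 1}, 'col3': {'c1': 1, 'd1': 1}},
           {'col1': 'v2', 'col2': {'a1': 2, 'b1': 2}, 'col3': {'c1': 2, 'd1': 2}}]
final = expand_dict_columns(pd.DataFrame(records), ['col2', 'col3'])
print(final)
```

The helper leaves the input frame untouched and returns the flattened copy, so it can be called again with a different set of columns.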