Python: parsing a JSON string loaded from a CSV using Pandas

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/20680272/

Time: 2020-08-18 21:01:18  Source: igfitidea

Parsing a JSON string which was loaded from a CSV using Pandas

Tags: python, pandas

Asked by profesor_tortuga

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:

name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"

After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse the stats column and split it into additional columns?

After about an hour, the only thing I could come up with was:

import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))

This seems like I'm doing it wrong, and it's quite a bit of work considering I'll need to do this on three columns regularly.

The desired output is the DataFrame below. I added the following lines of code to get there in my (crappy) way:

df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df

Out[14]:
          name       dob eye_color  height  weight
0   john smith  1/1/1980     brown     160      76
1   dave jones  2/2/1981      blue     170      85
2  bob roberts  3/3/1982     green     180      94

Accepted answer by Paul

There is a slightly easier way, but ultimately you'll have to call json.loads. There is a notion of a converter in pandas.read_csv:

converters : dict, optional

Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

So first define your custom parser. In this case the below should work:

def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1

In your case you'll have something like:

df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)

We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. This makes each value in the stats column a dict.
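
As a minimal sketch of the converter in action, the snippet below runs the question's sample rows through read_csv. It makes two assumptions to stay self-contained: io.StringIO stands in for the file f1, and json.loads is passed directly as the converter instead of the CustomParser wrapper.

```python
import io
import json

import pandas as pd

# Sample rows from the question, held in memory instead of a file (assumption).
csv_text = '''name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
'''

# The converter is called on the raw string of every cell in the stats column.
df = pd.read_csv(io.StringIO(csv_text), converters={'stats': json.loads})

first = df['stats'][0]
print(type(first))       # each cell is now a Python dict
print(first['height'])
```

The other columns are parsed as usual; only the column named in converters is handed to the function.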

From here, we can use a little hack to append these columns directly, in one step, with the appropriate column names. This only works for regular data: each JSON object needs to have the same 3 values, or missing values need to be handled in our CustomParser.

df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)

On the left-hand side, we get the new column names from the keys of the first element of the stats column. Each element in the stats column is a dictionary, so we are doing a bulk assignment. On the right-hand side, we break up the 'stats' column using apply, which builds a data frame with one column per key.
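
To see what the right-hand side produces on its own, here is a small sketch that uses two hand-built dicts in place of the parsed column. The variable names are illustrative and not part of the answer.

```python
import pandas as pd

# Stand-in for df['stats'] after parsing: a Series whose values are dicts.
stats = pd.Series([
    {'eye_color': 'brown', 'height': 160, 'weight': 76},
    {'eye_color': 'blue', 'height': 170, 'weight': 85},
])

# apply(pd.Series) turns each dict into a row; the dict keys become column labels.
expanded = stats.apply(pd.Series)
print(expanded.shape)
print(sorted(expanded.columns))
```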

Answered by joris

I think applying json.loads is a good idea, but from there you can simply convert the result directly to dataframe columns instead of writing/loading it again:

import json
import pandas as pd
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist())  # or stdf.apply(pd.Series)

or alternatively in one step:

df.join(df['stats'].apply(json.loads).apply(pd.Series))
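
Since the question mentions doing this on three columns regularly, the same idea can be wrapped in a small function. The sketch below is one possible shape: expand_json_columns is a hypothetical helper name, and the prefs column is invented for illustration.

```python
import json

import pandas as pd


def expand_json_columns(df, columns):
    """Hypothetical helper: replace each JSON-string column with one column per key."""
    out = df.copy()
    for col in columns:
        parsed = out[col].apply(json.loads)
        # Reuse the row index so join() lines the new columns up with the old rows.
        expanded = pd.DataFrame(parsed.tolist(), index=out.index)
        out = out.drop(columns=col).join(expanded)
    return out


df = pd.DataFrame({
    'name': ['john smith', 'dave jones'],
    'stats': ['{"height": 160, "weight": 76}', '{"height": 170, "weight": 85}'],
    'prefs': ['{"color": "red"}', '{"color": "blue"}'],  # invented second JSON column
})

result = expand_json_columns(df, ['stats', 'prefs'])
print(result)
```

Note that join raises an error if two JSON columns share a key name, so real data may need a prefix or suffix per column.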

Answered by abeboparebop

Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)

We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.

def CustomParser(data):
  import json
  j1 = json.loads(data)
  return j1

df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
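
An alternative that does not depend on the ordering of either side, offered as a sketch and not as part of the answer above, is to let pandas match values to columns by label. The data below is hand-built, with the second dict deliberately written in a different key order.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['john smith', 'dave jones'],
    'stats': [
        {'eye_color': 'brown', 'height': 160, 'weight': 76},
        {'weight': 85, 'height': 170, 'eye_color': 'blue'},  # keys in a different order
    ],
})

# Build the new columns from the dicts; values are matched to columns by key.
expanded = pd.DataFrame(df['stats'].tolist(), index=df.index)

# join() aligns on the row index and keeps the labels that `expanded` already carries.
result = df.drop(columns='stats').join(expanded)
print(result)
```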

Answered by Pavan Yadiki

The json_normalize function in the pandas.io.json package helps to do this without using a custom function.

(assuming you are loading the data from a file)

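import json  # stdlib json stands in for ujson, which the original used without importing
import pandas as pd
from pandas.io.json import json_normalize  # on pandas 1.0 and later, use pd.json_normalize

df = pd.read_csv(file_path)  # first row is the header, so the column is named 'stats'
stats_df = json_normalize(df['stats'].apply(json.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(columns='stats', inplace=True)  # remove the raw JSON column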
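
Where json_normalize goes beyond the earlier approaches is nested objects, which it flattens into dotted column names. The sketch below assumes pandas 1.0 or later (for pd.json_normalize) and uses an invented nested variant of the stats data; the question's own data is flat.

```python
import json

import pandas as pd

# Invented nested variant of the stats column, still stored as JSON strings.
raw = pd.Series([
    '{"eye_color": "brown", "size": {"height": 160, "weight": 76}}',
    '{"eye_color": "blue", "size": {"height": 170, "weight": 85}}',
])

# Nested keys are flattened and joined with '.', the default separator.
flat = pd.json_normalize(raw.apply(json.loads).tolist())
print(sorted(flat.columns))
```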

Answered by Glen Thompson

Option 1

If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:

import json
import pandas as pd

df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
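
A round trip makes the pairing of json.dumps and json.loads concrete. This is a sketch under stated assumptions: an in-memory io.StringIO buffer replaces 'data/file.csv', and the one-row frame is invented.

```python
import io
import json

import pandas as pd

original = pd.DataFrame({
    'name': ['john smith'],
    'json_column_name': [{'eye_color': 'brown', 'height': 160}],
})

# Serialize the dicts to JSON text before writing the CSV.
to_write = original.copy()
to_write['json_column_name'] = to_write['json_column_name'].apply(json.dumps)
buffer = io.StringIO()  # stands in for the file on disk
to_write.to_csv(buffer, index=False)

# Read it back, parsing the JSON text into dicts again.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'json_column_name': json.loads})
print(restored['json_column_name'][0])
```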

Option 2

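If you didn't, the column holds the Python repr of each dict instead of JSON, and you might need to use this instead. Note that eval executes whatever code the cell contains, so only use it on files you trust: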

import json
import pandas as pd

df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
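
If the file is not fully trusted, ast.literal_eval from the standard library is a narrower substitute for eval: it only accepts Python literals such as dicts, lists, strings and numbers. The sketch below swaps it in; that substitution is a suggestion and not part of the answer, and the buffer and sample row are assumptions.

```python
import ast
import io

import pandas as pd

original = pd.DataFrame({
    'name': ['john smith'],
    'json_column_name': [{'eye_color': 'brown', 'height': 160}],
})

# Without json.dumps, to_csv writes str(dict): a single-quoted Python repr, not JSON.
buffer = io.StringIO()  # stands in for the file on disk
original.to_csv(buffer, index=False)

# literal_eval parses the repr back into a dict without executing arbitrary code.
buffer.seek(0)
df = pd.read_csv(buffer, converters={'json_column_name': ast.literal_eval})
print(df['json_column_name'][0])
```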

Option 3

For more complicated situations you can write a custom converter like this:

import json
import pandas as pd

def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None


df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
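
Here is a sketch of how such a converter behaves on messy input. It is a variant of parse_column that catches only json.JSONDecodeError instead of every Exception and does not print; the three sample rows (one valid, one empty, one malformed) are invented.

```python
import io
import json

import pandas as pd


def parse_column(data):
    # Variant of the converter above: only JSON parse errors are swallowed.
    try:
        return json.loads(data)
    except json.JSONDecodeError:
        return None


# One valid cell, one empty cell, one malformed cell.
csv_text = 'name,json_column_name\na,"{""height"": 160}"\nb,\nc,not json\n'
df = pd.read_csv(io.StringIO(csv_text), converters={'json_column_name': parse_column})

# Rows whose cell could not be parsed come back as None and can be inspected later.
bad_rows = df[df['json_column_name'].isna()]
print(bad_rows['name'].tolist())
```

Returning None keeps the load from failing on a single bad cell while still leaving the affected rows easy to find.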