Python: extract dictionary value from column in data frame

Notice: this page is taken from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/35711059/

Date: 2020-08-19 16:51:08  Source: igfitidea

Extract dictionary value from column in data frame

python pandas

Asked by michalk

I'm looking for a way to optimize my code.

I have entry data in this form:

import pandas as pn

a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
 {'Feature1': 'aa2','Feature2': 'bb2' },
 {'Feature1': 'aa1','Feature2': 'cc1' }
 ]
b=['num1','num2','num3']


df= pn.DataFrame({'num':b, 'dic':a })

I would like to extract the element 'Feature3' from the dictionaries in column 'dic' (if it exists) in the above data frame. So far I have been able to solve it, but I don't know if this is the fastest way; it seems a little bit overcomplicated.

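# Collect 'Feature3' from each dictionary, or None when the key is absent.
Feature3=[]
for idx, row in df['dic'].items():  # Series.iteritems() was removed in pandas 2.0
    l=row.keys()

    if 'Feature3' in l:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)

df['Feature3']=Feature3
print(df)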

Is there a better/faster/simpler way to extract this Feature3 into a separate column in the dataframe?

Thank you in advance for help.

Answered by Alexander

You can use a list comprehension to extract feature 3 from each row in your dataframe, returning a list.

feature3 = [d.get('Feature3') for d in df.dic]

If 'Feature3' is not in a dictionary, get returns None by default.

You don't even need pandas, as you can again use a list comprehension to extract the feature from your original list of dictionaries a.

feature3 = [d.get('Feature3') for d in a]
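
As a minimal sketch of how that list can go straight back into the data frame as a new column: the data is the question's sample, and the `pd` alias is the usual convention rather than the asker's `pn`.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# dict.get returns None when the key is missing, so no if/else is needed.
df['Feature3'] = [d.get('Feature3') for d in df['dic']]
print(df[['num', 'Feature3']])
```

The list has one entry per row in row order, so it can be assigned positionally without any index alignment.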

Answered by as133

df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))

Agree with maxymoo. Consider changing the format of your dataframe.

(Sidenote: pandas is generally imported as pd)
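
The apply call at the top of this answer assumes every cell in 'dic' is a dictionary. A hedged variant, assuming some cells might be missing (the None cell and the helper name `get_feature3` are made up for illustration and are not part of the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    'num': ['num1', 'num2', 'num3'],
    # Hypothetical data: the last cell is not a dictionary at all.
    'dic': [{'Feature1': 'aa1', 'Feature3': 'cc2'}, {'Feature1': 'aa2'}, None],
})

def get_feature3(cell):
    """Return cell['Feature3'] if cell is a dict that has it, else None."""
    return cell.get('Feature3') if isinstance(cell, dict) else None

df['Feature3'] = df['dic'].apply(get_feature3)
print(df[['num', 'Feature3']])
```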

Answered by Ami Tavory

If you apply a Series, you get quite a nice DataFrame:

>>> df.dic.apply(pn.Series)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

From this point, you can just use regular pandas operations.
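
A rough sketch of what such regular operations might look like once the features are columns. Note that the expanded frame here is built with the DataFrame constructor rather than apply(pd.Series), purely as an assumption to keep the example quick; the names `features`, `has_feature3`, `filled` and `wide` are illustrative.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a})

# One column per dictionary key, sharing the original frame's index.
features = pd.DataFrame(df['dic'].tolist(), index=df.index)

# Ordinary column operations now apply.
has_feature3 = features['Feature3'].notna()        # which rows have the key
filled = features['Feature3'].fillna('missing')    # replace the gaps
wide = df[['num']].join(features)                  # put 'num' next to the features
print(wide[has_feature3])
```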

Answered by maxymoo

I think you're thinking about the data structures slightly wrong. It's better to create the data frame with the features as columns from the start; pandas is actually smart enough to do this by default:

In [240]: pd.DataFrame(a)
Out[240]:
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

You would then add on your "num" column in a separate step, since the data is in a different orientation, either with

df['num'] = b

or

df = df.assign(num = b)

(I prefer the second option since it's got a more functional flavour).
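
Put together, the two steps might look like the following sketch, using the question's sample data and the conventional `pd` alias:

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']

# A list of dictionaries becomes one row per dictionary and one column per key;
# assign() then returns a new frame with the extra 'num' column appended.
df = pd.DataFrame(a).assign(num=b)
print(df)
```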

Answered by jezrael

I think you can first create a new DataFrame by comprehension and then create the new column like:

df1 = pd.DataFrame([x for x in df['dic']])
print(df1)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

df['Feature3'] = df1['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Or one line:

df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN
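
One caveat worth sketching: assigning a column from another frame aligns on index labels, so the one-liner relies on df having the default 0..n-1 index. The example below assumes a made-up index of 10/20/30 (not in the question) to show the difference; the column name 'wrong' is illustrative.

```python
import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
# Hypothetical non-default index, e.g. left over from filtering a larger frame.
df = pd.DataFrame({'num': ['num1', 'num2', 'num3'], 'dic': a}, index=[10, 20, 30])

# The temporary frame gets labels 0, 1, 2, none of which match 10, 20, 30,
# so every value in the assigned column ends up missing.
df['wrong'] = pd.DataFrame(list(df['dic']))['Feature3']

# Reusing the original index keeps the rows lined up.
df['Feature3'] = pd.DataFrame(list(df['dic']), index=df.index)['Feature3']
print(df[['num', 'wrong', 'Feature3']])
```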

Timings:

len(df) = 3:

In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 596 μs per loop

In [25]: %timeit df.dic.apply(pn.Series)
1000 loops, best of 3: 1.43 ms per loop

len(df) = 3000:

In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop

In [28]: %timeit df.dic.apply(pn.Series)
1 loops, best of 3: 748 ms per loop
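
The figures above come from IPython's %timeit on the answerer's machine. For anyone wanting to repeat a comparison outside IPython, here is a rough sketch using the standard timeit module; the function names, the 3000-row size and the repeat count are choices made for this example, and no particular result is implied.

```python
import timeit

import pandas as pd

a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
] * 1000  # 3000 rows, matching the larger case above
df = pd.DataFrame({'dic': a})

def via_constructor():
    # Build a full frame from the dictionaries, then pick one column.
    return pd.DataFrame(list(df['dic']), index=df.index)['Feature3']

def via_apply_get():
    # Call dict.get once per row through Series.apply.
    return df['dic'].apply(lambda d: d.get('Feature3'))

def via_comprehension():
    # Plain Python list comprehension over the column.
    return [d.get('Feature3') for d in df['dic']]

runs = 20
for fn in (via_constructor, via_apply_get, via_comprehension):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```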