Python: extract dictionary values from a column in a DataFrame

Question

Asked by michalk

I'm looking for a way to optimize my code.

My input data has this form:

import pandas as pn

a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
 {'Feature1': 'aa2','Feature2': 'bb2' },
 {'Feature1': 'aa1','Feature2': 'cc1' }
 ]
b=['num1','num2','num3']


df= pn.DataFrame({'num':b, 'dic':a })

I would like to extract the element 'Feature3' from the dictionaries in column 'dic' (where it exists) of the data frame above. So far I have been able to solve it, but I don't know whether this is the fastest way, and it seems a little overcomplicated.

Feature3 = []
for idx, row in df['dic'].items():  # iteritems() was removed in pandas 2.0
    keys = row.keys()

    if 'Feature3' in keys:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)

df['Feature3'] = Feature3
print(df)

Is there a better/faster/simpler way to extract this Feature3 into a separate column in the dataframe?

Thank you in advance for your help.

Answer 1

Answered by Alexander

You can use a list comprehension to extract feature 3 from each row in your dataframe, returning a list.

feature3 = [d.get('Feature3') for d in df.dic]

If 'Feature3' is not a key in a given dictionary, get returns None by default.

You don't even need pandas, as you can again use a list comprehension to extract the feature from your original list of dictionaries, a.

feature3 = [d.get('Feature3') for d in a]
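
As a minimal sketch of how this fits back into the question's data frame (the frame is rebuilt here from the question's data, with pandas imported as pd rather than pn), the list can be assigned directly as a new column. Rows whose dictionary lacks the key end up as missing values:

```python
import pandas as pd

# Data copied from the question.
a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# dict.get returns None when the key is absent, so no if/else is needed.
df['Feature3'] = [d.get('Feature3') for d in df['dic']]

print(df[['num', 'Feature3']])
```

dict.get also accepts a second argument, for example d.get('Feature3', 'missing'), if a placeholder other than None is preferred.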

Answer 2

Answered by as133

df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))

I agree with maxymoo: consider changing the format of your dataframe.

(Sidenote: pandas is generally imported as pd)

Answer 3

Answered by Ami Tavory

If you apply a Series, you get quite a nice DataFrame:

>>> df.dic.apply(pn.Series)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

From this point, you can just use regular pandas operations.
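
One such operation, sketched here under the assumption that the expanded columns should sit next to the original 'num' column, is a plain index-aligned join. The frame is rebuilt from the question's data, and the names expanded and result are only illustrative:

```python
import pandas as pd

# Data copied from the question.
a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# Turn each dictionary into a row; keys become columns, gaps become NaN.
expanded = df['dic'].apply(pd.Series)

# Both frames share the same index, so join lines the rows up.
result = df[['num']].join(expanded)
print(result)
```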

Answer 4

Answered by maxymoo

I think you're thinking about the data structures slightly wrong. It's better to create the data frame with the features as columns from the start; pandas is actually smart enough to do this by default:

In [240]: pd.DataFrame(a)
Out[240]:
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

You would then add on your "num" column in a separate step, since the data is in a different orientation, either with

df['num'] = b

or

df = df.assign(num = b)

(I prefer the second option since it's got a more functional flavour).
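
To make the difference between the two options concrete, here is a small sketch using the question's data (the variable names features and with_num are only illustrative). assign returns a new frame and leaves the original untouched, whereas df['num'] = b modifies the frame in place:

```python
import pandas as pd

# Data copied from the question: a is row-oriented, b is one column.
a = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
     {'Feature1': 'aa2', 'Feature2': 'bb2'},
     {'Feature1': 'aa1', 'Feature2': 'cc1'}]
b = ['num1', 'num2', 'num3']

features = pd.DataFrame(a)          # one column per dictionary key
with_num = features.assign(num=b)   # new frame; features is unchanged

print(with_num)
```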

Answer 5

Answered by jezrael

I think you can first create a new DataFrame with a list comprehension and then create the new column like this:

df1 = pd.DataFrame([x for x in df['dic']])
print(df1)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

df['Feature3'] = df1['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Or in one line:

df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Timings:

len(df) = 3:

In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 596 μs per loop

In [25]: %timeit df.dic.apply(pn.Series)
1000 loops, best of 3: 1.43 ms per loop

len(df) = 3000:

In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop

In [28]: %timeit df.dic.apply(pn.Series)
1 loops, best of 3: 748 ms per loop
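
For anyone who wants to repeat the comparison outside IPython, the following is a rough sketch using the standard timeit module. The 300-row frame, the repeat count, and the function names are assumptions chosen for illustration, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

# The question's three dictionaries repeated to get 300 rows (an arbitrary size).
records = [{'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
           {'Feature1': 'aa2', 'Feature2': 'bb2'},
           {'Feature1': 'aa1', 'Feature2': 'cc1'}] * 100
df = pd.DataFrame({'dic': records})


def via_constructor():
    # Build the frame in one go from the list of dictionaries.
    return pd.DataFrame([x for x in df['dic']])


def via_apply():
    # Build one Series per row, then let pandas stitch them together.
    return df['dic'].apply(pd.Series)


runs = 5
for fn in (via_constructor, via_apply):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1000:.2f} ms per call")
```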
