Accessing dictionary keys and values in a Pandas DataFrame column

Question

Asked by Fabio Lamanna

I have a simple DataFrame in which one column holds Python dictionaries, in the form:

User                          CLang
111                      {u'en': 1}
112            {u'en': 1, u'es': 1}
112            {u'en': 1, u'es': 1}
113  {u'zh': 1, u'ja': 1, u'es': 2}
113  {u'zh': 1, u'ja': 1, u'es': 2}
113  {u'zh': 1, u'ja': 1, u'es': 2}
114                      {u'es': 1}
113  {u'zh': 1, u'ja': 1, u'es': 2}

The CLang column contains the frequency of different values for each user. How can I access the individual keys and values in the CLang column? For instance, I would like to group by User and by the most frequent value inside the dictionary, in a form like:

g = df.groupby(['User','CLang'])

and then count the number of occurrences of each value:

d = g.size().unstack().fillna(0)

The resulting dataframe would appear as:

DLang  en  es
User                 
111     1   0
112     1   1
113     0   4
114     0   1

Answer 1

Accepted answer by paulo.filip3

I'm not completely sure I understood what you want your output to be, and I don't think storing a dict in a pandas.DataFrame is a very good idea in general.

Reshaping your DataFrame into something more pandas-like would be better; you would then be able to use pandas methods to solve this problem.
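
To make "more pandas-like" concrete, here is a rough sketch of one possible reshaping: one row per (user, language, count) instead of one dict per row. The names `users`, `langs`, `long_df`, `Lang` and `Count` are made up for illustration and are not from the question.

```python
import pandas as pd

# Sample data mirroring the question (variable names are illustrative).
users = [111, 112, 112, 113, 113, 113, 114, 113]
langs = [{'en': 1},
         {'en': 1, 'es': 1},
         {'en': 1, 'es': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 1},
         {'es': 2, 'ja': 1, 'zh': 1}]
df = pd.DataFrame({'User': users, 'CLang': langs})

# Flatten every dictionary into (User, Lang, Count) records.
records = [
    {'User': user, 'Lang': lang, 'Count': count}
    for user, d in zip(df['User'], df['CLang'])
    for lang, count in d.items()
]
long_df = pd.DataFrame(records, columns=['User', 'Lang', 'Count'])

# Number of rows in which each language appears, per user.
occurrences = long_df.groupby(['User', 'Lang']).size().unstack(fill_value=0)
print(occurrences)
```

With the data in this long form, ordinary `groupby`, `size` and `unstack` calls are enough, and no Python-level loop over dictionaries is needed after the initial flattening.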

Anyway, if you really want to do it, here's a (not very elegant) way:

In [1]: import pandas as pd
In [2]: l1 = [111, 112, 112, 113, 113, 113, 114, 113]
In [3]: l2 = [{'en': 1},
              {'en': 1, 'es': 1},
              {'en': 1, 'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1}]

In [4]: df = pd.DataFrame({'User': l1, 'CLang': l2})

In [5]: df
Out[5]: 
   User                           CLang
0   111                      {u'en': 1}
1   112            {u'en': 1, u'es': 1}
2   112            {u'en': 1, u'es': 1}
3   113  {u'zh': 1, u'ja': 1, u'es': 2}
4   113  {u'zh': 1, u'ja': 1, u'es': 2}
5   113  {u'zh': 1, u'ja': 1, u'es': 2}
6   114                      {u'es': 1}
7   113  {u'zh': 1, u'ja': 1, u'es': 2}

In [6]: def whatever(row):
  ....:     tmp_d = {}
  ....:     for d in row.values:
  ....:         for k in d.keys():
  ....:             if k in tmp_d.keys():
  ....:                 tmp_d[k] += 1
  ....:             else:
  ....:                 tmp_d[k] = 1
  ....:     return tmp_d

In [7]: new_df = df.groupby('User')['CLang'].apply(whatever).unstack().fillna(0)

In [8]: new_df
Out[8]: 
      en  es  ja  zh
User                
111    1   0   0   0
112    2   2   0   0
113    0   4   4   4
114    0   1   0   0

If you then want to know which CLang had the most occurrences, you can do the following (also not very elegant, since a list in a DataFrame should be avoided):

In [9]: def whatever2(row):                                                  
  ....:     tmp_d = {}
  ....:     for i, v in zip(row.index, row.values):
  ....:         if v in tmp_d.keys():
  ....:             tmp_d[v].append(i)
  ....:         else:
  ....:             tmp_d[v] = [i]
  ....:     highest = max(tmp_d.keys())
  ....:     return tmp_d[highest]

In [10]: new_df['Most_Used_CLang'] = new_df.apply(whatever2, axis=1)
In [11]: new_df
Out[11]:
      en  es  ja  zh Most_Used_CLang
User                                
111    1   0   0   0            [en]
112    2   2   0   0        [en, es]
113    0   4   4   4    [es, ja, zh]
114    0   1   0   0            [es]
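
If storing a list in a cell is the main concern, a possible alternative is to compare each cell with its row maximum and join the tied column names into a plain string. This is only a sketch: `counts`, `is_top` and `most_used` are illustrative names, and `counts` simply re-creates the `new_df` table shown above so the block runs on its own.

```python
import pandas as pd

# Same counts as the new_df table above, rebuilt so this block is self-contained.
counts = pd.DataFrame(
    {'en': [1, 2, 0, 0],
     'es': [0, 2, 4, 1],
     'ja': [0, 0, 4, 0],
     'zh': [0, 0, 4, 0]},
    index=pd.Index([111, 112, 113, 114], name='User'),
)

# True where a cell equals the maximum of its row, so ties are preserved.
is_top = counts.eq(counts.max(axis=1), axis=0)

# Join the names of the tied columns into one string per user.
most_used = is_top.apply(lambda row: ','.join(row[row].index), axis=1)
print(most_used)
```

The boolean mask `is_top` is itself a regular DataFrame, so it can also be used directly for filtering without building any string or list column.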

Answer 2

Answer by fixxxer

This, I guess, would be self-explanatory:

In [413]: x
Out[413]: 
                            CLang  User
0                      {u'en': 1}   111
1            {u'en': 1, u'es': 1}   112
2            {u'en': 1, u'es': 1}   112
3  {u'zh': 1, u'ja': 1, u'es': 2}   113
4  {u'zh': 1, u'ja': 1, u'es': 2}   113
5  {u'zh': 1, u'ja': 1, u'es': 2}   113
6                      {u'es': 1}   114
7  {u'zh': 1, u'ja': 1, u'es': 2}   113

In [414]: paste
temp = pd.DataFrame(index=x['User'])
for i in x.iterrows():
        clan = i[1]['CLang']
        user = i[1]['User']
        for j in clan.iteritems():
                if j[0] not in temp.columns:
                        temp[j[0]] = pd.np.NaN
                        temp[j[0]][user] = j[1]
                else:
                        temp[j[0]][user] += j[1]

print temp

## -- End pasted text --
      en  es  zh  ja
User                
111    1 NaN NaN NaN
112  NaN   2 NaN NaN
112  NaN   2 NaN NaN
113  NaN NaN   4   4
113  NaN NaN   4   4
113  NaN NaN   4   4
114  NaN NaN NaN NaN
113  NaN NaN   4   4

In [415]: temp.reset_index().groupby('User').sum()
Out[415]: 
      en  es  zh  ja
User                
111    1 NaN NaN NaN
112  NaN   4 NaN NaN
113  NaN NaN  16  16
114  NaN NaN NaN NaN

In [416]:
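
The session above is Python 2 and relies on `dict.iteritems`, `pd.np` and the `print` statement, none of which are available in current Python 3 and pandas 2.x. Assuming the intent is to total the stored dictionary values per user (that is a guess about the goal, not something the answer states), a minimal sketch for a current setup could look like this; `expanded` and `totals` are illustrative names.

```python
import pandas as pd

# Same data as the frame x shown above.
x = pd.DataFrame({
    'User': [111, 112, 112, 113, 113, 113, 114, 113],
    'CLang': [{'en': 1},
              {'en': 1, 'es': 1},
              {'en': 1, 'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1}],
})

# Expand each dictionary into its own columns; keys missing from a row become NaN.
expanded = pd.DataFrame(x['CLang'].tolist(),
                        index=pd.Index(x['User'], name='User'))

# Sum the stored values per user; NaN is skipped, so absent keys total 0.
totals = expanded.groupby(level='User').sum()
print(totals)
```

Note that this sums the values stored in the dictionaries, so its numbers differ from the table printed in the session above.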
