Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors. Original: http://stackoverflow.com/questions/29848060/

Access dictionary keys and values in pandas dataframe column

python, pandas

Asked by Fabio Lamanna

I've got a simple dataframe with a column populated by Python dictionaries, in the form:

User                          CLang
111                      {u'en': 1}
112            {u'en': 1, u'es': 1}
112            {u'en': 1, u'es': 1}
113  {u'zh': 1, u'ja': 1, u'es': 2}
113  {u'zh': 1, u'ja': 1, u'es': 2}
113  {u'zh': 1, u'ja': 1, u'es': 2}
114                      {u'es': 1}
113  {u'zh': 1, u'ja': 1, u'es': 2}

The CLang column contains the frequency of different values for each user. How can I access the individual keys and values of the CLang column? For instance, I would like to group by User and by the most frequent value inside the dictionary, in a form like:

g = df.groupby(['User', 'CLang'])

and then count the number of occurrences of each value:

d = g.size().unstack().fillna(0)

The resulting dataframe would appear as:

DLang  en  es
User                 
111     1   0
112     1   1
113     0   4
114     0   1

Accepted answer by paulo.filip3

I'm not completely sure I understood correctly what you want your output to be, and I don't think storing a dict in a pandas.DataFrame is a very good idea in general.

Reshaping your DataFrame into something more pandas-like would be better; you would then be able to use pandas methods to solve this problem.
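As a rough illustration of what a more pandas-like shape could look like, here is a minimal sketch that flattens each dictionary into one row per (User, language, count). The names `long_df`, `Lang`, `Count`, and `counts` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

users = [111, 112, 112, 113, 113, 113, 114, 113]
langs = [{'en': 1},
         {'en': 1, 'es': 1},
         {'en': 1, 'es': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 1},
         {'es': 2, 'ja': 1, 'zh': 1}]
df = pd.DataFrame({'User': users, 'CLang': langs})

# Flatten: one (User, Lang, Count) record per dictionary entry
records = [(user, lang, count)
           for user, d in zip(df['User'], df['CLang'])
           for lang, count in d.items()]
long_df = pd.DataFrame(records, columns=['User', 'Lang', 'Count'])

# With a flat table, ordinary groupby methods apply:
# count how many rows mention each language, per user
counts = long_df.groupby(['User', 'Lang']).size().unstack(fill_value=0)
print(counts)
```

In this layout the dictionary keys and values become regular columns, so filtering, grouping, and aggregating need no custom Python loops.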

Anyway, if you really want to do it, here's a (not very elegant) way:

In [1]: import pandas as pd
In [2]: l1 = [111, 112, 112, 113, 113, 113, 114, 113]
In [3]: l2 = [{'en': 1},
              {'en': 1, 'es': 1},
              {'en': 1, 'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 2, 'ja': 1, 'zh': 1},
              {'es': 1},
              {'es': 2, 'ja': 1, 'zh': 1}]

In [4]: df = pd.DataFrame({'User': l1, 'CLang': l2})

In [5]: df
Out[5]: 
   User                           CLang
0   111                      {u'en': 1}
1   112            {u'en': 1, u'es': 1}
2   112            {u'en': 1, u'es': 1}
3   113  {u'zh': 1, u'ja': 1, u'es': 2}
4   113  {u'zh': 1, u'ja': 1, u'es': 2}
5   113  {u'zh': 1, u'ja': 1, u'es': 2}
6   114                      {u'es': 1}
7   113  {u'zh': 1, u'ja': 1, u'es': 2}

In [6]: def whatever(row):
  ....:     tmp_d = {}
  ....:     for d in row.values:
  ....:         for k in d.keys():
  ....:             if k in tmp_d.keys():
  ....:                 tmp_d[k] += 1
  ....:             else:
  ....:                 tmp_d[k] = 1
  ....:     return tmp_d

In [7]: new_df = df.groupby('User')['CLang'].apply(whatever).unstack().fillna(0)

In [8]: new_df
Out[8]: 
      en  es  ja  zh
User                
111    1   0   0   0
112    2   2   0   0
113    0   4   4   4
114    0   1   0   0
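As an aside on the original question of reaching individual keys and values, a small sketch using `Series.map` on the dictionary column might look like this. The column names `es_count` and `top_lang` are made up for illustration, and ties for the largest value are resolved in favor of the first key in the dictionary.

```python
import pandas as pd

df = pd.DataFrame({
    'User': [111, 112, 113],
    'CLang': [{'en': 1}, {'en': 1, 'es': 1}, {'zh': 1, 'ja': 1, 'es': 2}],
})

# Value stored under one key, with 0 where the key is absent
df['es_count'] = df['CLang'].map(lambda d: d.get('es', 0))

# Key with the largest value in each dictionary (first key wins on ties)
df['top_lang'] = df['CLang'].map(lambda d: max(d, key=d.get))

print(df[['User', 'es_count', 'top_lang']])
```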

If you then want to know which CLang has the most occurrences, you can do the following (also not very elegant, since storing a list in a DataFrame should be avoided):

In [9]: def whatever2(row):                                                  
  ....:     tmp_d = {}
  ....:     for i, v in zip(row.index, row.values):
  ....:         if v in tmp_d.keys():
  ....:             tmp_d[v].append(i)
  ....:         else:
  ....:             tmp_d[v] = [i]
  ....:     highest = max(tmp_d.keys())
  ....:     return tmp_d[highest]

In [10]: new_df['Most_Used_CLang'] = new_df.apply(whatever2, axis=1)
In [11]: new_df
Out[11]:
      en  es  ja  zh Most_Used_CLang
User                                
111    1   0   0   0            [en]
112    2   2   0   0        [en, es]
113    0   4   4   4    [es, ja, zh]
114    0   1   0   0            [es]
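If storing lists in the DataFrame is a concern, one possible alternative is sketched below. It rebuilds the counts table shown above by hand, then marks the columns that reach each row's maximum. The names `is_top`, `most_used`, and `first_top` are illustrative, and joining tied languages into a comma-separated string is just one assumed way of presenting ties.

```python
import pandas as pd

# Counts table with the same values as the output above
new_df = pd.DataFrame(
    {'en': [1, 2, 0, 0], 'es': [0, 2, 4, 1],
     'ja': [0, 0, 4, 0], 'zh': [0, 0, 4, 0]},
    index=pd.Index([111, 112, 113, 114], name='User'),
)

# Boolean mask: True where a column equals its row's maximum
is_top = new_df.eq(new_df.max(axis=1), axis=0)

# Join the names of all tied columns into one string per user
most_used = is_top.apply(lambda row: ','.join(row[row].index), axis=1)

# If a single winner is enough, idxmax returns the first maximal column
first_top = new_df.idxmax(axis=1)

print(most_used)
print(first_top)
```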

Answer by fixxxer

This, I guess, would be self-explanatory:

In [413]: x
Out[413]: 
                            CLang  User
0                      {u'en': 1}   111
1            {u'en': 1, u'es': 1}   112
2            {u'en': 1, u'es': 1}   112
3  {u'zh': 1, u'ja': 1, u'es': 2}   113
4  {u'zh': 1, u'ja': 1, u'es': 2}   113
5  {u'zh': 1, u'ja': 1, u'es': 2}   113
6                      {u'es': 1}   114
7  {u'zh': 1, u'ja': 1, u'es': 2}   113

In [414]: paste
temp = pd.DataFrame(index=x['User'])
for i in x.iterrows():
        clan = i[1]['CLang']
        user = i[1]['User']
        for j in clan.iteritems():
                if j[0] not in temp.columns:
                        temp[j[0]] = pd.np.NaN
                        temp[j[0]][user] = j[1]
                else:
                        temp[j[0]][user] += j[1]

print temp

## -- End pasted text --
      en  es  zh  ja
User                
111    1 NaN NaN NaN
112  NaN   2 NaN NaN
112  NaN   2 NaN NaN
113  NaN NaN   4   4
113  NaN NaN   4   4
113  NaN NaN   4   4
114  NaN NaN NaN NaN
113  NaN NaN   4   4

In [415]: temp.reset_index().groupby('User').sum()
Out[415]: 
      en  es  zh  ja
User                
111    1 NaN NaN NaN
112  NaN   4 NaN NaN
113  NaN NaN  16  16
114  NaN NaN NaN NaN

In [416]:
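The session above relies on Python 2 (`iteritems`, the `print` statement) and on `pd.np`, which recent pandas releases no longer provide. A minimal Python 3 sketch of a related idea, summing each language's dictionary values per user, might look like the following. The names `expanded` and `totals` are illustrative, and the result is not meant to reproduce the output shown above.

```python
import pandas as pd

users = [111, 112, 112, 113, 113, 113, 114, 113]
langs = [{'en': 1},
         {'en': 1, 'es': 1},
         {'en': 1, 'es': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 2, 'ja': 1, 'zh': 1},
         {'es': 1},
         {'es': 2, 'ja': 1, 'zh': 1}]
x = pd.DataFrame({'User': users, 'CLang': langs})

# One column per dictionary key; missing keys become NaN
expanded = pd.DataFrame(x['CLang'].tolist(),
                        index=pd.Index(x['User'], name='User'))

# Sum the values per user; NaN entries are skipped by sum()
totals = expanded.groupby(level='User').sum()
print(totals)
```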