使用重复的索引值旋转 Pandas 数据框
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/29926698/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pivoting a pandas dataframe with duplicate index values
提问by metersk
I have a data frame which has rows for each user joining my site and making a purchase.
我有一个数据框,其中包含每个用户加入我的网站并进行购买的行。
+---+-----+--------------------+---------+--------+-----+
| | uid | msg | _time | gender | age |
+---+-----+--------------------+---------+--------+-----+
| 0 | 1 | confirmed_settings | 1/29/15 | M | 37 |
| 1 | 1 | sale | 4/13/15 | M | 37 |
| 2 | 3 | confirmed_settings | 4/19/15 | M | 35 |
| 3 | 4 | confirmed_settings | 2/21/15 | M | 21 |
| 4 | 5 | confirmed_settings | 3/28/15 | M | 18 |
| 5 | 4 | sale | 3/15/15 | M | 21 |
+---+-----+--------------------+---------+--------+-----+
I would like to change the dataframe so that each row is unique for a uid and there is a columns called saleand confirmed_settingswhich have the timestamp of the action. Note that not every user has a sale, but every user has a confirmed_settings. Like below:
我想更改数据框,以便每行对于 uid 都是唯一的,并且有一个列被调用sale并且confirmed_settings具有操作的时间戳。请注意,并非每个用户都有一个sale,但每个用户都有一个confirmed_settings。像下面这样:
+---+-----+--------------------+---------+---------+--------+-----+
| | uid | confirmed_settings | sale | _time | gender | age |
+---+-----+--------------------+---------+---------+--------+-----+
| 0 | 1 | 1/29/15 | 4/13/15 | 1/29/15 | M | 37 |
| 1 | 3 | 4/19/15 | null | 4/19/15 | M | 35 |
| 2 | 4 | 2/21/15 | 3/15/15 | 2/21/15 | M | 21 |
| 3 | 5 | 3/28/15 | null | 3/28/15 | M | 18 |
+---+-----+--------------------+---------+---------+--------+-----+
To do this, I am trying:
为此,我正在尝试:
df1 = df.pivot(index='uid', columns='msg', values='_time').reset_index()
df1 = df1.merge(df[['uid', 'gender', 'age']].drop_duplicates(), on='uid')
But I get this error: ValueError: Index contains duplicate entries, cannot reshape
但我收到此错误: ValueError: Index contains duplicate entries, cannot reshape
How can I pivot a df with duplicate index values to transform my dataframe?
如何使用重复的索引值旋转 df 以转换我的数据框?
Edit:
df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()
编辑:
df1 = df.pivot_table(index='uid', columns='msg', values='_time').reset_index()
gives this error DataError: No numeric types to aggregatebut im not even sure that is the right path to go on.
给出了这个错误,DataError: No numeric types to aggregate但我什至不确定这是正确的道路。
采纳答案by fixxxer
xis the data frame that you have as input :
x是您作为输入的数据框:
uid msg _time gender age
0 1 confirmed_settings 1/29/15 M 37
1 1 sale 4/13/15 M 37
2 3 confirmed_settings 4/19/15 M 35
3 4 confirmed_settings 2/21/15 M 21
4 5 confirmed_settings 3/28/15 M 18
5 4 sale 3/15/15 M 21
y = x.pivot(index='uid', columns='msg', values='_time')
x.join(y).drop('msg', axis=1)
gives you:
给你:
uid _time gender age confirmed_settings sale
0 1 1/29/15 M 37 NaN NaN
1 1 4/13/15 M 37 1/29/15 4/13/15
2 3 4/19/15 M 35 NaN NaN
3 4 2/21/15 M 21 4/19/15 NaN
4 5 3/28/15 M 18 2/21/15 3/15/15
5 4 3/15/15 M 21 3/28/15 NaN
回答by selwyth
I suspect there are indeed duplicate uid-msgentries/keys (e.g. uid2 has 2 confirmed_settings entries under msg), which you alluded to in the comments for fixxxer's answer. If there are, you can't use pivot, because you can't tell it how to treat the different values encountered during aggregation (count? max? mean? sum?). Note that the Index error is an error on the Index of the resulting pivoted table df1, not the original DataFrame df.
我怀疑确有重复的uid-msg项/关键词(例如uid2具有下2 confirmed_settings项msg),您在fixxxer的答案的评论暗示。如果有,则不能使用pivot,因为您无法告诉它如何处理聚合期间遇到的不同值(计数?最大值?平均值?总和?)。请注意,索引错误是生成df1的数据透视表的索引上的错误,而不是原始 DataFrame 的错误df。
pivot_tablelets you do it however with the aggfuncargument. How about something like this?
pivot_table但是让你用aggfunc参数来做。这样的事情怎么样?
df1 = df.pivot_table(index = 'uid', columns = 'msg', values = '_time', aggfunc = len)
This will help you figure out which user-msg records have duplicate entries (anything with over 1), and after cleaning them out, you can use pivoton dfto successfully pivot _time.
这将帮助您找出哪些 user-msg 记录具有重复条目(任何超过 1 的条目),并且在清除它们之后,您可以使用pivotondf成功进行 pivot _time。
回答by Alexander
You can use groupby to aggregate by the common factors, take the max of time to get the most recent dates, and then unstack the msg to view confirmed_settings and sale side by side:
您可以使用 groupby 按公因数聚合,使用最大时间获取最近的日期,然后将 msg 拆开以查看confirmed_settings 和 sale 并排:
df.groupby(['uid', 'msg', 'gender', 'age']).time.max().unstack('msg')
msg confirmed_settings sale
uid gender age
1 M 37 1/29/15 4/13/15
3 M 35 4/19/15 NaN
4 M 21 2/21/15 3/15/15
5 M 18 3/28/15 NaN

