Notice: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48031445/

Time: 2020-09-14 04:59:14  Source: igfitidea

Pandas merge on two columns using date and another column

Tags: python, pandas, pandasql

Asked by Hobes

Hello, I'm trying to merge two data frames and sum visit counts by date and upc.

  1. Transaction data (date, upc, sales): 200k rows x 3 columns
  2. Visit counts (date, upc, visit count): 2 million+ rows x 3 columns

I've tried this:

df3 = pandas.merge(df1,df2, on = ['upc','date'], how = 'left') 

Result: the merge executes, but it does not sum up by date or upc.

I also tried

df3 = pandas.merge(df1, df2, left_on=['date'], right_on=['upc'], how='left')

and that didn't work.

df3 = pandas.merge(df1, df2, left_on=['date', 'upc'], right_on=['date', 'upc'], how='left')

and that didn't work.

I also tried

df3 = pandas.merge(df1,df2, on = ['date'], how = 'left')

and I kept getting an error message. Based on the error message, it looked like I needed to convert the date column in one of the data frames to a pandas dtype.

I made that change and got the same result as my first try: the merge worked, but it did not sum up the results. I then tried converting the dates in both data frames with astype(str), and that didn't work either. I learned that if both date columns have the same dtype, whether a date dtype or astype(str), I get a memory error and the merge fails.

I was able to merge using the upc only, but this creates an issue in my data: I get duplicate visit numbers, because a upc is repeated in the transaction data across different dates.

At the end of the day, what I need is something similar to the SUMIF function in Excel.

I need to combine the two data sets by summing the total visits for each upc for each day while keeping the transaction data unchanged, i.e. a left join in SQL terms.

Sample data

df1
  Date         upc       sales
0 09/01/2016   A01234    1000
1 09/02/2016   A01234    500
2 09/10/2016   A56789    1200

df2
  Date         upc         visits
0 09/01/2016   A01234      10
1 09/02/2016   A01234      25
2 09/05/2016   A56789      26
3 09/10/2016   A56789      32


df3
  Date         upc       sales   visits
0 09/01/2016   A01234    1000    10
1 09/02/2016   A01234    500     25
2 09/10/2016   A56789    1200    32

I'm wondering if the pandasql package is what I need to use. Any help is appreciated.

Answered by PaSTE

The first merge statement you perform gets you halfway there, but unfortunately, it's the second half of a two-step process. It sounds like you want to merge the visits data onto the sales data after summing the visits by Date/upc. You actually have to do the sum first (the merge command does not do this by itself). Try:

df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()
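To see what this grouping step does, here is a minimal sketch. The visits rows below are made up for illustration (they are not the asker's data) and deliberately repeat one Date/upc pair, which is presumably what the real 2 million row table looks like.

```python
import pandas as pd

# Hypothetical visits data with a repeated Date/upc pair (illustration only)
df2 = pd.DataFrame({
    "Date": ["09/01/2016", "09/01/2016", "09/02/2016"],
    "upc": ["A01234", "A01234", "A01234"],
    "visits": [4, 6, 25],
})

# Collapse to one row per Date/upc pair, summing the visit counts
df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()
print(df2_sum)
```

After this step each Date/upc pair appears only once, so a later merge cannot multiply rows of the sales data.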

Then left-merge this onto the sales data:

df3 = pd.merge(df1, df2_sum, on=["Date", "upc"], how="left")
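Putting both steps together on the sample data from the question might look like the sketch below. Converting Date with pd.to_datetime is an extra step that is not part of the answer above; it assumes the dates are MM/DD/YYYY strings and is only there so both frames share one dtype for the merge key.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016", "09/10/2016"],
    "upc": ["A01234", "A01234", "A56789"],
    "sales": [1000, 500, 1200],
})
df2 = pd.DataFrame({
    "Date": ["09/01/2016", "09/02/2016", "09/05/2016", "09/10/2016"],
    "upc": ["A01234", "A01234", "A56789", "A56789"],
    "visits": [10, 25, 26, 32],
})

# Assumption: dates are MM/DD/YYYY strings; give both frames the same dtype
for frame in (df1, df2):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%m/%d/%Y")

# Step 1: sum visits per Date/upc so each key appears once
df2_sum = df2.groupby(["Date", "upc"])["visits"].sum().reset_index()

# Step 2: left-merge onto the sales data, keeping every transaction row
df3 = pd.merge(df1, df2_sum, on=["Date", "upc"], how="left")
print(df3)
```

Because the visits are aggregated before the merge, df3 has exactly as many rows as df1, and the 09/05/2016 visits row with no matching sale is dropped by the left join.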