vba Excel宏VBA总结重复值然后删除重复记录
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/7160961/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Excel macro VBA to sum up duplicate values and then remove duplicate records
提问by Sunny D'Souza
I am trying to sum up values based on duplicate's found across "A-O" columns. Am using the below macro. There are around 500k+ records and the below macro hangs bad.
我试图根据在“AO”列中发现的重复值来总结值。我正在使用下面的宏。大约有 500k+ 条记录,下面的宏挂起不好。
Sub Formulae(TargetCol1, TargetCol2, ConcatCol, Col1, Col2, StartRow, EndRow, Sheet)
Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Formula = "=SUMIF($" & ConcatCol & "$" & CStr(StartRow) & ":$" & ConcatCol & "$" & CStr(EndRow) & "," & ConcatCol & CStr(StartRow) & ",$" & Col1 & "$" & CStr(StartRow) & ":$" & Col1 & "$" & CStr(EndRow) & ")"
Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Select
Selection.Copy
Sheets(Sheet).Range(TargetCol1 & CStr(EndRow)).Select
Range(Selection, Selection.End(xlUp)).Select
Application.CutCopyMode = False
Selection.FillDown
Call PasteSpecial(TargetCol1, "T", StartRow, EndRow)
Sheets(Sheet).Range(TargetCol2 & CStr(StartRow)).Formula = "=SUMIF($" & ConcatCol & "$" & CStr(StartRow) & ":$" & ConcatCol & "$" & CStr(EndRow) & "," & ConcatCol & CStr(StartRow) & ",$" & Col2 & "$" & CStr(StartRow) & ":$" & Col2 & "$" & CStr(EndRow) & ")"
Sheets(Sheet).Range(TargetCol2 & CStr(StartRow)).Select
Selection.Copy
Sheets(Sheet).Range(TargetCol2 & CStr(EndRow)).Select
Range(Selection, Selection.End(xlUp)).Select
Application.CutCopyMode = False
Selection.FillDown
Call PasteSpecial(TargetCol2, "U", StartRow, EndRow)
End Sub
Sub PasteSpecial(Col1, Col2, StartRow, EndRow)
Range(Col1 & CStr(StartRow)).Select
Range(Selection, Selection.End(xlDown)).Select
Selection.Copy
Range(Col2 & CStr(StartRow)).Select
Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
:=False, Transpose:=False
End Sub
Let me explain the macro in short. I have Columns "A-O" and I have to group them...based on grouping I have to sum columns "P,Q". I have a function that makes a concatenated string out of the 16 columns and stores in "AA" column. Based on this column I use the sumif function to sum all duplicate values
让我简短地解释一下这个宏。我有“AO”列,我必须对它们进行分组……基于分组,我必须对“P、Q”列求和。我有一个函数可以从 16 列中生成一个连接字符串并存储在“AA”列中。基于此列,我使用 sumif 函数对所有重复值求和
=SUMIF($AA:$AA0000,$AA2,$P:$P0000)
=SUMIF($AA:$AA0000,$AA2,$Q:$Q0000)
Then I copy paste special as 'values' the above values to remove the formula, in 2 new cols (pasteSpecial function in above macro code).
然后我复制粘贴特殊作为“值”上述值以删除公式,在 2 个新列中(上面宏代码中的 pasteSpecial 函数)。
Finally I call the remove duplicates to remove the duplicate values
最后我调用删除重复项来删除重复值
I have used the .removeduplicates method which seems to work pretty fast even on such a huge dataset. Is there any predefined function in excel which would even sum the values of the duplicates and then remove the duplicate entries?
我使用了 .removeduplicates 方法,即使在如此庞大的数据集上,它似乎也能很快工作。excel中是否有任何预定义的函数甚至可以对重复项的值求和然后删除重复项?
Sub Remove_Duplicates_In_A_Range(StartRow, EndRow, Sheet, StartCol, EndCol, level)
Sheets(Sheet).Range(StartCol & CStr(StartRow) & ":" & EndCol & CStr(EndRow)).RemoveDuplicates Columns:=20, Header:=xlNo
End Sub
The above logic hangs bad eating all CPU resources and crashing badly...
上面的逻辑很糟糕,吃掉了所有的 CPU 资源并严重崩溃......
Someone please optimize the above macro to make it work with 500k+ records. A performance of 1-2 mins max is acceptable.
有人请优化上述宏以使其能够处理 500k+ 条记录。最多 1-2 分钟的表演是可以接受的。
Please help!!!
请帮忙!!!
EDIT:By 500k+ records I mean A1:O500000. Am supposed to check for duplicates in this manner a combination of A1,B1,C1,D1,E1,F1,G1,H1,I1,J1,K1,L1,M1,N1,O1 with A2,B2,C2,D2,E2,F2,G2,H2,I2,J2,K2,L2,M2,N2,O2 and A3,B3,C3,D3,E3,F3,G3,H3,I3,J3,K3,L3,M3,N3,O3 and so on....till A500000,B500000 etc... .
编辑:通过 500k+ 记录我的意思是 A1:O500000。我应该以这种方式检查重复项 A1,B1,C1,D1,E1,F1,G1,H1,I1,J1,K1,L1,M1,N1,O1 与 A2,B2,C2,D2 的组合, E2、F2、G2、H2、I2、J2、K2、L2、M2、N2、O2和A3、B3、C3、D3、E3、F3、G3、H3、I3、J3、K3、L3、M3、N3、 O3 等等....直到 A500000,B500000 等等....
In short am supposed to check the entire A1-O1 set matches with the entire A2-O2 or A3-O3 or..... A500k-O500k and so on
总之我应该检查整个 A1-O1 集与整个 A2-O2 或 A3-O3 或..... A500k-O500k 等匹配
For every match between the entire A-O recordset I need to sum their respective P,Q columns . Say for example A1-O1 set matched with A2-O2 set then add P1,Q1 and P2,Q2 and store in P1,Q1 or something..
对于整个 AO 记录集之间的每个匹配项,我需要对它们各自的 P,Q 列求和。假设例如 A1-O1 集与 A2-O2 集匹配,然后添加 P1、Q1 和 P2、Q2 并存储在 P1、Q1 或其他东西中。
In either case, I need to retain each original recordset say,A1-O1 with the summed up values of its duplicates and its own in P1,Q1
在任何一种情况下,我都需要保留每个原始记录集,例如,A1-O1 及其重复项的总和值及其在 P1,Q1 中的值
I dont suppose we can attach a demo of the excel sheet here now, can we? :(
我不认为我们现在可以在这里附上 excel 表的演示,可以吗?:(
EDIT2:
编辑2:
Function for replicating sumif formula across all cells
用于在所有单元格中复制 sumif 公式的函数
Sub PreNettingBenefits(StartRow1, EndRow1, StartRow2, EndRow2, Col_Asset, Col_Liab, Src_Col_Asset, Src_Col_Liab, ConcatCol, Src_ConcatCol, level, Sheet2, Sheet1)
'=SUMIF(Sheet1!$AA:$AA336,Sheet2!AA2,Sheet1!$P:$P336)
Application.Calculation = xlCalculationAutomatic
Sheets(Sheet2).Range(Col_Asset & CStr(StartRow2)).Formula = "=SUMIF(" & Sheet1 & "!$" & Src_ConcatCol & "$" & CStr(StartRow1) & ":$" & Src_ConcatCol & "$" & CStr(EndRow1) & "," & Sheet2 & "!" & ConcatCol & CStr(StartRow2) & "," & Sheet1 & "!$" & Src_Col_Asset & "$" & CStr(StartRow1) & ":$" & Src_Col_Asset & "$" & CStr(EndRow1) & ")"
Sheets(Sheet2).Range(Col_Asset & CStr(StartRow2)).Select
Selection.Copy
MsgBox Sheets(Sheet2).Range(Col_Asset & CStr(EndRow2)).Address
Sheets(Sheet2).Range(Col_Asset & CStr(EndRow2)).Select
Range(Col_Asset & CStr(StartRow2) & ":" & Col_Asset & CStr(EndRow2)).Select
Application.CutCopyMode = False
Selection.FillDown
Sheets(Sheet2).Range(Col_Liab & CStr(StartRow2)).Formula = "=SUMIF(" & Sheet1 & "!$" & Src_ConcatCol & "$" & CStr(StartRow1) & ":$" & Src_ConcatCol & "$" & CStr(EndRow1) & "," & Sheet2 & "!" & ConcatCol & CStr(StartRow2) & "," & Sheet1 & "!$" & Src_Col_Liab & "$" & CStr(StartRow1) & ":$" & Src_Col_Liab & "$" & CStr(EndRow1) & ")"
Sheets(Sheet2).Range(Col_Liab & CStr(StartRow2)).Select
Selection.Copy
MsgBox Sheets(Sheet2).Range(Col_Liab & CStr(EndRow2)).Address
Sheets(Sheet2).Range(Col_Liab & CStr(EndRow2)).Select
Range(Col_Liab & CStr(StartRow2) & ":" & Col_Liab & CStr(EndRow2)).Select
Application.CutCopyMode = False
Selection.FillDown
Application.Calculation = xlCalculationManual
End Sub
It hangs pretty bad. Whts the problem in replicating the formula across 30k-40k rows. Could someone please optimise the code?
它挂得很糟糕。在 30k-40k 行之间复制公式有什么问题。有人可以优化代码吗?
回答by aevanko
Something must be terribly wrong with how you are doing the adding of the duplicates. Since you were scant on details of the data you are working with, I don't know if this is the same, but I populated A1:O33334 (over 500k cells) with a random number between 1 and 10,000.
您添加重复项的方式一定是非常错误的。由于您对正在处理的数据的详细信息缺乏了解,我不知道这是否相同,但我使用 1 到 10,000 之间的随机数填充了 A1:O33334(超过 50 万个单元格)。
Using a dictionary object (I am known for my love and over-use of it), I went through all of them and summed only the duplicate values and then slapped the unique list of elements into column A in sheet2.
使用字典对象(我以热爱和过度使用它而闻名),我浏览了所有这些对象,只对重复的值求和,然后将唯一的元素列表放入 sheet2 的 A 列中。
Reasons why a dictionary might be the thing to use:
可能需要使用字典的原因:
- You can weed out duplicates
- You can check if a value exists in the dictionary or not
- You can transpose the unique list easily onto Excel
- 您可以清除重复项
- 您可以检查字典中是否存在某个值
- 您可以轻松地将唯一列表转换到 Excel 中
The dupe checking and addition, and copying the unique cells only takes 2 seconds. Here is the code for your reference.
重复检查和添加以及复制唯一单元格仅需要 2 秒。这是代码供您参考。
Sub test()
Application.ScreenUpdating = False
Dim vArray As Variant
Dim result As Long
Dim dict As Object
Set dict = CreateObject("scripting.dictionary")
vArray = Range("A1:O33334").Value
On Error Resume Next
For i = 1 To UBound(vArray, 1)
For j = 1 To UBound(vArray, 2)
If dict.exists(vArray(i, j)) = False Then
dict.Add vArray(i, j), 1
Else
result = result + vArray(i, j)
End If
Next
Next
Sheet2.Range("a1").Resize(dict.Count).Value = _
Application.Transpose(dict.keys)
Application.ScreenUpdating = True
MsgBox "Total for duplicate cells: " & result & vbLf & _
"Unique cells copied: " & dict.Count
End Sub
回答by JMax
You shouldn't select
every cell when executing code.
select
执行代码时不应该每个单元格。
Btw, if you have a look at your code, some statements are useless:
顺便说一句,如果你看看你的代码,有些语句是没用的:
Sheets(Sheet).Range(TargetCol1 & CStr(StartRow)).Select
Selection.Copy
is never pasted
从不粘贴
For performance issue, see some tips within this thread: Benchmarking VBA Code
对于性能问题,请参阅此线程中的一些提示:Benchmarking VBA Code
回答by Andrew Cowenhoven
The essence of the question, as I understand it, is to find the duplicates and add them up, and then delete them. You also mentioned grouping them but it is not clear how. In any case, I would ditch the macros. Operations on individual rows aren't going to work on that dataset.
据我了解,问题的本质是找到重复项并将它们相加,然后将其删除。您还提到将它们分组,但不清楚如何分组。无论如何,我会放弃宏。对单个行的操作不适用于该数据集。
Here are some steps I would take. Modify them to fit your needs:
以下是我将采取的一些步骤。修改它们以满足您的需求:
Use the concatenate function to create a new column to the right of your dataset. For example
使用连接函数在数据集右侧创建一个新列。例如
=concatenate(a2,b2,c2,d2,e2)
Create a column called Dups and use the following to populate it:
创建一个名为 Dups 的列并使用以下内容填充它:
=if(countif(dataSetNamedRange,aa2)>1,1,0)
In the code above, aa2 refers to the concatenated column for that row. The result of the above is that you now have all dups flagged. Now use the filter tools in the Data menu to create a sort or a filter to fit your grouping needs. To add up the values, use DSum. To delete the dups, use an advanced filter. Good luck.
在上面的代码中,aa2 指的是该行的连接列。上面的结果是您现在标记了所有重复项。现在使用“数据”菜单中的过滤器工具来创建排序或过滤器以满足您的分组需求。要将这些值相加,请使用 DSum。要删除重复项,请使用高级过滤器。祝你好运。
回答by aevanko
I am adding this as a second answer since it's going to get long...
我将其添加为第二个答案,因为它会变得很长......
Becuase I am a stubborn mule, I tried many different things, I think you've reached the limit of what Excel can do. The best function I could come up with was was this, and note I am using 50,000 rows, not your 500,000:
因为我是个固执的骡子,我尝试了很多不同的东西,我想你已经达到了Excel能做的极限。我能想到的最好的函数就是这个,注意我使用的是 50,000 行,而不是你的 500,000:
- 50,000 rows with 100 unique rows, randomly spread: 1m:47s
- 50,000 rows with 50 unique rows, randomly spread: 57s
- 50,000 rows with 25 unique rows, randomly spread: 28s
- 50,000 rows with 10 unique rows, randomly spread: 12s
- 50,000 rows with 5 unique rows, randomly spread: 6s
- 50,000 行,100 个唯一行,随机分布:1m:47s
- 50,000 行,50 个唯一行,随机分布:57s
- 50,000 行,25 行,随机分布:28s
- 50,000 行,10 个唯一行,随机分布:12s
- 50,000 行,5 个唯一行,随机分布:6s
As you can see, the function will deteriorate as the number of unique rows increases. I have a lot of wacky ideas here, so I thought I'd share my code for the sake of research:
如您所见,该函数会随着唯一行数的增加而恶化。我在这里有很多古怪的想法,所以我想我会为了研究而分享我的代码:
- I take the entire range of 750k cells and dump it into a variant array (.2 seconds)
- I dump the P & Q rows into a similar variant array for use later
- I make an array of 50,000 strings (rows) from the variant array (only 1 seconds or so!)
- I say goodbye to the massive variant array to clean up memory
- I start my loop through each row, comparing against all 50k rows...
- If a dupe row is found, it's added to the dupe dictionary so we don't do the same process on that row later
- When the dupe is found, it's P&Q totals are added to the P&Q of the row in question
- After checking all 50k rows, we slap the total into the R column of the row and move on
- If the row has been noted as a dupe in the dupedict, we skip it (evil GoTo beware!)
- 我将整个范围的 750k 单元格转储到一个变体数组中(0.2 秒)
- 我将 P & Q 行转储到类似的变体数组中以备后用
- 我从变体数组中创建了一个包含 50,000 个字符串(行)的数组(仅 1 秒左右!)
- 告别海量变体数组清理内存
- 我开始遍历每一行,与所有 50k 行进行比较......
- 如果找到了 dupe 行,则将其添加到 dupe 字典中,因此我们稍后不会对该行执行相同的处理
- 当发现欺骗时,它的 P&Q 总计被添加到相关行的 P&Q
- 检查所有 50k 行后,我们将总数放入该行的 R 列并继续
- 如果该行在 dupedict 中被标记为 dupe,我们跳过它(邪恶的 GoTo 小心!)
Sub test()
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Dim rowArray As Variant
Dim totalArray As Variant
Dim i As Long, j As Long
Dim dupeDict As Object
Set dupeDict = CreateObject("scripting.dictionary")
Dim count As Long
Dim rowData() As String
'dump the cells into an single array
rowArray = Range("A1:O50000").Value
'grab totals from P and Q to keep them seperate
totalArray = Range("P1:Q50000").Value
'create strings for each row
ReDim rowData(1 To 50000)
'create a string for each row
For i = 1 To 50000
For j = 1 To 15
rowData(i) = rowData(i) & rowArray(i, j)
Next
Next
'free up that memory
Set rowArray = Nothing
'check all rows, total P & Q if match
On Error Resume Next
For i = 1 To 50000
'skip row and move to next if we've seen it
If dupeDict.exists(i) = True Then
GoTo Dupe
End If
count = 0
For j = 1 To 50000
If rowData(i) = rowData(j) Then
dupeDict.Add j, 1 'add that sucker to the dupe dict
count = count + totalArray(j, 1) + totalArray(j, 2)
End If
'enter final total in column R
Cells(i, 18).Value = count
Next
Dupe:
Next
Application.ScreenUpdating = True
Application.Calculation = xlCalculationAutomatic
End Sub