git push is very slow for a branch

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me) on StackOverflow.

Original URL: http://stackoverflow.com/questions/29118876/
Asked by grahamrhay
We have a git repo that is quite large (iOS app resources). I appreciate that git is going to be slow when working with it, but if I create a new branch, edit a couple of files (not binary ones) and push, it takes forever.
It feels like the entire repo is being pushed. I was under the impression that git would only send the diff; is that wrong? (I know git stores compressed versions of the whole file; I mean the diff between my branch and where I branched from.)
If I run git diff --stat --cached origin/foo then I see a short list of files that looks like what I would expect, e.g. 34 files changed, 1117 insertions(+), 72 deletions(-). But when I push it gets to "Writing objects: 21% (2317/10804)" and grinds to a halt, as if it's pushing all 2.4GB of binary data.
Am I missing something (I've googled it pretty hard)? Is this the expected behaviour? I'm using git 2.2.2 on OS X (Mavericks), and ssh (git@github.com).
I found a similar question here: Git - pushing a remote branch for a large project is really slow, but no real answers.
Answered by torek
You're using a "smart" transport (this is a good thing), so you do get deltas, or more specifically, "delta compression". But that's not to say that git pushes diffs.
Both push and fetch work the same way here: on a smart transport, your git calls up the remote and both ends have a mini conversation to figure out who has which repository objects, identified by SHA-1 and attached to specific labels (typically branch and tag names although other labels are allowed as well).
For instance, in this case, your git calls up theirs and says: "I propose to have you set your branch master to SHA-1 1234567.... I see that your master is currently 333333..., here's what I think you need to get from there to 7777777...." Theirs should reply with "ok, I need some of those but I already have ...". Once your git has figured out what needs to be sent, and what is already present, your git builds a "thin pack"[1] containing all the to-be-sent objects. (This is the "delta compressing using up to %d threads" phase.)
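If you're curious, you can watch that conversation yourself. This is a generic diagnostic, not part of the original answer: git's standard GIT_TRACE_PACKET environment variable dumps the raw protocol exchange, including the have/want negotiation.

$ GIT_TRACE_PACKET=1 git push origin master   # logs each protocol packet line to stderr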
The resulting thin pack is then sent over the smart transport; this is where you see the "writing objects" messages. (The entire thin pack must be sent successfully, after which the receiver "fattens it up" again using git index-pack --fix-thin and drops it into the repository.)
Exactly what data is sent depends on the objects in the thin pack. That should be just the set of commits between "what they have" and "what you're sending", plus any objects (trees and blobs) needed for those commits, plus any annotated tags you're sending and any objects needed for those, that they don't already have.
You can find the commits in question by using git fetch to pick up their latest information, then using git rev-list to see what commits you'd send them. For instance, if you're just going to push things on master:
$ git fetch origin # assuming the remote name is origin
[wait for it to finish]
$ git rev-list origin/master..master
Examining these commits may show a very large binary file that is contained in one of the middle ones, then removed again in a later commit:
$ git log --name-status origin/master..master
If one commit has A giantfile.bin and then a subsequent (probably listed first in git log output) commit has D giantfile.bin, you're probably getting hung up sending the blob for giantfile.bin.
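If the file names alone don't make the culprit obvious, you can also list every object that would be sent, together with its size. This recipe is built from stock git plumbing rather than taken from the original answer; origin/master..master is just the same example range as above:

$ git rev-list --objects origin/master..master |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' | sort -k3 -n | tail   # the ten largest blobs, biggest last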
If that's the case, you can use git rebase -i to eliminate the commit that adds the giant binary file, so that git push won't have to send that commit.
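A minimal sketch of that rebase, assuming the unwanted commit is on your current branch and origin/master is where you branched from:

$ git rebase -i origin/master
# in the editor, change "pick" to "drop" on the line for the commit that
# added giantfile.bin (or mark it "edit" and delete the file), then save
# and exit; git rewrites the remaining commits without that blob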
(If your history is linear, with no merges to push, then you can also, or instead, use git format-patch to create a series of email messages that contain patches. These are suitable for emailing to someone at the other site; not that there's someone at github waiting to receive them, but you can easily examine the patch files to see if any of them are enormous.)
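For example (a sketch, again assuming origin/master is the upstream):

$ git format-patch origin/master   # writes one .patch file per outgoing commit
$ ls -lhS *.patch                  # sorted by size; a huge patch marks a huge commit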
[1] The pack is "thin" in that it violates a normal pack-file rule that requires any delta-compression "downstream" object to be in the pack itself. Instead, the "downstream" objects can (in fact, must) be in the repository receiving the thin pack.
Answered by VonC
Note that Git 2.25 fixes an extreme slowdown in pack-objects when you have more than 1023 packs. See below for numbers.
That might have a positive influence on your case, where you have a large number of pack files.
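You can check how many packs a repository currently has, and collapse them back into one, with stock git commands (a generic diagnostic, not part of the original answer):

$ git count-objects -v   # the "packs:" line is the current pack count
$ git repack -a -d       # repack everything into one pack, dropping the old ones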
See commit f66e040 (11 Nov 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 8faff38, 01 Dec 2019)
pack-objects: avoid pointless oe_map_new_pack() calls

Signed-off-by: Jeff King
Reviewed-by: Derrick Stolee

Since 43fa44fa3b (pack-objects: move in_pack out of struct object_entry, 2018-04-14), we use a complicated system to save some per-object memory.

Each object_entry struct gets a 10-bit field to store the index of the pack it's in. We map those indices into pointers using packing_data->in_pack_by_idx, which we initialize at the start of the program. If we have 2^10 or more packs, then we instead create an array of pack pointers, one per object. This is packing_data->in_pack.

So far so good. But there's one other tricky case: if a new pack arrives after we've initialized in_pack_by_idx, it won't have an index yet. We solve that by calling oe_map_new_pack(), which just switches on the fly to the less-optimal in_pack mechanism, allocating the array and back-filling it for already-seen objects.

But that logic kicks in even when we've switched to it already (whether because we really did see a new pack, or because we had too many packs in the first place). The result doesn't produce a wrong outcome, but it's very slow. What happens is this:

- imagine you have a repo with 500k objects and 2000 packs that you want to repack.

- before looking at any objects, we call prepare_in_pack_by_idx(). It starts allocating an index for each pack. On the 1024th pack, it sees there are too many, so it bails, leaving in_pack_by_idx as NULL.

- while actually adding objects to the packing list, we call oe_set_in_pack(), which checks whether the pack already has an index. If it's one of the packs after the first 1023, then it doesn't have one, and we'll call oe_map_new_pack(). But there's no useful work for that function to do. We're already using in_pack, so it just uselessly walks over the complete list of objects, trying to backfill in_pack.

And we end up doing this for almost 1000 packs (each of which may be triggered by more than one object). And each time it triggers, we may iterate over up to 500k objects. So in the absolute worst case, this is quadratic in the number of objects.

The solution is simple: we don't need to bother checking whether the pack has an index if we've already converted to using in_pack, since by definition we're not going to use it. So we can just push the "does the pack have a valid index" check down into that half of the conditional, where we know we're going to use it.

The current test in p5303 sadly doesn't notice this problem, since it maxes out at 1000 packs. If we add a new test to it at 2000 packs, it does show the improvement:

  Test                     HEAD^               HEAD
  ----------------------------------------------------------------------
  5303.12: repack (2000)   26.72(39.68+0.67)   15.70(28.70+0.66) -41.2%

However, these many-pack test cases are rather expensive to run, so adding larger and larger numbers isn't appealing. Instead, we can show it off more easily by using GIT_TEST_FULL_IN_PACK_ARRAY, which forces us into the absolute worst case: no pack has an index, so we'll trigger oe_map_new_pack() pointlessly for every single object, making it truly quadratic.

Here are the numbers (on git.git) with the included change to p5303:

  Test                      HEAD^               HEAD
  ----------------------------------------------------------------------
  5303.3: rev-list (1)      2.05(1.98+0.06)     2.06(1.99+0.06) +0.5%
  5303.4: repack (1)        33.45(33.46+0.19)   2.75(2.73+0.22) -91.8%
  5303.6: rev-list (50)     2.07(2.01+0.06)     2.06(2.01+0.05) -0.5%
  5303.7: repack (50)       34.21(35.18+0.16)   3.49(4.50+0.12) -89.8%
  5303.9: rev-list (1000)   2.87(2.78+0.08)     2.88(2.80+0.07) +0.3%
  5303.10: repack (1000)    41.26(51.30+0.47)   10.75(20.75+0.44) -73.9%

Again, those improvements aren't realistic for the 1-pack case (because in the real world, the full-array solution doesn't kick in), but it's more useful to be testing the more-complicated code path.

While we're looking at this issue, we'll tweak one more thing: in oe_map_new_pack(), we call REALLOC_ARRAY(pack->in_pack). But we'd never expect to get here unless we're back-filling it for the first time, in which case it would be NULL. So let's switch that to ALLOC_ARRAY() for clarity, and add a BUG() to document the expectation. Unfortunately this code isn't well-covered in the test suite because it's inherently racy (it only kicks in if somebody else adds a new pack while we're in the middle of repacking).