git is very very slow when tracking large binary files

Disclaimer: This page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/3055506/

git

Asked by Nick Vanderbilt

My project is six months old and git is very very slow. We track around 30 files which are of size 5 MB to 50 MB. Those are binary files and we keep them in git. I believe those files are making git slow.

Is there a way to kill all files of size > 5 MB from the repository? I know I would lose all of these files, and that is okay with me.

Ideally I would like a command that would list all the big files (> 5 MB). I could then look at the list and say: okay, go ahead, delete those files and make git faster.

I should mention that git is slow not only on my machine; deploying the app to the staging environment now takes around 3 hours.

So the fix should be something that affects the server, not only the users of the repository.

Answered by kubi

Do you garbage collect?

git gc

This makes a significant difference in speed, even for small repos.
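
To see how much gc actually helps, you can compare the repository's object statistics before and after running it; git count-objects is a standard command for exactly this:

git count-objects -v   # note "count" (loose objects) and "size-pack" (pack size in KiB)
git gc
git count-objects -v   # loose objects should now be folded into packs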

Answered by Andres Jaan Tack

Explanation

Git is really good at huge histories of small text files because it can store them and their changes efficiently. At the same time, git is very bad at binary files, and will naïvely store separate copies of the file (by default, at least). The repository gets huge, and then it gets slow, as you've observed.

This is a common problem among DVCSs, exacerbated by the fact that you download every version of every file ("the whole repository") every time you clone. The guys at Kiln are working on a plugin to treat these large files more like Subversion does, downloading historical versions only on demand.

Solution

This command will list all files under the current directory that are larger than 5 MB.

find . -size +5000000c 2>/dev/null -exec ls -l {} \;

If you want to remove the files from the entire history of the repository, you can use this idea with git filter-branch to walk the history and get rid of all traces of large files. After doing this, all new clones of the repository will be leaner. If you want to slim down a repository without cloning it, you'll find directions on the man page (see "Checklist for Shrinking a Repository").

git filter-branch --index-filter \
    'find . -size +5000000c 2>/dev/null -exec git rm --cached --ignore-unmatch {} \;'

A word of warning: this will make your repository incompatible with other clones, because the trees and indices have different files checked in; you won't be able to push or pull from them anymore.
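
After a rewrite like this, the old objects are still reachable through the reflog and the backup refs that filter-branch leaves behind, so the repository won't shrink by itself. A sketch of the follow-up steps from that same "Checklist for Shrinking a Repository" (the refs/original path below assumes the rewritten branch was master):

git update-ref -d refs/original/refs/heads/master   # drop filter-branch's backup ref
git reflog expire --expire=now --all                # discard old reflog entries
git gc --prune=now --aggressive                     # repack and delete unreachable objects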

Answered by John

Here is a censored revision intended to be less negative and inflammatory:

Git has a well-known weakness when it comes to files that are not line-by-line text files. There is currently no solution, and no plans announced by the core git team to address this. There are workarounds if your project is small, say, 100 MB or so. There exist branches of the git project to address this scalability issue, but these branches are not mature at this time. Some other revision control systems do not have this specific issue. You should consider this issue as just one of many factors when deciding whether to select git as your revision control system.

Answered by martin

There is nothing specific about binary files in the way git handles them. When you add a file to a git repository, a header is added and the file is compressed with zlib and named after its SHA1 hash. This is exactly the same regardless of file type. There is nothing in zlib compression that makes it problematic for binary files.
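
You can watch this happen with git's plumbing commands; a small illustration (run inside any git repository, with demo.txt being a throwaway file):

echo 'hello' > demo.txt
git hash-object -w demo.txt   # zlib-compresses and stores the blob, printing its SHA1
# the loose object now lives at .git/objects/<first 2 hex digits>/<remaining 38 digits>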

But at some points (pushing, gc), git starts to look at the possibility of delta-compressing content. If git finds files that are similar (filename etc.), it puts them in RAM and starts to compress them together. If you have 100 files and each of them is, say, 50 MB, it will try to put 5 GB in memory at the same time, plus some more on top to make things work. Your computer may not have this amount of RAM, and it starts to swap. The process takes time.

You can limit the depth of the delta compression so that the process doesn't use as much memory, but the result is less efficient compression (see core.bigFileThreshold, the delta attribute, pack.window, pack.depth, pack.windowMemory, etc.).
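
For example, a plausible set of limits for a repository full of large binaries might look like this (the 50m/100m/10 values are purely illustrative, not recommendations):

git config core.bigFileThreshold 50m    # files above this size skip delta compression entirely
git config pack.windowMemory 100m       # cap the memory used for the delta search window
git config pack.depth 10                # allow only shallow delta chains
echo '*.psd -delta' >> .gitattributes   # or switch off delta compression per file type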

So there are lots of things you can do to make git work very well with large files.

Answered by David

One way of speeding things up is to use the --depth 1 flag. See the man page for details. I am not a great git guru, but I believe this does the equivalent of a p4 get or an svn get; that is, it gives you only the latest files instead of "give me all of the revisions of all the files throughout all time", which is what git clone does.
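
A minimal example (the URL is a placeholder):

git clone --depth 1 https://example.com/big-repo.git

Bear in mind that shallow clones have historically come with restrictions on fetching and pushing, so check the git-clone man page for your version before relying on one.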

Answered by David I.

You can also consider BFG Repo-Cleaner as a faster, easier way to clean up large files.

https://rtyley.github.io/bfg-repo-cleaner/
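
To match the question's 5 MB threshold, an invocation would look roughly like this (assuming the downloaded jar is saved as bfg.jar, with a placeholder repository URL):

git clone --mirror https://example.com/my-repo.git   # BFG operates on a bare mirror clone
java -jar bfg.jar --strip-blobs-bigger-than 5M my-repo.git
cd my-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive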

Answered by sml

Have you told git those files are binary?

e.g. by adding *.ext binary to your repository's .gitattributes
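
A concrete sketch; the extensions here are placeholders for whatever binary types you actually track:

# .gitattributes
*.psd binary
*.bmp binary
# "binary" is a built-in macro attribute, shorthand for: -diff -merge -text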

Answered by martin

I have been running Git since 2008, both on Windows and GNU/Linux, and most of the files I track are binary files. Some of my repos are several GB and contain JPEGs and other media. I have many computers both at home and at work running Git.

I have never had the symptoms that are described in the original post. But just a couple of weeks ago I installed MsysGit on an old Win-XP laptop and almost whatever I did, it brought git to a halt. Even a test with just two or three small text files was ridiculously slow. We are talking about 10 minutes to add a file of less than 1 KB... it seems like the git processes stayed alive forever. Everything else worked as expected on this computer.
I downgraded from the latest version to 1.6-something and the problems were gone...
I have other laptops of the same brand, also with Win-XP installed by the same IT department from the same image, where git works fine regardless of version... so there must be something odd with that particular computer.

I have also done some tests with binary files and compression. If you have a BMP picture, make small changes to it, and commit them, git gc will compress it very well. So my conclusion is that the compression does not depend on whether the files are binary or not.
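
That experiment is easy to repeat; a rough sketch (assumes a POSIX shell, uses placeholder paths, and appends a byte as a stand-in for a "small change"):

git init bmp-test && cd bmp-test
cp /path/to/picture.bmp .
git add . && git commit -m 'v1'
printf 'x' >> picture.bmp   # tiny change to the binary
git commit -am 'v2'
git gc
git count-objects -v        # size-pack should be well under twice the image size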

Answered by joshlrogers

Just set the files up to be ignored. See the link below:

http://help.github.com/git-ignore/
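
For example, to keep big media out of the repository going forward (the patterns here are placeholders):

# .gitignore
*.mp4
*.psd
assets/large/

Note that .gitignore only affects untracked files; anything already committed stays in history until you remove it with git rm --cached and, for past revisions, one of the history-rewriting approaches described above.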

Answered by John

That's because git isn't scalable.

This is a serious limitation in git that is drowned out by git advocacy. Search the git mailing lists and you'll find hundreds of users wondering why just a meager 100 MB of images (say, for a web site or application) brings git to its knees. The problem appears to be that nearly all of git relies on an optimization they refer to as "packing". Unfortunately, packing is inefficient for all but the smallest text files (i.e., source code). Worse, it grows less and less efficient as the history increases.

It's really an embarrassing flaw in git, which is touted as "fast" (despite lack of evidence), and the git developers are well aware of it. Why haven't they fixed it? You'll find responses on the git mailing list from git developers who won't recognize the problem because Photoshop documents (*.psd) are a proprietary format. Yes, it's really that bad.

Here's the upshot:

Use git for tiny, source-code-only projects for which you don't feel like setting up a separate repo. Or for small source-code-only projects where you want to take advantage of git's copy-the-entire-repo model of decentralized development. Or when you simply want to learn a new tool. All of these are good reasons to use git, and it's always fun to learn new tools.

Don't use git if you have a large code base, binaries, huge history, etc. Just one of our repos is a TB. Git can't handle it. VSS, CVS, and SVN handle it just fine. (SVN bloats up, though.)

Also, give git time to mature. It's still immature, yet it has a lot of momentum. In time, I think the practical nature of Linus will overcome the OSS purists, and git will eventually be usable in the larger field.
