How do different version control systems handle binary files?
Original question: http://stackoverflow.com/questions/6598700/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by Tower
I have heard some claims that SVN handles binary files better than Git/Mercurial. Is this true and if so then why? As far as I can imagine, no version control system (VCS) can diff and merge changes between two revisions of the same binary resources.
So, aren't all VCS's bad at handling binary files? I am not very aware of the technical details behind particular VCS implementations so maybe they have some pros and cons.
Accepted answer by VonC
The main pain point is the "distributed" aspect of any DVCS: you are cloning everything (the full history of all files).
Since most binaries aren't stored as deltas, and don't compress as well as text files, storing rapidly evolving binaries quickly leaves you with a large repository that becomes cumbersome to move around (push/pull).
For Git, for instance, see "What are the git limits?".
Binaries aren't a good fit for the features a VCS can bring (diff, branch, merge), and are better managed in an artifact repository (like Nexus, for example).
This is not necessarily the case for a CVCS (Centralized VCS), where the repository could play that role and serve as storage for binaries (even if that is not its primary role).
Answer by martin
One clarification about git and binary files.
Git compresses binary files as well as text files. So Git is not crap at handling binary files, as someone suggested.
Any file that Git adds is compressed into a loose object. It doesn't matter whether it is binary or text. If you have a binary or text file and you commit it, the repository will grow. If you make a minor change to the file and commit again, your repository will grow again by approximately the same amount, depending on the compression ratio.
Then you run git gc. Git will find similarities between the binary or text files and compress them together. You will get good compression if the similarities are large. If, on the other hand, there are no similarities between the files, you will not gain much from compressing them together compared to compressing them individually.
Here is a test with a bit-mapped picture (binary) that I changed a little:
martin@martin-laptop:~/testing123$ git init
Initialized empty Git repository in /home/martin/testing123/.git/
martin@martin-laptop:~/testing123$ ls -l
total 1252
-rw------- 1 martin martin 1279322 Jan 8 22:42 pic.bmp
martin@martin-laptop:~/testing123$ git add .
martin@martin-laptop:~/testing123$ git commit -a -m first
[master (root-commit) 53886cf] first
1 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 pic.bmp
// here is the size:
martin@martin-laptop:~/testing123$ du -s .git
1244 .git
// Changed a few pixels in the picture
martin@martin-laptop:~/testing123$ git add .
martin@martin-laptop:~/testing123$ git commit -a -m second
[master da025e1] second
1 files changed, 0 insertions(+), 0 deletions(-)
// here is the size:
martin@martin-laptop:~/testing123$ du -s .git
2364 .git
// As you can see the repo is twice as large
// Now we run git gc to compress
martin@martin-laptop:~/testing123$ git gc
Counting objects: 6, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (6/6), done.
Total 6 (delta 1), reused 0 (delta 0)
// here is the size after compression:
martin@martin-laptop:~/testing123$ du -s .git
1236 .git
// we are back to a smaller size than ever...
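The effect seen in the session above can be imitated outside Git. What follows is a minimal Python sketch, not Git's actual packing algorithm: it uses plain zlib, random bytes stand in for the bitmap, and the sizes, the seed, and the helper name compressed_size are assumptions made for illustration. Compressing two nearly identical blobs separately costs about twice as much as one of them, while compressing them together lets the second copy be encoded mostly as references to the first. For unrelated blobs there is essentially nothing to gain.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data."""
    return len(zlib.compress(data))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Stand-in for pic.bmp: 16 KiB of random bytes, small enough that both
# versions fit inside zlib's 32 KiB look-back window.
original = bytes(rng.getrandbits(8) for _ in range(16 * 1024))

# "Changed a few pixels": flip three bytes of a copy.
edited_buffer = bytearray(original)
for offset in (100, 5000, 12000):
    edited_buffer[offset] ^= 0xFF
edited = bytes(edited_buffer)

# An unrelated file of the same size.
unrelated = bytes(rng.getrandbits(8) for _ in range(16 * 1024))

separate_similar = compressed_size(original) + compressed_size(edited)
together_similar = compressed_size(original + edited)
separate_unrelated = compressed_size(original) + compressed_size(unrelated)
together_unrelated = compressed_size(original + unrelated)

print("similar,   separately:", separate_similar)
print("similar,   together:  ", together_similar)
print("unrelated, separately:", separate_unrelated)
print("unrelated, together:  ", together_unrelated)
```

The 16 KiB size is deliberate: zlib can only refer back 32 KiB, so a larger stand-in would hide the similarity from it. Git's delta compression during git gc does not have that particular limit.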
Answer by David W.
Git and Mercurial both handle binary files with aplomb. They don't corrupt them, and you can check them in and out. The problem is one of size.
Source usually takes up less room than binary files. You might have 100 KB of source files that build a 100 MB binary. Thus, storing a single build in my repository could cause it to grow to roughly a thousand times its size.
And it's even worse:
Version control systems usually store files via some form of diff format. Let's say I have a file of 100 lines and each line averages about 40 characters. That entire file is 4K in size. If I change a line in that file, and save that change, I'm only adding about 60 bytes to the size of my repository.
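The 100-line example can be sketched in Python. This is only an illustration under stated assumptions: it uses difflib's unified diff as a stand-in for "some form of diff format", whereas real version control systems use their own storage formats, and the line contents and names below are made up. The exact byte count therefore differs from the rough 60 bytes quoted above, but the order of magnitude is the point.

```python
import difflib

# A 100-line file whose lines are 40 bytes each (39 characters plus the newline).
old_lines = [f"line {i:03d} ".ljust(39, "x") + "\n" for i in range(100)]

# Change a single line.
new_lines = list(old_lines)
new_lines[49] = "line 049 changed ".ljust(39, "y") + "\n"

# Unified diff with no context lines: only the changed line is recorded.
delta = "".join(difflib.unified_diff(old_lines, new_lines, "old", "new", n=0))

file_size = len("".join(old_lines).encode())
delta_size = len(delta.encode())

print("full file:", file_size, "bytes")
print("delta:    ", delta_size, "bytes")
```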
Now, let's say I compiled and added that 100 MB file. I make a change in my source (maybe 10 KB or so in changes), recompile, and store the new binary build. Well, binaries don't usually diff very well, so it's very likely I'm adding another 100 MB to my repository. Do a few builds, and my repository grows to several gigabytes in size, yet the source portion of my repository is still only a few dozen kilobytes.
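The arithmetic behind that growth is easy to sketch. The function below is a back-of-the-envelope estimate, not a measurement: it assumes every rebuilt binary is stored in full (no useful delta) and every source change costs about 10 KB, using the figures from the paragraph above. The function name and defaults are illustrative.

```python
def estimate_growth(builds: int, binary_mb: float = 100.0, source_change_kb: float = 10.0):
    """Return (binary_total_mb, source_total_kb) added after the given number of builds.

    Assumes each rebuilt binary is stored in full and each source change is
    stored as a delta of about source_change_kb.
    """
    return builds * binary_mb, builds * source_change_kb


for builds in (1, 10, 30):
    binary_mb, source_kb = estimate_growth(builds)
    print(f"{builds:2d} builds: binaries {binary_mb / 1024:.2f} GB, source {source_kb:.0f} KB")
```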
The problem with Git and Mercurial is that you normally check out the entire repository onto your system. Instead of merely downloading a few dozen kilobytes that can be transferred in a few seconds, I am now downloading several gigabytes of builds along with the few dozen kilobytes of data.
Maybe people say Subversion is better since I can simply check out the version I want in Subversion and not download the whole repository. However, Subversion doesn't give you an easy way to remove obsolete binaries from your repository, so your repository will grow and grow anyway. I still don't recommend it. Heck, I wouldn't recommend it even if the revision control system does allow you to remove old revisions of obsolete binaries (Perforce, ClearCase, and CVS all do). It just ends up being a big maintenance headache.
Now, this isn't to say you shouldn't store any binary files. For example, if I am making a web page, I probably have some GIFs and JPEGs that I need. No problem storing those in either Subversion or Git/Mercurial. They're relatively small, and probably change a lot less than my code itself.
What you shouldn't store are built objects. These should be stored in a release repository and fetched as needed. Maven and Ant w/ Ivy do a great job of this. And you can use the Maven repository structure in C, C++, and C# projects too.
Answer by robert
In Subversion you can lock binary files so that nobody else can modify them while you hold the lock. Distributed VCSs don't (and can't) have locks: there's no central repository for them to be registered at.
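To see why locking leans on a central place, here is a small Python sketch of a lock table. It is a hypothetical illustration, not Subversion's implementation: the class LockRegistry, its methods, and the file and user names are all invented. The scheme only works if every user talks to the same single registry, which is the piece a purely distributed setup lacks.

```python
class LockRegistry:
    """Hypothetical central lock table: maps a file path to the user holding its lock."""

    def __init__(self):
        self._owners = {}

    def lock(self, path: str, user: str) -> bool:
        """Grant the lock if the path is free or already held by this user."""
        owner = self._owners.setdefault(path, user)
        return owner == user

    def unlock(self, path: str, user: str) -> bool:
        """Release the lock only if this user holds it."""
        if self._owners.get(path) == user:
            del self._owners[path]
            return True
        return False


registry = LockRegistry()  # the single shared, server-side instance
print(registry.lock("logo.psd", "alice"))    # alice takes the lock
print(registry.lock("logo.psd", "bob"))      # bob is refused while alice holds it
print(registry.unlock("logo.psd", "alice"))  # alice releases it
print(registry.lock("logo.psd", "bob"))      # now bob can take it
```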
Answer by jforberg
Text files have a natural line-oriented structure that binary files lack. This is why it's harder to compare them using common text tools (diff). While it should be possible, the advantage of readability (the reason we use text as our preferred format in the first place) would be lost when applying diffs to binary files.
As to your suggestion that all version control systems "are crap at handling binary files", I don't know. In principle, there's no reason why a binary file should be slower to process. I would rather say that the advantages of using a VCS (tracking, diffs, overview) are more apparent when handling text files.