git: Version control for large binary files and >1TB repositories?

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/5234318/

Time: 2020-09-19 05:09:59  Source: igfitidea

Version control for large binary files and >1TB repositories?

Tags: svn, git, version-control, packaging

Asked by Christoph Voigt

Sorry to bring up this topic again, as there are so many other questions already related, but none of them covers my problem directly.

What I'm searching for is a good version control system that can handle just two simple requirements:

  1. store large binary files (>1GB)
  2. support a repository that's >1TB (yes, that's TB)

Why? We're in the process of repackaging a few thousand software applications for our next big OS deployment and we want those packages to follow version control.

So far I've got some experience with SVN and CVS, however I'm not quite satisfied with the performance of both with large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure if they scale well with the amount of data we're expecting in the next 2-5 years (like I said, estimated >1TB)

So, do you have any recommendations? I'm currently also looking into SVN Externals as well as Git Submodules, though that would mean several individual repositories for each software package and I'm not sure that's what we want..

Accepted answer by HardCode

Version control systems are for source code, not binary builds. You are better off just using standard network file server backup tapes for binary file backup - even though it's largely unnecessary when you have source code control since you can just rebuild any version of any binary at any time. Trying to put binaries in source code control is a mistake.

What you are really talking about is a process known as configuration management. If you have thousands of unique software packages, your business should have a configuration manager (a person, not software ;-) ) who manages all of the configurations (a.k.a. builds) for development, testing, release, release-per-customer, etc.

Answer by Mats Ekberg

Take a look at Boar, "Simple version control and backup for photos, videos and other binary files". It can easily handle huge files and huge repositories.

Answer by Robert Cowham

Old question, but perhaps worth pointing out that Perforce is in use at lots of large companies, and in particular at games development companies, where multi-terabyte repositories with many large binary files are common.

(Disclaimer: I work at Perforce)

Answer by VonC

Update May 2017:

Git, with the addition of GVFS (Git Virtual File System), can support virtually any number of files of any size (starting with the Windows repository itself: "The largest Git repo on the planet", 3.5M files, 320GB).
This is not yet >1TB, but it can scale there.

The work done with GVFS is slowly being proposed upstream (that is, to Git itself), but that is still a work in progress.
GVFS is implemented on Windows, but will soon be done for Mac (because the team at Microsoft developing Office for Mac demands it), and Linux.



April 2015

Git can actually be considered a viable VCS for large data, with Git Large File Storage (LFS) (by GitHub, April 2015).

git-lfs (see git-lfs.github.com) can be tested with a server supporting it, such as lfs-test-server (or directly with github.com itself).
You can store only the metadata in the git repo, and the large files elsewhere.

https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif
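
To make the "metadata only" idea concrete, here is a minimal Python sketch of the kind of small text pointer that git-lfs commits in place of a large file. The three-line layout (version, oid, size) follows the published Git LFS pointer format; the function name `lfs_pointer_for_stream` and the chunk size are illustrative assumptions, not part of git-lfs itself.

```python
import hashlib
import io


def lfs_pointer_for_stream(stream, chunk_size=1 << 20):
    """Return Git LFS style pointer text for a binary stream.

    The stream is hashed in chunks so that files larger than memory
    (for example a >1GB MSI) never have to be loaded at once.
    """
    digest = hashlib.sha256()
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        size += len(chunk)
    # Only these few lines would live in the git repository;
    # the real content is kept by the LFS server under its oid.
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )


if __name__ == "__main__":
    # Hypothetical stand-in for a large package file.
    fake_package = io.BytesIO(b"pretend this is a huge installer")
    print(lfs_pointer_for_stream(fake_package))
```

In real use you would not write pointers by hand: the git-lfs client generates them when a tracked file is added. The sketch only shows why the repository itself stays small.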

Answer by Rudi

When you really have to use a VCS, I would use svn, since svn does not require copying the entire repository to the working copy. But it still needs about twice the disk space, since it keeps a clean copy of each file.

With this amount of data I would look for a document management system, or (low level) use a read-only network share with a defined input process.

Answer by bahrep

  • store large binary files (>1GB)
  • support a repository that's >1TB (yes, that's TB)

Yep, that is one of the cases Apache Subversion should fully support.

So far I've got some experience with SVN and CVS, however I'm not quite satisfied with the performance of both with large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure if they scale well with the amount of data we're expecting in the next 2-5 years (like I said, estimated >1TB)

Up-to-date Apache Subversion servers and clients should have no problems controlling such an amount of data, and they scale perfectly. Moreover, there are various repository replication approaches that should improve performance in case you have multiple sites with developers working on the same projects.

I'm currently also looking into SVN Externals as well as Git Submodules, though that would mean several individual repositories for each software package and I'm not sure that's what we want..

svn:externals have nothing to do with the support for large binaries or multi-terabyte projects. Subversion scales perfectly and supports very large data and code bases in a single repository. But Git does not. With Git, you'll have to divide and split the projects into multiple small repositories. This is going to lead to a lot of drawbacks and a constant PITA. That's why Git has a lot of add-ons such as git-lfs that try to make the problem less painful.

Answer by conny

You might be much better off by simply relying on some NAS device that would provide a combination of filesystem-accessible snapshots together with single instance store / block level deduplication, given the scale of data you are describing ...
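
For readers unfamiliar with block-level deduplication, the following is a toy, in-memory Python sketch of the idea, assuming fixed-size blocks keyed by their SHA-256 digest. Real NAS devices do this inside the filesystem and often use different block strategies; the class name `DedupStore` and the tiny block size are hypothetical and chosen only for illustration.

```python
import hashlib


class DedupStore:
    """Toy single-instance store: each distinct block is kept only once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes

    def put(self, data):
        """Store data and return the list of block digests describing it."""
        digests = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # A block already seen (in this or any earlier version) is reused.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        return digests

    def get(self, digests):
        """Rebuild the original data from its list of block digests."""
        return b"".join(self.blocks[d] for d in digests)


if __name__ == "__main__":
    store = DedupStore(block_size=4)
    v1 = store.put(b"AAAABBBBCCCC")  # first "snapshot" of a package
    v2 = store.put(b"AAAABBBBDDDD")  # second version, only the last block differs
    print(len(store.blocks))         # 4 distinct blocks stored, not 6
    print(store.get(v1) == b"AAAABBBBCCCC")
```

Two versions of a package that share most of their content therefore cost little more than one, which is the property that makes snapshotting large binaries affordable.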

(The question also mentions .cab & .msi files: usually the CI software of your choice has some method of archiving builds. Is that what you are ultimately after?)

Answer by jfriedmanlex

There are a couple of companies with products for "Wide Area File Sharing." They can replicate large files to different locations, but have distributed locking mechanisms so only one person can work on any of the copies. When a person checks in an updated copy, that is replicated to the other sites. The major use is CAD/CAM files and other large files. See Peer Software (http://www.peersoftware.com/index.aspx) and GlobalSCAPE (http://www.globalscape.com/).

Answer by gregsohl

This is an old question, but one possible answer is https://www.plasticscm.com/. Their VCS can handle very large files and very large repositories. They were my choice when we were choosing a couple years ago, but management pushed us elsewhere.

Answer by Arrowmaster

The perks that come with a versioning system (changelog, easy RSS access etc.) are nonexistent on a simple fileshare.

If you only care about the versioning metadata features and don't actually care about the old data then a solution that uses a VCS without storing the data in the VCS may be an acceptable option.

git-annex is the first one that came to my mind, but from the "what git-annex is not" page it seems there are other similar, but not exactly the same, alternatives.

I have not used git-annex, but from the description and walkthrough it sounds like it could work for your situation.
