Git with large files

Note: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. If you reuse it, you must follow the same license and attribute the original authors (not this site). Original: http://stackoverflow.com/questions/17888604/



Tags: git, large-files, gitlab

Asked by Jakub Riedl

Situation


I have two servers, Production and Development. On the Production server there are two applications and multiple (6) MySQL databases which I need to distribute to developers for testing. All source code is stored in GitLab on the Development server; developers work only with this server and don't have access to the Production server. When we release an application, the master logs into Production and pulls the new version from Git. The databases are large (over 500 MB each and growing) and I need to distribute them to developers for testing as easily as possible.


Possible solutions


  • After the backup script dumps each database to a single file, execute a script that pushes each database to its own branch. A developer pulls one of these branches when he wants to update his local copy.

    This approach was found not to work.

  • A cron job on the production server saves the binary logs every day and pushes them into that database's branch. The branch then contains files with the daily changes, and a developer pulls the files he doesn't have. The current SQL dump is sent to the developer another way. When the size of the repository becomes too large, we will send a full dump to the developers, flush all data in the repository, and start from the beginning.


Questions


  • Is the solution possible?
  • If git is pushing/pulling to/from the repository, does it upload/download whole files, or just the changes in them (i.e. adds new lines or edits the current ones)?
  • Can Git manage such large files? No.
  • How to set how many revisions are preserved in a repository? Doesn't matter with the new solution.
  • Is there any better solution? I don't want to force the developers to download such large files over FTP or anything similar.

Accepted answer by PeterSW

rsync could be a good option for efficiently updating the developers' copies of the databases.


It uses a delta algorithm to incrementally update the files. That way it only transfers the blocks of the file that have changed or that are new. They will of course still need to download the full file first but later updates would be quicker.


Essentially you get a similar incremental update as a git fetch, without the ever-expanding initial copy that a git clone would give. The loss is not having the history, but it sounds like you don't need that.


rsync is a standard part of most Linux distributions; if you need it on Windows there is a packaged port available: http://itefix.no/cwrsync/


To push the databases to a developer you could use a command similar to:


rsync -avz path/to/database(s) HOST:/folder

Or the developers could pull the database(s) they need with:


rsync -avz DATABASE_HOST:/path/to/database(s) path/where/developer/wants/it
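For example, a developer-side pull script might look like this (a sketch; the backup host name and paths are assumptions, and --partial lets an interrupted transfer of a large dump resume):

#!/bin/sh
# Pull all MySQL dump files from the production backup host (hypothetical names).
# -a preserves attributes, -z compresses on the wire, --partial keeps interrupted files so the next run can resume.
DUMP_HOST=backup.example.com
DEST="$HOME/db-dumps"
mkdir -p "$DEST"
rsync -avz --partial --progress "$DUMP_HOST:/var/backups/mysql/" "$DEST/"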

Answer by VonC

Update 2017:


Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet"
(ie: the Windows code base, which is approximately 3.5M files and, when checked in to a Git repo, results in a repo of about 300GB, and produces 1,760 daily “lab builds” across 440 branches in addition to thousands of pull request validation builds)


GVFS virtualizes the file system beneath your git repo so that git and all tools see what appears to be a normal repo, but GVFS only downloads objects as they are needed.


Some parts of GVFS might be contributed upstream (to Git itself).
But in the meantime, all new Windows development is now (August 2017) on Git.




Update April 2015: GitHub proposes: Announcing Git Large File Storage (LFS)


Using git-lfs (see git-lfs.github.com) and a server supporting it, such as lfs-test-server, you can store only the metadata in the git repo, and the large files elsewhere.


https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif


See git-lfs/wiki/Tutorial:


git lfs track '*.bin'
git add .gitattributes "*.bin"
git commit -m "Track .bin files"


Original answer:


Regarding what the git limitations with large files are, you can consider bup (presented in detail in GitMinutes #24).

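Before looking at its design, a minimal bup backup session looks roughly like this (a sketch; the paths and branch name are assumptions, and the bup repository defaults to ~/.bup):

bup init                                    # create the bup repository (a git repository under the hood)
bup index -u /var/backups/mysql             # build/update the index of files to save
bup save -n mysql-dumps /var/backups/mysql  # hashsplit the files and record them as a commit on branch "mysql-dumps"
bup restore -C /tmp/restore /mysql-dumps/latest/var/backups/mysql   # restore the latest save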

The design of bup highlights the three issues that limit a git repo:

bup设计突出了限制 git repo 的三个问题:

  • huge files (the xdelta for packfiles is in memory only, which isn't good with large files)
  • huge numbers of files, which means one file per blob, and a slow git gc to generate one packfile at a time.
  • huge packfiles, with a packfile index that is inefficient for retrieving data from the (huge) packfile.
  • 大文件packfile 的 xdelta仅在内存中,不适用于大文件)
  • 大量文件,这意味着每个 blob 一个文件,并且git gc一次生成一个包文件的速度很慢。
  • 巨大的包文件,带有包文件索引,无法从(巨大的)包文件中检索数据。


Handling huge files and xdelta


The primary reason git can't handle huge files is that it runs them through xdelta, which generally means it tries to load the entire contents of a file into memory at once.
If it didn't do this, it would have to store the entire contents of every single revision of every single file, even if you only changed a few bytes of that file.
That would be a terribly inefficient use of disk space, and git is well known for its amazingly efficient repository format.

Unfortunately, xdelta works great for small files and gets amazingly slow and memory-hungry for large files.
For git's main purpose, i.e. managing your source code, this isn't a problem.

What bup does instead of xdelta is what we call "hashsplitting."
We wanted a general-purpose way to efficiently back up any large file that might change in small ways, without storing the entire file every time. We read through the file one byte at a time, calculating a rolling checksum of the last 128 bytes.

rollsum seems to do pretty well at its job. You can find it in bupsplit.c.
Basically, it converts the last 128 bytes read into a 32-bit integer. What we then do is take the lowest 13 bits of the rollsum, and if they're all 1's, we consider that to be the end of a chunk.
This happens on average once every 2^13 = 8192 bytes, so the average chunk size is 8192 bytes.
We're dividing up those files into chunks based on the rolling checksum.
Then we store each chunk separately (indexed by its sha1sum) as a git blob.

With hashsplitting, no matter how much data you add, modify, or remove in the middle of the file, all the chunks before and after the affected chunk are absolutely the same.
All that matters to the hashsplitting algorithm is the 32-byte "separator" sequence, and a single change can only affect, at most, one separator sequence or the bytes between two separator sequences.
Like magic, the hashsplit chunking algorithm will chunk your file the same way every time, even without knowing how it had chunked it previously.

The next problem is less obvious: after you store your series of chunks as git blobs, how do you store their sequence? Each blob has a 20-byte sha1 identifier, which means the simple list of blobs is going to be 20/8192 = 0.25% of the file length.
For a 200GB file, that's 488 megs of just sequence data.

We extend the hashsplit algorithm a little further using what we call "fanout." Instead of checking just the last 13 bits of the checksum, we use additional checksum bits to produce additional splits.
What you end up with is an actual tree of blobs - which git 'tree' objects are ideal to represent.


Handling huge numbers of files and git gc


git is designed for handling reasonably-sized repositories that change relatively infrequently. You might think you change your source code "frequently" and that git handles much more frequent changes than, say, svn can handle.
But that's not the same kind of "frequently" we're talking about.

The #1 killer is the way it adds new objects to the repository: it creates one file per blob. Then you later run 'git gc' and combine those files into a single file (using highly efficient xdelta compression, and ignoring any files that are no longer relevant).

'git gc' is slow, but for source code repositories, the resulting super-efficient storage (and associated really fast access to the stored files) is worth it.
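You can watch this behaviour on any repository (a sketch; the numbers are of course repository-specific):

git count-objects -v    # "count" = loose objects (one file per blob), "in-pack" = objects already in packfiles
git gc                  # repack: loose objects are combined into packfiles
git count-objects -v    # the loose count drops to (near) zero and the in-pack count grows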

bup doesn't do that. It just writes packfiles directly.
Luckily, these packfiles are still git-formatted, so git can happily access them once they're written.


Handling huge repository (meaning huge numbers of huge packfiles)


Git isn't actually designed to handle super-huge repositories.
Most git repositories are small enough that it's reasonable to merge them all into a single packfile, which 'git gc' usually does eventually.

The problematic part of large packfiles isn't the packfiles themselves - git is designed to expect the total size of all packs to be larger than available memory, and once it can handle that, it can handle virtually any amount of data about equally efficiently.
The problem is the packfile index (.idx) files.

Each packfile (*.pack) in git has an associated idx (*.idx), which is a sorted list of git object hashes and file offsets.
If you're looking for a particular object based on its sha1, you open the idx, binary search it to find the right hash, then take the associated file offset, seek to that offset in the packfile, and read the object contents.

The performance of the binary search is about O(log n) with the number of hashes in the pack, with an optimized first step (you can read about it elsewhere) that somewhat improves it to O(log(n)-7).
Unfortunately, this breaks down a bit when you have lots of packs.
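You can inspect these structures in any repository (a sketch; the pack file names will differ):

ls .git/objects/pack/                                     # every pack-*.pack has a matching pack-*.idx
git verify-pack -v .git/objects/pack/pack-*.idx | head    # lists object hashes with their offsets inside the packfile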

To improve performance of this sort of operation, bup introduces midx (pronounced "midix" and short for "multi-idx") files.
As the name implies, they index multiple packs at a time.


Answer by Amber

You really, really, really do not want large binary files checked into your Git repository.


Each update you add will cumulatively add to the overall size of your repository, meaning that down the road your Git repo will take longer and longer to clone and use up more and more disk space. Because Git stores the entire history of the branch locally, when someone checks out the branch they don't just have to download the latest version of the database; they also have to download every previous version.


If you need to provide large binary files, upload them to some server separately, and then check in a text file with a URL where the developer can download the large binary file. FTP is actually one of the better options, since it's specifically designed for transferring binary files, though HTTP is probably even more straightforward.

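A hedged sketch of that approach (the file names, URL, and checksum file are assumptions): keep a small pointer file in git with the download URL plus a checksum, and let developers fetch and verify the dump outside of git.

# db1.sql.gz.url contains the download URL, db1.sql.gz.sha256 the expected checksum (both committed to git)
curl -fLo db1.sql.gz "$(cat db1.sql.gz.url)"   # fetch the large dump over HTTP(S), outside of git
sha256sum -c db1.sql.gz.sha256                 # verify it matches the checksum recorded in the repo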

Answer by VonC

You can look at a solution like git-annex, which is about managing (big) files with git without checking the file contents into git(!)
(Feb 2015: a hosting service like GitLab integrates it natively:
see "Does GitLab support large files via git-annex or otherwise?")

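A minimal git-annex workflow might look like this (a sketch; the file name and remote setup are assumptions):

git annex init                        # turn the current git repository into a git-annex repository
git annex add db1.sql.gz              # content goes into the annex; git itself only tracks a pointer/symlink
git commit -m "Add database dump via git-annex"
git annex sync                        # exchange git-annex metadata with the configured remotes
git annex get db1.sql.gz              # in another clone: fetch the actual content on demand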

git doesn't manage big files, as explained by Amber in her answer.


That doesn't mean git won't be able to do better one day though.
From GitMinutes episode 9 (May 2013, see also below), from Peff (Jeff King), at 36'10'':


(transcript)


There is a whole other realm of large repositories where people are interested in storing, you know, 20 or 30 or 40 GB, sometimes even TB-sized repositories, and yeah it comes from having a lot of files, but a lot of it comes from having really big files and really big binary files that don't deal so well with each other.

That's sort of an open problem. There are a couple of solutions: git-annex is probably the most mature of those, where they basically don't put the asset into git, they put the large asset on an asset server, and put a pointer into git.

I'd like to do something like that, where the asset is conceptually in git, that is the SHA1 of that object is part of the SHA1 that goes into the tree, that goes into the commit ID and all those things.
So from git perspective, it is part of the repository, but at a level below, at the object storage level, at a level below the conceptual history graph, where we already have multiple ways of storing an object: we have loose objects, we have packed objects, I'd like to have maybe a new way of storing an object which is to say "we don't have it here, but it is available by an asset server", or something like that.

(Thomas Ferris Nicolaisen) Oh cool...

The problem with things like git-annex is: once you use them, you're... locked in to the decisions you made at that time forever. You know, if you decide oh 200 MB is big, and we are gonna store on an asset server, and then, later you decide, aah it should have been 300 MB, well tough luck: that's encoded in your history forever.
And so by saying conceptually, at the git level, this object is in the git repository, not some pointer to it, not some pointer to an asset server, the actual object is there, and then taking care of those details at a low level, at the storage level, then that frees you up to make a lot of different decisions, and even change your decision later about how you actually want to store the stuff on disk.


Not a high-priority project for now...




Three years later, in April 2016, Git Minutes 40 includes an interview of Michael Haggerty from GitHub, around 31' (thank you Christian Couder for the interview).


He has specialized in the reference back-end for quite a while.
He cites David Turner's work on the back-end as the most interesting at the moment. (See David's current "pluggable-backends" branch of his git/git fork.)


(transcript)


Christian Couder (CD): The goal is to have git refs stored in a database, for example? Michael Haggerty (MH): Yeah, I see it as two interesting aspects: The first is simply having the ability to plug in different source entry references. Entry references are stored in the filesystem, as a combination of loose references and packed references.
Loose reference is one file per reference, and packed reference is one big file containing a list of many many references.

So that's a good system, especially for local usage; it doesn't have any real performance problem for normal people, but it does have some problems, like you can't store reference reflogs after the references have been deleted, because there can be conflicts with newer references which have been created with similar names. There is also a problem where reference names are stored on the filesystem, so you can have references which are named similarly but with different capitalization.
So those are things which could be fixed by having a different reference back-end system in general.
And the other aspect of David Turner's patch series is a change to store references in a database called lmdb; this is a really fast memory-based database that has some performance advantages over the file back-end.


[follows other considerations around having faster packing, and reference patch advertisement]


Answer by Alex North-Keys

Having an auxiliary storage of files referenced from your git-stashed code is where most people go. git-annex does look pretty comprehensive, but many shops just use an FTP or HTTP (or S3) repository for the large files, like SQL dumps. My suggestion would be to tie the code in the git repo to the names of the files in the auxiliary storage by stuffing some of the metadata, specifically a checksum (probably SHA) as well as a date, into the name (a small sketch follows the list below).


  • So each aux file gets a basename, date, and SHA (for some version n) sum.
  • If you have wild file turnover, using only a SHA poses a tiny but real threat of hash collision, hence the inclusion of a date (epoch time or ISO date).
  • Put the resulting filename into the code, so that the aux chunk is included, very specifically, by reference.
  • Structure the names in such a way that a little script can be written easily to git grep all the aux file names, so that the list for any commit is trivial to obtain. This also allows the old ones to be retired at some point, and can be integrated with the deployment system to pull the new aux files out to production without clobbering the old ones (yet), prior to activating the code from the git repo.
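A rough sketch of that naming and lookup scheme (the file names are hypothetical; SHA-256 is used here, but any strong checksum works):

# Build the aux file name from basename + date + checksum, copy it to the auxiliary storage,
# and commit only the small reference file to git.
SUM=$(sha256sum db1.sql.gz | cut -c1-12)        # shortened checksum, for readability only
NAME="db1_$(date -u +%Y%m%d)_${SUM}.sql.gz"
cp db1.sql.gz "aux-storage/$NAME"
echo "$NAME" > db1.aux-ref

# Later: list every aux file referenced by the checked-out code
git grep -h '\.sql\.gz$' -- '*.aux-ref' | sort -u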

Cramming huge files into git (or most repos) has a nasty impact on git's performance after a while - a git clone really shouldn't take twenty minutes, for example. Whereas using the files by reference means that some developers will never need to download the large chunks at all (a sharp contrast to the git clone), since the odds are that most are only relevant to the deployed code in production. Your mileage may vary, of course.


Answer by Ariful Islam

Uploading large files sometimes creates issues and errors, and it happens quite often. GitHub, for example, warns when you push files larger than 50 MB, so to upload bigger files (.mp4, .mp3, .psd, etc.) to a git repository you need an additional helper, Git LFS.

There are some basic git commands to know before uploading a big file. This is the configuration for uploading to GitHub; it requires installing gitlfs.exe.

Install it from lfsinstall.exe.




Then you should use the basic commands of git along with some different ones:




git lfs install
git init
git lfs track "*.mp4"
git lfs track "*.mp3"
git lfs track "*.psd"
git add .
git add .gitattributes
git config lfs.https://github.com/something/repo.git/info/lfs.locksverify false
git commit -m "Add design file"
git push origin master

If you push without using it, you may find instructions like lfs.https://github.com/something/repo.git/info/lfs.locksverify false during the push command.
