Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/1904860/

Date: 2020-09-10 07:30:19  Source: igfitidea

How to remove unreferenced blobs from my git repo

git

Asked by kkrugler

I have a GitHub repo that had two branches - master & release.

The release branch contained binary distribution files that were contributing to a very large repo size (> 250MB), so I decided to clean things up.

First I deleted the remote release branch, via git push origin :release

Then I deleted the local release branch. First I tried git branch -d release, but git said "error: The branch 'release' is not an ancestor of your current HEAD." which is true, so then I did git branch -D release to force it to be deleted.

But my repository size, both locally and on GitHub, was still huge. So then I ran through the usual list of git commands, like git gc --prune=today --aggressive, with no luck.

By following Charles Bailey's instructions at SO 1029969 I was able to get a list of SHA1s for the biggest blobs. I then used the script from SO 460331 to find the blobs... and the five biggest don't exist, though smaller blobs are found, so I know the script is working.

I think these blobs are the binaries from the release branch, and they somehow got left around after the delete of that branch. What's the right way to get rid of them?
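To confirm that suspicion before deleting anything, it can help to list the blobs that no branch or tag reaches any more. The following is only a sketch: the function name list_unreachable_blobs is hypothetical, and it assumes it is run from inside the repository. It prints "size sha" pairs, largest first.

```shell
#!/bin/sh
# Sketch: list unreachable blobs with their sizes in bytes, largest first.
# --no-reflogs tells fsck to ignore reflog entries, so objects that are kept
# alive only by the reflog are reported as unreachable too.
list_unreachable_blobs() {
    git fsck --unreachable --no-reflogs 2>/dev/null |
        awk '$1 == "unreachable" && $2 == "blob" { print $3 }' |
        while read -r sha; do
            # git cat-file -s prints the object's size in bytes
            printf '%s %s\n' "$(git cat-file -s "$sha")" "$sha"
        done |
        sort -rn
}

# Usage (from inside the repository):
#   git count-objects -vH      # overall loose and packed size
#   list_unreachable_blobs | head -5
```

If the largest entries match the binaries from the deleted branch, they are garbage waiting for their grace period to expire rather than data that is still referenced.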

Answered by Sam Watkins

... and without further ado, may I present to you this useful command, "git-gc-all", guaranteed to remove all your git garbage, at least until they come up with extra config variables:

git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc

You might also need to run something like these first, oh dear, git is complicated!!

git remote rm origin
rm -rf .git/refs/original/ .git/refs/remotes/ .git/*_HEAD .git/logs/
git for-each-ref --format="%(refname)" refs/original/ | xargs -n1 --no-run-if-empty git update-ref -d

You might also need to remove some tags, thanks Zitrax:

git tag | xargs git tag -d

I put all this in a script: git-gc-all-ferocious.

Answered by jiasli

As described here, if you want to permanently remove everything referenced only via reflog, simply use

git reflog expire --expire-unreachable=now --all
git gc --prune=now

git reflog expire --expire-unreachable=now --all removes all references to unreachable commits from the reflog.

git gc --prune=now removes the commits themselves.

Attention: Only using git gc --prune=now will not work, since those commits are still referenced in the reflog. Therefore, clearing the reflog is mandatory. Also note that if you use rerere, it has additional references not cleared by these commands. See git help rerere for more details. In addition, any commits referenced by local or remote branches or tags will not be removed, because git considers those valuable data.
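The two steps can be wrapped in a small helper. This is a minimal sketch, not an official git command: the name purge_unreachable is made up for illustration, and it assumes you really do want to lose the ability to recover deleted branches and commits.

```shell
#!/bin/sh
# Sketch: make objects that only the reflog keeps alive disappear right away.
# Destructive: deleted branches and commits cannot be recovered afterwards.
purge_unreachable() {
    # 1. Forget reflog entries that point at unreachable commits.
    git reflog expire --expire-unreachable=now --all &&
    # 2. Repack and delete every unreferenced object, with no grace period.
    git gc --quiet --prune=now
}

# Usage (from inside the repository):
#   git count-objects -vH   # size before
#   purge_unreachable
#   git count-objects -vH   # size after
```

The order matters: running the gc step first leaves the objects in place, because the reflog still references them at that point.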

Answered by VonC

As mentioned in this SO answer, git gc can actually increase the size of the repo!

See also this thread

Now git has a safety mechanism to not delete unreferenced objects right away when running 'git gc'.
By default unreferenced objects are kept around for a period of 2 weeks. This is to make it easy for you to recover accidentally deleted branches or commits, or to avoid a race where a just-created object that is about to be referenced, but is not yet, could be deleted by a 'git gc' process running in parallel.

So to give that grace period to packed but unreferenced objects, the repack process pushes those unreferenced objects out of the pack into their loose form so they can be aged and eventually pruned.
Objects becoming unreferenced are usually not that many though. Having 404855 unreferenced objects is quite a lot, and being sent those objects in the first place via a clone is stupid and a complete waste of network bandwidth.

Anyway... To solve your problem, you simply need to run 'git gc' with the --prune=now argument to disable that grace period and get rid of those unreferenced objects right away (safe only if no other git activities are taking place at the same time, which should be easy to ensure on a workstation).

And BTW, using 'git gc --aggressive' with a later git version (or 'git repack -a -f -d --window=250 --depth=250')

The same thread mentions:

 git config pack.deltaCacheSize 1

That limits the delta cache size to one byte (effectively disabling it) instead of the default of 0 which means unlimited. With that I'm able to repack that repository using the above git repack command on an x86-64 system with 4GB of RAM and using 4 threads (this is a quad core). Resident memory usage grows to nearly 3.3GB though.

If your machine is SMP and you don't have sufficient RAM then you can reduce the number of threads to only one:

git config pack.threads 1

Additionally, you can further limit memory usage with the --window-memory argument to 'git repack'.
For example, using --window-memory=128M should keep a reasonable upper bound on the delta search memory usage, although this can result in a less optimal delta match if the repo contains lots of large files.
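Put together, the memory-limiting settings quoted above might be combined as in the sketch below. The function name repack_low_memory is hypothetical. It passes the two config values with -c for this one invocation instead of writing them to the repository config, and takes the window memory limit as an optional argument (defaulting to the 128m from the example).

```shell
#!/bin/sh
# Sketch: full repack with the memory limits discussed above.
# $1 (optional): upper bound for delta search memory, e.g. 64m (default 128m).
repack_low_memory() {
    # pack.threads=1        -> a single delta-search thread
    # pack.deltaCacheSize=1 -> delta cache effectively disabled
    git -c pack.threads=1 -c pack.deltaCacheSize=1 \
        repack -a -d -f -q --window=250 --depth=250 \
        --window-memory="${1:-128m}"
}

# Usage (from inside the repository):
#   repack_low_memory        # 128m window memory
#   repack_low_memory 64m    # tighter limit, possibly worse compression
```

Using -c keeps the settings temporary, so later everyday git commands are not slowed down by them.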



On the filter-branch front, you can consider (with caution) this script:

#!/bin/bash
set -o errexit

# Author: David Underhill
# Script to permanently delete files/folders from your git repository.  To use 
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2

if [ $# -eq 0 ]; then
    exit 0
fi

# make sure we're at the root of git repo
if [ ! -d .git ]; then
    echo "Error: must run this script from the root of a git repository"
    exit 1
fi

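# remove all paths passed as arguments from the history of the repo
# (paths are joined with spaces, so paths that contain spaces are not supported)
files="$*"
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# remove the temporary history git-filter-branch otherwise leaves behind for a long time;
# without explicit "now" values, reflog entries and unreferenced objects outlive their grace periods
rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now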

Answered by Jakub Narębski

git gc --prune=now, or low level git prune --expire now.

Answered by vdboor

Each time your HEAD moves, git tracks this in the reflog. If you removed commits, you still have "dangling commits" because they are still referenced by the reflog for ~30 days. This is the safety net when you delete commits by accident.

You can use the git reflog command to remove specific commits, repack, etc., or just use the high-level command:

git gc --prune=now
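How long that safety net lasts depends on a few gc settings. The sketch below prints the effective values for the current repository. The function names are hypothetical, and the fallback values are the built-in defaults as described in the git-gc documentation (2 weeks for pruning, 90 days for reachable reflog entries, 30 days for unreachable ones).

```shell
#!/bin/sh
# Sketch: show the grace-period settings that keep unreferenced objects around.
show_setting() {
    # $1: config key, $2: built-in default used when the key is unset
    value=$(git config --get "$1") || value="$2 (built-in default)"
    printf '%s = %s\n' "$1" "$value"
}

show_gc_grace_settings() {
    show_setting gc.pruneExpire "2.weeks.ago"
    show_setting gc.reflogExpire "90 days"
    show_setting gc.reflogExpireUnreachable "30 days"
}

# Usage (from inside the repository):
#   show_gc_grace_settings
```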

Answered by nachoparker

You can use git forget-blob.

The usage is pretty simple: git forget-blob file-to-forget. You can get more info here:

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

The file will disappear from all the commits in your history, reflog, tags and so on.

I run into the same problem every now and then, and every time I have to come back to this post and others. That's why I automated the process.

Credits to contributors such as Sam Watkins

Answered by W55tKQbuRu28Q4xv

Try git-filter-branch. It does not remove big blobs as such, but it can remove big files that you specify from the whole repo. For me it reduced the repo size from hundreds of MB to 12 MB.

Answered by v_abhi_v

Before doing git filter-branch and git gc, you should review the tags that are present in your repo. Any real system that tags automatically for things like continuous integration and deployments will leave unwanted objects referenced by those tags. gc therefore can't remove them, and you will keep wondering why the repo is still so big.

The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repo. The new bare repo will have the cleaned-up tree.
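The last step could look like the sketch below. The function name push_to_fresh_bare and the example path are hypothetical. Only the named branch is pushed, so tags and other refs in the old repository, and whatever objects only they reference, stay behind.

```shell
#!/bin/sh
# Sketch: copy one cleaned-up branch into a brand-new bare repository.
# $1: path of the bare repository to create (hypothetical example path below)
# $2: branch to push, e.g. master
push_to_fresh_bare() {
    git init --quiet --bare "$1" &&
    git push --quiet "$1" "$2"
}

# Usage (from inside the cleaned-up repository):
#   git tag                                  # review existing tags first
#   push_to_fresh_bare ../clean.git master
```

Because a push only transfers objects reachable from the pushed ref, the new repository starts out without the leftover garbage even if the local one has not been fully pruned.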

Answered by StellarVortex

Sometimes the reason "gc" doesn't do much good is that there is an unfinished rebase, or a stash, based on an old commit.
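A quick check for both situations might look like this. It is only a sketch: the function name report_gc_blockers is made up, and it relies on the rebase-merge and rebase-apply directories that git creates inside the git directory while a rebase is in progress.

```shell
#!/bin/sh
# Sketch: report leftovers that can keep old commits reachable.
report_gc_blockers() {
    gitdir=$(git rev-parse --git-dir) || return 1
    # An interrupted rebase leaves one of these directories behind.
    if [ -d "$gitdir/rebase-merge" ] || [ -d "$gitdir/rebase-apply" ]; then
        echo "rebase in progress"
    fi
    # Every stash entry is a commit built on top of the commit it was made from.
    n=$(git stash list | wc -l | tr -d ' ')
    if [ "$n" -gt 0 ]; then
        echo "stash entries: $n"
    fi
    return 0
}

# Usage (from inside the repository):
#   report_gc_blockers
```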

Answered by Tanguy

To add another tip, don't forget to use git remote prune to delete the obsolete branches of your remotes before using git gc.

You can see them with git branch -a.

It's often useful when you fetch from GitHub and forked repositories...
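As a sketch of that tip, the loop below prunes stale remote-tracking branches for every configured remote. The function name prune_all_remotes is hypothetical, and it assumes remote names contain no whitespace.

```shell
#!/bin/sh
# Sketch: drop remote-tracking branches whose upstream branch no longer exists.
prune_all_remotes() {
    for remote in $(git remote); do
        git remote prune "$remote"
    done
}

# Usage (from inside the repository):
#   git branch -a                    # list local and remote-tracking branches
#   git remote prune -n origin       # dry run: show what would be pruned
#   prune_all_remotes
#   git gc --prune=now
```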