Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/2100907/

Date: 2020-09-10 07:41:11  Source: igfitidea

How to remove/delete a large file from commit history in Git repository?

Tags: git, version-control, git-rebase, git-rewrite-history

Asked by culebrón

I accidentally dropped a DVD-rip into a website project, then carelessly ran git commit -a -m ..., and, zap, the repo was bloated by 2.2 gigs. Next time I made some edits, deleted the video file, and committed everything, but the compressed file is still there in the repository, in history.

I know I can start branches from those commits and rebase one branch onto another. But what should I do to merge the 2 commits together so that the big file doesn't show up in the history and is cleaned up by garbage collection?

Accepted answer by Roberto Tyley

Use the BFG Repo-Cleaner, a simpler, faster alternative to git-filter-branch specifically designed for removing unwanted files from Git history.

Carefully follow the usage instructions; the core part is just this:

$ java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git

Any files over 100MB in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive
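
To see whether the cleanup actually shrank the repository, one option is to compare the packed size before and after. The following is a rough sketch, not part of the BFG instructions: it reads the size-pack field (reported in KiB) from git count-objects -v. The helper name pack_size_kib and the throwaway repository are made up for illustration.

```shell
# Sketch: report the packed size of a repository in KiB.
pack_size_kib() {   # pack_size_kib <repo-dir>
  git -C "$1" count-objects -v | awk '/^size-pack:/ {print $2}'
}

# Demo on a throwaway repository (names are illustrative).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo"
git -C "$repo" config user.email "demo@example.com"
echo hello > "$repo/index.html"
git -C "$repo" add index.html
git -C "$repo" commit -q -m "Index"
git -C "$repo" gc -q --prune=now --aggressive   # packs loose objects

size_kib=$(pack_size_kib "$repo")
echo "size-pack: ${size_kib} KiB"
```

Running the helper once before the rewrite and once after git gc gives a quick before/after comparison.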

The BFG is typically at least 10-50x faster than running git-filter-branch, and generally easier to use.

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Answer by Greg Bacon

What you want to do is highly disruptive if you have published history to other developers. See “Recovering From Upstream Rebase” in the git rebase documentation for the necessary steps after repairing your history.

You have at least two options: git filter-branch and an interactive rebase, both explained below.

Using git filter-branch

I had a similar problem with bulky binary test data from a Subversion import and wrote about removing data from a git repository.

Say your git history is:

$ git lola --name-status
* f772d66 (HEAD, master) Login page
| A     login.html
* cb14efd Remove DVD-rip
| D     oops.iso
* ce36c98 Careless
| A     oops.iso
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

Note that git lola is a non-standard but highly useful alias. With the --name-status switch, we can see tree modifications associated with each commit.
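
For readers who don't have the alias: the answer doesn't show its definition, so the sketch below assumes the widely circulated one (a decorated one-line graph log over all refs). It sets the alias locally in a throwaway repository rather than globally; the repository and its contents are invented for the demo.

```shell
# Sketch: define a repo-local `lola` alias in a throwaway repository.
# The alias body is the commonly circulated definition, assumed here.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo" && git config user.email "demo@example.com"
git config alias.lola "log --graph --decorate --pretty=oneline --abbrev-commit --all"

echo '<html></html>' > index.html
git add index.html
git commit -q -m "Index"

# --name-status appends the A/M/D status of each touched path.
lola_out=$(git lola --name-status)
echo "$lola_out"
```

To make the alias available everywhere, the same git config line can be run with --global.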

In the “Careless” commit (whose SHA1 object name is ce36c98) the file oops.iso is the DVD-rip added by accident and removed in the next commit, cb14efd. Using the technique described in the aforementioned blog post, the command to execute is:

git filter-branch --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \
  --tag-name-filter cat -- --all

Options:

  • --prune-empty removes commits that become empty (i.e., do not change the tree) as a result of the filter operation. In the typical case, this option produces a cleaner history.
  • -d names a temporary directory that does not yet exist to use for building the filtered history. If you are running on a modern Linux distribution, specifying a tree in /dev/shm will result in faster execution.
  • --index-filter is the main event and runs against the index at each step in the history. You want to remove oops.iso wherever it is found, but it isn't present in all commits. The command git rm --cached -f --ignore-unmatch oops.iso deletes the DVD-rip when it is present and does not fail otherwise.
  • --tag-name-filter describes how to rewrite tag names. A filter of cat is the identity operation. Your repository, like the sample above, may not have any tags, but I included this option for full generality.
  • -- specifies the end of options to git filter-branch.
  • --all following -- is shorthand for all refs. Your repository, like the sample above, may have only one ref (master), but I included this option for full generality.
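
Before running this on a real repository, it can help to rehearse on a disposable one. The sketch below rebuilds a history shaped like the sample above and applies the same filter. It leaves out -d /dev/shm/scratch, since that path is Linux-specific, and sets FILTER_BRANCH_SQUELCH_WARNING, which newer Git versions honor to skip the startup warning. The commit_file helper and the file contents are made up for the demo.

```shell
# Sketch: rehearse the filter-branch command on a disposable repository.
# Repository layout and identity are invented for the demo.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning and delay
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo" && git config user.email "demo@example.com"

commit_file() {   # commit_file <message> <path>...
  msg=$1; shift
  for f in "$@"; do echo "$f" > "$f"; done
  git add "$@"
  git commit -q -m "$msg"
}

commit_file "Index" index.html
commit_file "Admin page" admin.html
commit_file "Careless" oops.iso other.html
git rm -q oops.iso && git commit -q -m "Remove DVD-rip"
commit_file "Login page" login.html

git filter-branch --prune-empty \
  --index-filter "git rm --cached -f --ignore-unmatch oops.iso" \
  --tag-name-filter cat -- --all >/dev/null 2>&1

# Commits on the rewritten branch that still touch oops.iso (expected: none).
remaining=$(git log --format=%s HEAD -- oops.iso)
subjects=$(git log --format=%s HEAD)
echo "$subjects"
```

If the rehearsal behaves as expected, the "Remove DVD-rip" commit is gone from the branch because --prune-empty dropped it once it no longer changed the tree.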

After some churning, the history is now:

$ git lola --name-status
* 8e0a11c (HEAD, master) Login page
| A     login.html
* e45ac59 Careless
| A     other.html
|
| * f772d66 (refs/original/refs/heads/master) Login page
| | A   login.html
| * cb14efd Remove DVD-rip
| | D   oops.iso
| * ce36c98 Careless
|/  A   oops.iso
|   A   other.html
|
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

Notice that the new “Careless” commit adds only other.html and that the “Remove DVD-rip” commit is no longer on the master branch. The branch labeled refs/original/refs/heads/master contains your original commits in case you made a mistake. To remove it, follow the steps in “Checklist for Shrinking a Repository.”

$ git update-ref -d refs/original/refs/heads/master
$ git reflog expire --expire=now --all
$ git gc --prune=now
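
A quick way to confirm that nothing still points at the file is to count the commits, across all refs, that touch its path. This is only a sketch: commits_touching is a made-up helper name, and the demo repository exists purely to show the two possible outcomes. Because git log --all also walks refs/original/*, a nonzero count before the update-ref step above is expected.

```shell
# Sketch: count commits, across all refs, that touch a given path.
# A result of 0 means no ref (including refs/original/*) still references it.
commits_touching() {   # commits_touching <repo-dir> <path>
  git -C "$1" log --all --format=%H -- "$2" | wc -l | tr -d ' '
}

# Demo on a throwaway repository (names are illustrative).
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.name "Demo"
git -C "$repo" config user.email "demo@example.com"
echo data > "$repo/oops.iso"
git -C "$repo" add oops.iso
git -C "$repo" commit -q -m "Careless"

hits_present=$(commits_touching "$repo" oops.iso)     # file was added once
hits_absent=$(commits_touching "$repo" never-there)   # path never existed
echo "$hits_present $hits_absent"
```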

For a simpler alternative, clone the repository to discard the unwanted bits.

$ cd ~/src
$ mv repo repo.old
$ git clone file:///home/user/src/repo.old repo

Using a file:///... clone URL copies objects rather than creating hardlinks only.

Now your history is:

$ git lola --name-status
* 8e0a11c (HEAD, master) Login page
| A     login.html
* e45ac59 Careless
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

The SHA1 object names for the first two commits (“Index” and “Admin page”) stayed the same because the filter operation did not modify those commits. “Careless” lost oops.iso and “Login page” got a new parent, so their SHA1s did change.

Interactive rebase

With a history of:

$ git lola --name-status
* f772d66 (HEAD, master) Login page
| A     login.html
* cb14efd Remove DVD-rip
| D     oops.iso
* ce36c98 Careless
| A     oops.iso
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

you want to remove oops.iso from “Careless” as though you never added it, and then “Remove DVD-rip” is useless to you. Thus, our plan going into an interactive rebase is to keep “Admin page,” edit “Careless,” and discard “Remove DVD-rip.”

Running git rebase -i 5af4522 starts an editor with the following contents.

pick ce36c98 Careless
pick cb14efd Remove DVD-rip
pick f772d66 Login page

# Rebase 5af4522..f772d66 onto 5af4522
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# If you remove a line here THAT COMMIT WILL BE LOST.
# However, if you remove everything, the rebase will be aborted.
#

Executing our plan, we modify it to

edit ce36c98 Careless
pick f772d66 Login page

# Rebase 5af4522..f772d66 onto 5af4522
# ...

That is, we delete the line with “Remove DVD-rip” and change the operation on “Careless” to be edit rather than pick.

Save-quitting the editor drops us at a command prompt with the following message.

Stopped at ce36c98... Careless
You can amend the commit now, with

        git commit --amend

Once you are satisfied with your changes, run

        git rebase --continue

As the message tells us, we are on the “Careless” commit we want to edit, so we run the following commands.

$ git rm --cached oops.iso
$ git commit --amend -C HEAD
$ git rebase --continue

The first removes the offending file from the index. The second modifies or amends “Careless” to be the updated index and -C HEAD instructs git to reuse the old commit message. Finally, git rebase --continue goes ahead with the rest of the rebase operation.
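
If you'd rather script the same plan than edit the todo list by hand, one possible approach is to let GIT_SEQUENCE_EDITOR rewrite the list with sed. This is a sketch under a few assumptions: the todo list starts with the “Careless” line followed by the “Remove DVD-rip” line, as in the sample history, and the throwaway repository below stands in for the real one.

```shell
# Sketch: the same edit/drop plan without opening an editor.
# GIT_SEQUENCE_EDITOR receives the todo file path as its last argument.
export GIT_EDITOR=true
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo" && git config user.email "demo@example.com"

echo index > index.html; git add index.html; git commit -q -m "Index"
echo admin > admin.html; git add admin.html; git commit -q -m "Admin page"
base=$(git rev-parse HEAD)
echo big > oops.iso; echo other > other.html
git add oops.iso other.html; git commit -q -m "Careless"
git rm -q oops.iso; git commit -q -m "Remove DVD-rip"
echo login > login.html; git add login.html; git commit -q -m "Login page"

# Line 1 of the todo is "Careless" (pick -> edit); line 2 is "Remove DVD-rip" (deleted).
GIT_SEQUENCE_EDITOR="sed -i.bak -e '1s/^pick/edit/' -e '2d'" \
  git rebase -i "$base" >/dev/null 2>&1

git rm -q --cached oops.iso
git commit -q --amend -C HEAD
git rebase --continue >/dev/null 2>&1

subjects=$(git log --format=%s)
touched=$(git log --format=%s -- oops.iso)
echo "$subjects"
```

Note that git rm --cached leaves oops.iso behind as an untracked file in the working tree; delete it separately if you no longer need it.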

This gives a history of:

$ git lola --name-status
* 93174be (HEAD, master) Login page
| A     login.html
* a570198 Careless
| A     other.html
* 5af4522 Admin page
| A     admin.html
* e738b63 Index
  A     index.html

which is what you want.

Answer by Gary Gauh

Why not use this simple but powerful command?

git filter-branch --tree-filter 'rm -f DVD-rip' HEAD

The --tree-filter option runs the specified command after each checkout of the project and then recommits the results. In this case, you remove a file called DVD-rip from every snapshot, whether it exists or not.

If you know which commit introduced the huge file (say 35dsa2), you can replace HEAD with 35dsa2..HEAD to avoid rewriting too much history, thus avoiding diverging commits if you haven't pushed yet. This comment courtesy of @alpha_989 seems too important to leave out here.

See this link.

Answer by Sridhar Sarnobat

(The best answer I've seen to this problem is: https://stackoverflow.com/a/42544963/714112, copied here since this thread appears high in Google search rankings but that other one doesn't)

A blazingly fast shell one-liner

This shell script displays all blob objects in the repository, sorted from smallest to largest.

For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5,622,155 objects in just over a minute.

The Base Script

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort --numeric-sort --key=2 \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

When you run the above code, you will get nice human-readable output like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

Fast File Removal

Suppose you then want to remove the files a and b from every commit reachable from HEAD. You can use this command:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch a b' HEAD

Answer by Justin

After trying virtually every answer on SO, I finally found this gem that quickly removed and deleted the large files in my repository and allowed me to sync again: http://www.zyxware.com/articles/4027/how-to-delete-files-permanently-from-your-local-and-remote-git-repositories

CD to your local working folder and run the following command:

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch FOLDERNAME" -- --all

Replace FOLDERNAME with the file or folder you wish to remove from the given git repository.

Once this is done, run the following commands to clean up the local repository:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now

Now push all the changes to the remote repository:

git push --all --force

This will clean up the remote repository.

Answer by Kostanos

These commands worked in my case:

git filter-branch --force --index-filter 'git rm --cached -r --ignore-unmatch oops.iso' --prune-empty --tag-name-filter cat -- --all
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now

It is a little different from the above versions.

For those who need to push this to github/bitbucket (I only tested this with bitbucket):

# WARNING!!!
# this will rewrite completely your bitbucket refs
# will delete all branches that you didn't have in your local

git push --all --prune --force

# Once you pushed, all your teammates need to clone repository again
# git pull will not work

Answer by mkljun

Just note that these commands can be very destructive. If more people are working on the repo, they'll all have to pull the new tree. The three middle commands are not necessary if your goal is NOT to reduce the size, because filter-branch creates a backup of the removed file and it can stay there for a long time.

$ git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch YOURFILENAME" HEAD
$ rm -rf .git/refs/original/
$ git reflog expire --all
$ git gc --aggressive --prune
$ git push origin master --force

Answer by Thorsten Lorenz

git filter-branch --tree-filter 'rm -f path/to/file' HEAD worked pretty well for me, although I ran into the same problem as described here, which I solved by following this suggestion.

The pro-git book has an entire chapter on rewriting history; have a look at the filter-branch/Removing a File from Every Commit section.

Answer by Soheil

If you know your commit was recent, then instead of going through the entire tree, do the following: git filter-branch --tree-filter 'rm -f LARGE_FILE.zip' HEAD~10..HEAD
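
To illustrate why limiting the range matters, here is a small sketch on a throwaway repository: only the two newest commits are rewritten, and the older commit keeps its SHA. The repository contents and commit messages are invented, and the sketch uses rm -f so the filter does not fail on commits where the file is absent.

```shell
# Sketch: rewrite only the last two commits and confirm older history is untouched.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's startup warning and delay
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo" && git config user.email "demo@example.com"

echo one > a.txt; git add a.txt; git commit -q -m "Old work"
old_before=$(git rev-parse HEAD)
echo big > LARGE_FILE.zip; git add LARGE_FILE.zip; git commit -q -m "Add archive by mistake"
echo two > b.txt; git add b.txt; git commit -q -m "More work"

# HEAD~2..HEAD covers the two newest commits; "Old work" is outside the range.
git filter-branch --tree-filter 'rm -f LARGE_FILE.zip' HEAD~2..HEAD >/dev/null 2>&1

old_after=$(git rev-parse HEAD~2)
still_there=$(git log --format=%s -- LARGE_FILE.zip)
echo "$old_before $old_after"
```

Without --prune-empty, the commit that only added the archive stays in the history as an empty commit.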

Answer by lfender6445

I ran into this with a bitbucket account, where I had accidentally stored ginormous *.jpa backups of my site.

git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch MY-BIG-DIRECTORY-OR-FILE' --tag-name-filter cat -- --all

Replace MY-BIG-DIRECTORY-OR-FILE with the folder or file in question to completely rewrite your history (including tags).

source: https://web.archive.org/web/20170727144429/http://naleid.com:80/blog/2012/01/17/finding-and-purging-big-files-from-git-history/
