Original question: http://stackoverflow.com/questions/11255802/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
After deleting a binary file from Git history why is my repository still large?
Asked by James McMahon
So let me preface this question by saying that I am aware of the previous questions pertaining to this subject on Stack Overflow. In fact, I've tried all the solutions I could find, but there is a binary file in my repo that just refuses to be removed and continues to greatly inflate my repo size.
Methods I've tried,
Both of which were recommended by Darhuuk's answer to "Remove files from git repo completely".
However, after trying both of those solutions, the script to find large files in git still finds the offending binary, while the script from this answer no longer finds the commit for the binary. Both of these scripts were suggested by this answer.
The repo is still 44 MB after the attempts at removal, which is far too large for the relatively small size of the source. That suggests the large-file script is doing its job properly. I've tried pushing up to GitHub (I made a fork just in case) and then doing a fresh clone to see if the repo size had decreased, but it is still the same size.
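One way to check whether a cleanup actually shrank a repository is to measure its object store before and after. The following is a minimal sketch, not something from the original post: the repo_size_kib helper name is made up for illustration, and the demo runs in a throwaway directory so no real repository is touched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: on-disk size of loose plus packed objects, in KiB,
# as reported by "git count-objects -v".
repo_size_kib() {
  git -C "${1:-.}" count-objects -v |
    awk '$1 == "size:" || $1 == "size-pack:" { total += $2 } END { print total + 0 }'
}

# Demo in a scratch repository with one incompressible 200 kB file.
demo=$(mktemp -d)
git init -q "$demo"
head -c 200000 /dev/urandom > "$demo/big.bin"
git -C "$demo" add big.bin

size_kib=$(repo_size_kib "$demo")
echo "object store: ${size_kib} KiB"
rm -rf "$demo"
```

Running the helper before and after a history rewrite gives a concrete number to compare, instead of relying on the size of the whole working directory.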
Can someone explain what I am doing wrong or suggest an alternative method?
I should note that I am not just interested in trimming the file from my local repo; I also want to be able to fix the remote repo on GitHub.
Answered by James McMahon
2017 Edit: You should probably look into BFG Repo-Cleaner if you are reading this.
So, embarrassingly, the reason my local repos were not shrinking in size was that I was using the wrong path to the file in filter-branch. So while I thank J-16 SDiZ and CodeGnome for their answers, my problem was between the chair and the keyboard.
In an effort to make this question less of a monument to my stupidity and actually useful to people, I've taken the time to write up the steps one would have to go through after trimming the repo in order to get the repo back up on GitHub. Hope this helps someone out down the line.
Removing offending files
To remove the offending files, run the shell script below, which is based on the GitHub remove sensitive data howto:
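#!/usr/bin/env bash
# The file path was lost from this listing; taking it from the first
# script argument ("$1") is an assumption.
# Drop the file from the index of every commit on every branch and tag.
git filter-branch --index-filter "git rm -r -q --cached --ignore-unmatch '$1'" --prune-empty --tag-name-filter cat -- --all
# Delete the backup refs that filter-branch leaves behind.
rm -rf .git/refs/original/
# Expire reflog entries and prune the now-unreferenced objects.
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now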
I went through every branch in my local repository and did this, but you don't need to do it on every branch, because the "-- --all" at the end of the filter-branch command already rewrites all refs. You do, however, need every branch locally for the next step, so keep that in mind. Once you are done you should see the size decrease in your local repo. You should also be able to run the blob script in CodeGnome's answer and see that the offending blob has been removed. If not, double-check the file name and path and make sure they are correct.
What git filter-branch is actually doing here is running the command listed in quotes on each commit in the repo.
The rest of the script just cleans out what still holds on to the old data: the backup refs, the reflog entries, and the unreferenced objects themselves.
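To see the whole sequence work end to end, here is a self-contained sketch that builds a scratch repository, commits a binary, and then removes it from history. The file name assets.bin, the demo identity, and the sizes are all made up for illustration. FILTER_BRANCH_SQUELCH_WARNING only suppresses the warning that newer Git versions print before running filter-branch.

```shell
#!/usr/bin/env bash
export FILTER_BRANCH_SQUELCH_WARNING=1

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"   # placeholder identity for the demo commits
git config user.name "Demo"

echo "source" > main.txt
head -c 500000 /dev/urandom > assets.bin   # hypothetical large binary
git add . && git commit -q -m "add source and binary"
git rm -q assets.bin && git commit -q -m "delete binary (it is still in history)"

git gc -q --prune=now
before=$(git count-objects -v | awk '$1 == "size-pack:" { print $2 }')

# Rewrite every commit, dropping assets.bin from each commit's index.
git filter-branch --index-filter \
  'git rm -r -q --cached --ignore-unmatch assets.bin' \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# Remove the backup refs and reflog entries that still point at the old commits,
# then let gc prune the objects nothing refers to any more.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc -q --aggressive --prune=now

after=$(git count-objects -v | awk '$1 == "size-pack:" { print $2 }')
echo "pack size before: ${before} KiB, after: ${after} KiB"
```

Deleting the file in a normal commit leaves the pack at roughly the size of the binary; only after the rewrite plus the cleanup does the pack shrink. If the path given to git rm does not match, nothing is removed and the two numbers stay the same.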
Pushing the trimmed repo
Now that the local repo is in the state you need it to be in, the trick is to get it back up on GitHub. Unfortunately, as far as I can tell there is no way to completely remove the binary data from the GitHub repo. Here is the quote from the GitHub sensitive data howto:
Be warned that force-pushing does not erase commits on the remote repo, it simply introduces new ones and moves the branch pointer to point to them. If you are worried about users accessing the bad commits directly via SHA1, you will have to delete the repo and recreate it.
It sucks that you need to recreate the GitHub repo, but the good news is that recreating the repo is actually pretty easy. The pain is that you also have to recreate the data in issues and the wiki, which I'll go into below.
What I recommend is creating a new repo on GitHub and then switching it out with your old repo when you are ready. This can be done by renaming the old repo to something like "repo name old" and then changing the name of the newly created repo to "repo name". Make sure when you create the new repo to uncheck "initialize with README", otherwise you're not going to be dealing with a clean slate.
If you completed the last step you should have your local repo cleaned and ready to go. The remotes now need to be changed to match the new GitHub repo location. I do this by editing the .git/config file directly, though I am sure someone is going to tell me that is not the right way to do it.
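Editing .git/config by hand works, but git remote set-url makes the same change from the command line. A small sketch in a scratch repository; both URLs are placeholders, not real repositories:

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
git init -q "$demo"

# Placeholder URLs: the old repo location, then the newly created one.
git -C "$demo" remote add origin https://github.com/example-user/repo-name-old.git
git -C "$demo" remote set-url origin https://github.com/example-user/repo-name.git

new_url=$(git -C "$demo" remote get-url origin)
echo "origin now points at: $new_url"
```

In a real repository you would run only the set-url line, from inside the repo, with the URL of the new GitHub repo.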
Before doing the push, make sure you have all the branches and tags you want to push up in your local repo. Once you are ready, push all branches and tags using the following:
git push --all
git push --tags
Now you should have a remote repo that matches your trimmed local repo. Double-check that all the data made it over, just in case.
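One possible way to do that double-check is to compare the branches and tags your local repo has with what the remote reports. The sketch below uses a local bare repository as a stand-in for the new GitHub repo; the branch name, tag name, and demo identity are made up for illustration.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
git init -q --bare "$demo/remote.git"      # stand-in for the new GitHub repo
git init -q "$demo/local"
cd "$demo/local"
git config user.email "demo@example.com"   # placeholder identity
git config user.name "Demo"

echo "hello" > file.txt
git add file.txt && git commit -q -m "initial commit"
git branch feature
git tag v1.0
git remote add origin "$demo/remote.git"
git push -q --all origin
git push -q --tags origin

# Branches and tags as the local repo sees them...
local_refs=$(git for-each-ref --format='%(objectname) %(refname)' refs/heads refs/tags | sort)
# ...and as the remote reports them (--refs leaves out peeled tag entries).
remote_refs=$(git ls-remote --refs --heads --tags origin | tr '\t' ' ' | sort)

if [ "$local_refs" = "$remote_refs" ]; then
  echo "remote matches local"
else
  echo "mismatch between local and remote refs"
fi
```

Against a real remote, only the two ref listings and the comparison are needed; any ref that shows up on one side only did not make it across.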
Now, if you don't have to worry about issues or the wiki, you are done. If you do, read on.
Moving over wikis
The GitHub wiki is just another repo associated with your main repo. So to get started, clone your old wiki repo somewhere. The next part is kind of tricky: as far as I can tell, you need to click on the wiki tab of your new repo in order to create the wiki, but that seeds the newly created wiki with an initial file. So what I did, and I am not sure if there is a better way, was change the remote to the newly created wiki repo and push to the new location using
git push --all --force
The force is needed here because otherwise git will complain about the tip of the current branch not matching. I think this may leave the initial page in a detached state in the git repo, but the effect of that on the size of the repo should be negligible.
Moving over issues
There is advice on this given by this answer. But looking at the script linked in the answer, it looks fairly incomplete: there is a TODO for comment importing, and I couldn't tell whether it would bring over the state of issues or not.
So given that I had a fairly small open issues queue and didn't mind losing closed issues, I elected to bring things over by hand. Note that it is impossible to do this with proper attribution to other people on comments. So I think for a larger, more established project you would need to write a more robust script to bring everything over, but that wasn't needed for my particular case.
Answered by Todd A. Jacobs
Assuming that you've already removed the blob from your history with git-filter-branch(1) and friends, Git often keeps things around in the reflogs, packfiles, and loose repository objects. The incantation to remove these unreferenced objects is:
git prune --expire=now
git reflog expire --expire-unreachable=now --rewrite --all
git repack -a -d
git prune-packed
If you've done this and you still have a bigger repository than you think you should, then you still have references to your blob somewhere in the repository. You'll have to go back to step one and remove them. This may help:
# List all blobs by size in bytes.
git rev-list --all --objects |
awk '{print $1}' |
git cat-file --batch-check |
fgrep blob |
sort -k3nr
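A listing of object IDs and sizes does not say which file each blob belongs to. As a possible variation (not part of the original answer), git cat-file --batch-check accepts a format string, and %(rest) echoes back the path that git rev-list printed after each object ID. The sketch below demonstrates it in a scratch repository with made-up file names:

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"   # placeholder identity
git config user.name "Demo"

echo "small" > small.txt
head -c 100000 /dev/urandom > big.bin       # hypothetical large binary
git add . && git commit -q -m "two files"

# Largest blobs first, one per line: "<size in bytes> <path>".
largest=$(git rev-list --all --objects |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { sub(/^blob /, ""); print }' |
  sort -k1,1nr |
  head -n 10)
echo "$largest"
```

In a real repository only the pipeline is needed. The path shown for a blob is the one under which rev-list first reached it, so a file that was renamed may appear under an older name.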
Answered by J-16 SDiZ
The script in "script to find large files in git" checks the .pack file, that is, the raw object repository. The second script shows that the large object is no longer referenced. If you really want to clean that up, you may do a gc and repack:
git gc --aggressive --prune=now
git repack -A -d
If this still doesn't help, you may have a reference to the object in a remote branch. You may try the following:
- Find out which commit has this object (see "Which commit has this blob?") and run
git branch -a --contains <commit-ish>
- Remove the remote branch using
git branch -r -D branchname
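For the first step, one option in Git 2.16 or newer is git log --find-object, which lists commits that add or remove a given blob. This is a sketch of that approach, which may differ from what the linked question suggests; the file, branch, and identity below are made up for the demo.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"   # placeholder identity
git config user.name "Demo"

echo "payload" > big.bin                    # stands in for the large binary
git add big.bin && git commit -q -m "add big.bin"
git branch feature
blob=$(git rev-parse HEAD:big.bin)          # object ID of the file's content

# Commits reachable from any ref that add or remove this blob.
commit=$(git log --all --format=%H --find-object="$blob")
# Branches, local and remote-tracking, whose history contains that commit.
branches=$(git branch -a --contains "$commit")

echo "blob $blob was introduced by commit $commit"
echo "$branches"
```

In a real repository the log command can print several commits, for example one that adds the file and another that deletes it, so each one would need to be checked with git branch -a --contains.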
Update -- What is a "remote branch"?
A remote branch is what git fetches things into when you run git fetch or git pull (git pull is the same as git fetch refspec followed by git merge remote-branch). If you cloned from a remote repository, deleting the remote branch should have no ill effect: you can always fetch or pull from the remote again using something like
git fetch origin refs/heads/master:refs/remotes/origin/master
(this pulls the master branch from the remote into the remote branch remotes/origin/master). If this branch was created by you, deleting it should be okay too, because you should have a "normal" (tracking) branch for it. But you should double-check this.
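The delete-and-refetch cycle can be tried safely in a scratch setup first. In this sketch a local repository plays the role of the remote; all paths and the demo identity are made up, and the branch name is read from the repository because it may be master or main depending on Git's configuration.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
git init -q "$demo/upstream"               # stand-in for the remote repository
cd "$demo/upstream"
git config user.email "demo@example.com"   # placeholder identity
git config user.name "Demo"
echo "hi" > file.txt
git add file.txt && git commit -q -m "initial commit"
branch=$(git symbolic-ref --short HEAD)    # master or main, depending on configuration

git clone -q "$demo/upstream" "$demo/clone"
cd "$demo/clone"

# Delete the remote-tracking branch; the branch on the remote itself is untouched.
git branch -r -D "origin/$branch" >/dev/null 2>&1

# Recreate it by fetching the branch from the remote again.
git fetch -q origin "refs/heads/$branch:refs/remotes/origin/$branch"
restored=$(git rev-parse "refs/remotes/origin/$branch")
echo "origin/$branch restored at $restored"
```

The point of the demo is that the remote-tracking branch is only a local copy of the remote's state, so removing it loses nothing that a fetch cannot bring back.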
Answered by Josh Habdas
Can someone explain what I am doing wrong or suggest an alternative method?
Have you tried applying DMAIC? Define, Measure, Analyze, Improve, Control.
D - My repo is still large after deleting a file from git history.
M - Determine the size of a fresh repo using git init to establish a baseline.
A - Identify, validate and select the root cause. Experiment with git-repo-analysis.
I - Identify, test and implement a solution. Maybe BFG Repo-Cleaner will help. Maybe it won't.
C - Sustain the gains. Look at something like Git LFS or another appropriate control method.
I also want to be able to fix the remote repo on Github.
This will depend on how you choose to resolve the problem. For example, when you use BFG to trim files from history, it rewrites history and changes commit SHAs, so there's going to be some give and take here depending on your specific needs and desired outcomes.