Why is my git repository so big?

Question

Asked by Ian Kelling

145M = .git/objects/pack/

I wrote a script that walks backwards from the tip of each branch and adds up the size of the diff between each commit and the one before it. I get 129MB, and that is without compression and without accounting for identical files across branches or history shared among branches.

Git takes all those things into account, so I would expect a much, much smaller repository. So why is .git so big?

I've done:

git fsck --full
git gc --prune=today --aggressive
git repack

To answer the question about how many files and commits there are: I have 19 branches with about 40 files in each, and 287 commits, counted using:

git log --oneline --all|wc -l

It should not take tens of megabytes to store this much information.

Answer 1

Accepted answer by pgs

I recently pulled the wrong remote repository into the local one (git remote add ... and git remote update). After deleting the unwanted remote refs, branches and tags I still had 1.4GB (!) of wasted space in my repository. I was only able to get rid of this by cloning it with git clone file:///path/to/repository. Note that the file:// makes a world of difference when cloning a local repository: only the referenced objects are copied across, not the whole directory structure.

Edit: Here's Ian's one-liner for recreating all branches in the new repo:

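d1=/path/to/original-repo   # original repo
d2=/path/to/new-repo        # new repo (must already exist)
cd "$d1"
# "git branch" prefixes each name with two characters ("* " or "  "); cut strips them
for b in $(git branch | cut -c 3-)
do
    git checkout "$b"
    x=$(git rev-parse HEAD)       # commit at the tip of this branch
    cd "$d2"
    git checkout -b "$b" "$x"     # recreate the branch at the same commit
    cd "$d1"
done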

Answer 2

Answered by Vi.

Some scripts I use:

git-fatfiles

git rev-list --all --objects | \
    sed -n $(git rev-list --objects --all | \
    cut -f1 -d' ' | \
    git cat-file --batch-check | \
    grep blob | \
    sort -n -k 3 | \
    tail -n40 | \
    while read hash type size; do 
         echo -n "-e s/$hash/$size/p ";
    done) | \
    sort -n -k1

...
89076 images/screenshots/properties.png
103472 images/screenshots/signals.png
9434202 video/parasite-intro.avi

If you want more lines, see also the Perl version in a neighbouring answer: https://stackoverflow.com/a/45366030/266720

git-eradicate (for `video/parasite-intro.avi`):

git filter-branch -f  --index-filter \
    'git rm --force --cached --ignore-unmatch video/parasite-intro.avi' \
     -- --all
rm -Rf .git/refs/original && \
    git reflog expire --expire=now --all && \
    git gc --aggressive && \
    git prune

Note: the second script is designed to remove info from Git completely (including all info from reflogs). Use with caution.

Answer 3

Answered by CB Bailey

git gc already does a git repack, so there is no sense in manually repacking unless you are going to pass some special options to it.

The first step is to see whether the majority of space is (as would normally be the case) your object database.

git count-objects -v

This should give a report of how many unpacked objects there are in your repository, how much space they take up, how many pack files you have and how much space they take up.
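
As a rough sketch of how that report can be boiled down, the helper below reads git count-objects -v output on standard input and adds the loose size to the packed size. The function name object_db_summary is made up for this illustration; it relies only on the size and size-pack fields of the report, which git reports in KiB.

```shell
# Hypothetical helper: summarise `git count-objects -v` output read from stdin.
# "size" is the space used by loose objects, "size-pack" the space used by packs.
object_db_summary() {
    awk -F': ' '
        $1 == "size"      { loose = $2 }
        $1 == "size-pack" { packed = $2 }
        END { printf "loose=%d KiB packed=%d KiB total=%d KiB\n", loose, packed, loose + packed }
    '
}
```

It would be used as git count-objects -v | object_db_summary from inside the repository.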

Ideally, after a repack, you would have no unpacked objects and one pack file, but it's perfectly normal to have some objects which aren't directly referenced by current branches still present and unpacked.

If you have a single large pack and you want to know what is taking up the space then you can list the objects which make up the pack along with how they are stored.

git verify-pack -v .git/objects/pack/pack-*.idx

Note that verify-pack takes an index file and not the pack file itself. This gives a report of every object in the pack, its true size and its packed size, as well as information about whether it has been 'deltified' and, if so, the origin of the delta chain.

To see if there are any unusually large objects in your repository you can sort the output numerically on the third or fourth column (e.g. | sort -k3n).
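
One possible way to wrap that sort is sketched below. The function name largest_blobs is invented for this example; it assumes the usual verify-pack -v line layout (hash, type, size, size in pack, offset) and simply ignores the summary lines at the end of the report.

```shell
# Hypothetical helper: read `git verify-pack -v` output on stdin and print the
# N largest blobs (default 10) as "size packed-size hash", smallest first.
largest_blobs() {
    awk '$2 == "blob" { print $3, $4, $1 }' | sort -k1,1n | tail -n "${1:-10}"
}
```

For example, git verify-pack -v .git/objects/pack/pack-*.idx | largest_blobs 5 would list the five biggest blobs by uncompressed size; sorting on the second output column instead would rank them by packed size.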

From this output you will be able to see the contents of any object using the git show command, although it is not possible to see exactly where in the commit history of the repository the object is referenced. If you need to do this, try something from this question.

Answer 4

Answered by John Gietzen

Just FYI, the biggest reason why you may end up with unwanted objects being kept around is that git maintains a reflog.

The reflog is there to save your butt when you accidentally delete your master branch or somehow otherwise catastrophically damage your repository.

The easiest way to fix this is to truncate your reflogs before compressing (just make sure that you never want to go back to any of the commits in the reflog).

git reflog expire --expire=now --all
git gc --prune=now --aggressive
git repack

Unlike git gc --prune=today on its own, this expires the entire reflog immediately, so objects that only the reflog was keeping alive can be pruned.

Answer 5

Answered by nachoparker

If you want to find what files are taking up space in your git repository, run

git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5

Then take the blob reference that uses the most space (the last line) and look up the filename behind it:

git rev-list --objects --all | grep <reference>
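
A slightly stricter variant of that lookup is sketched here, in case a plain grep also matches the hash somewhere inside a path. The name paths_for_object is an assumption made for this example; it matches only on the hash column and keeps paths that contain spaces intact.

```shell
# Hypothetical helper: read `git rev-list --objects --all` output on stdin and
# print the path recorded for every object whose hash starts with the prefix $1.
paths_for_object() {
    awk -v prefix="$1" '
        NF > 1 && index($1, prefix) == 1 {
            sub(/^[^ ]+ /, "")   # drop the hash, keep the path even if it contains spaces
            print
        }
    '
}
```

Usage would be git rev-list --objects --all | paths_for_object <reference>.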

This might even be a file that you removed with git rm, but git remembers it because there are still references to it, such as tags, remotes and reflog.

Once you know what file you want to get rid of, I recommend using git forget-blob:

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

It is easy to use, just do

git forget-blob file-to-forget

This will remove every reference from git, remove the blob from every commit in history, and run garbage collection to free up the space.

Answer 6

Answered by piojo

The git-fatfiles script from Vi's answer is lovely if you want to see the size of all your blobs, but it's so slow as to be unusable. I removed the 40-line output limit, and it tried to use all my computer's RAM instead of finishing. So I rewrote it: this is thousands of times faster, has added features (optional), and some strange bug was removed--the old version would give inaccurate counts if you sum the output to see the total space used by a file.

#!/usr/bin/perl
use warnings;
use strict;
use IPC::Open2;
use v5.14;

# Try to get the "format_bytes" function:
my $canFormat = eval {
    require Number::Bytes::Human;
    Number::Bytes::Human->import('format_bytes');
    1;
};
my $format_bytes;
if ($canFormat) {
    $format_bytes = \&format_bytes;
}
else {
    $format_bytes = sub { return shift; };
}

# parse arguments:
my ($directories, $sum);
{
    my $arg = $ARGV[0] // "";
    if ($arg eq "--sum" || $arg eq "-s") {
        $sum = 1;
    }
    elsif ($arg eq "--directories" || $arg eq "-d") {
        $directories = 1;
        $sum = 1;
    }
    elsif ($arg) {
        print "Usage: $0 [ --sum, -s | --directories, -d ]\n";
        exit 1;
    } 
}

# the format is [hash, file]
my %revList = map { (split(' ', $_))[0 => 1]; } qx(git rev-list --all --objects);
my $pid = open2(my $childOut, my $childIn, "git cat-file --batch-check");

# The format is (hash => size)
my %hashSizes = map {
    print $childIn $_ . "\n";
    my @blobData = split(' ', <$childOut>);
    if ($blobData[1] eq 'blob') {
        # [hash, size]
        $blobData[0] => $blobData[2];
    }
    else {
        ();
    }
} keys %revList;
close($childIn);
waitpid($pid, 0);

# Need to filter because some aren't files--there are useless directories in this list.
# Format is name => size.
my %fileSizes =
    map { exists($hashSizes{$_}) ? ($revList{$_} => $hashSizes{$_}) : () } keys %revList;


my @sortedSizes;
if ($sum) {
    my %fileSizeSums;
    if ($directories) {
        while (my ($name, $size) = each %fileSizes) {
            # strip off the trailing part of the filename:
            $fileSizeSums{$name =~ s|/[^/]*$||r} += $size;
        }
    }
    else {
        while (my ($name, $size) = each %fileSizes) {
            $fileSizeSums{$name} += $size;
        }
    }

    @sortedSizes = map { [$_, $fileSizeSums{$_}] }
        sort { $fileSizeSums{$a} <=> $fileSizeSums{$b} } keys %fileSizeSums;
}
else {
    # Print the space taken by each file/blob, sorted by size
    @sortedSizes = map { [$_, $fileSizes{$_}] }
        sort { $fileSizes{$a} <=> $fileSizes{$b} } keys %fileSizes;

}

for my $fileSize (@sortedSizes) {
    printf "%s\t%s\n", $format_bytes->($fileSize->[1]), $fileSize->[0];
}

Name this git-fatfiles.pl and run it. To see the disk space used by all revisions of a file, use the --sum option. To see the same thing, but for files within each directory, use the --directories option. If you install the Number::Bytes::Human cpan module (run "cpan Number::Bytes::Human"), the sizes will be formatted: "21M /path/to/file.mp4".

Answer 7

Answered by CesarB

Are you sure you are counting just the .pack files and not the .idx files? They are in the same directory as the .pack files, but do not have any of the repository data (as the extension indicates, they are nothing more than indexes for the corresponding pack — in fact, if you know the correct command, you can easily recreate them from the pack file, and git itself does it when cloning, as only a pack file is transferred using the native git protocol).

As a representative sample, I took a look at my local clone of the linux-2.6 repository:

$ du -c *.pack
505888  total

$ du -c *.idx
34300   total

Which indicates an expansion of around 7% should be common.
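
The 7% figure can be checked from the two du totals in this answer (505888 for the pack files, 34300 for the index files). A minimal sketch of the arithmetic, with variable names chosen only for this example:

```shell
# Totals as reported by `du -c` in this answer (same units for both).
pack_total=505888
idx_total=34300

# awk handles the floating-point division that plain shell arithmetic cannot.
overhead=$(awk -v idx="$idx_total" -v pack="$pack_total" 'BEGIN { printf "%.1f", idx / pack * 100 }')
echo "index overhead: ${overhead}%"
```

This prints an overhead of 6.8%, which rounds to the 7% quoted above.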

There are also the files outside objects/; in my personal experience, index and gitk.cache tend to be the biggest of them (totaling 11M in my clone of the linux-2.6 repository).

Answer 8

Answered by Greg Hewgill

Other git objects stored in .git include trees, commits, and tags. Commits and tags are small, but trees can get big, particularly if you have a very large number of small files in your repository. How many files and how many commits do you have?

Answer 9

Answered by baudtack

Did you try using git repack?

Answer 10

Answered by v_abhi_v

Before running git filter-branch and git gc, you should review the tags that are present in your repo. Any real system that tags automatically for things like continuous integration and deployments will leave unwanted objects still referenced by those tags, so gc can't remove them and you will keep wondering why the repo is still so big.

The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repo. The new bare repo will have the cleaned-up tree.
