Find files in git repo over x megabytes, that don't exist in HEAD

Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow, original question: http://stackoverflow.com/questions/298314/

git

Asked by dbr

I have a Git repository I store random things in. Mostly random scripts, text files, websites I've designed and so on.

There are some large binary files I have deleted over time (generally 1-5MB), which are sitting around increasing the size of the repository, which I don't need in the revision history.

Basically I want to be able to do..

me@host:~$ [magic command or script]
aad29819a908cc1c05c3b1102862746ba29bafc0 : example/blah.psd : 3.8MB : 130 days old
6e73ca29c379b71b4ff8c6b6a5df9c7f0f1f5627 : another/big.file : 1.12MB : 214 days old

..then be able to go through each result, checking if it's no longer required, then removing it (probably using filter-branch)
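
For the removal step, a rough sketch of the usual git filter-branch invocation (the path is taken from the example output above; the flags are standard filter-branch options, and rewriting history like this is destructive, so try it on a clone first):

$ git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch example/blah.psd' \
    --prune-empty -- --all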

Accepted answer by Aristotle Pagaltzis

This is an adaptation of the git-find-blob script I posted previously:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp; 

sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}
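
Going by the usage line above, an invocation might look like this (assuming the script is saved as git-large-blob somewhere on your PATH; any extra arguments are passed straight through to git log):

$ git-large-blob 500k            # blobs of 500 KB or more in the current branch's history
$ git-large-blob 2m master       # blobs of 2 MB or more, restricted to master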

Answer by mislav

A more compact Ruby script:

#!/usr/bin/env ruby -w
head, treshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
treshold = (treshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    # entries are NUL-delimited because of the -z flag
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= treshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

Usage:

ruby big_file.rb [rev] [size in MB]

$ ruby big_file.rb master 0.3
3.8M  example/blah.psd  (aad2981: 4 months ago)
1.1M  another/big.file  (6e73ca2: 2 weeks ago)

Answer by SigTerm

Python script to do the same thing (based on this post):

#!/usr/bin/env python

import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if (len(sys.argv) <> 2):
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        # entries are NUL-delimited because of the -z flag
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            splitdata = file.split()
            commit = splitdata[2]
            if splitdata[3] == "-":
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if (size > maxSize):
                bigfiles.add("%10d %s %s" % (size, commit, path))

    bigfiles = sorted(bigfiles, reverse=True)

    for f in bigfiles:
        print f
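
A possible invocation, assuming the script above is saved as find_big.py (the name is hypothetical; the single argument is a byte threshold, and each output line is size, object SHA, and path):

$ python find_big.py 1048576    # list blobs over 1 MiB anywhere in history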

Answer by Roberto Tyley

You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1M  my-repo.git

Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive

The BFG is typically 10-50x faster than running git-filter-branch, and the options are tailored around these two common use-cases (see the sketch just after this list):

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data
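
A rough sketch of both use-cases, using the option names the BFG documents; my-repo.git and passwords.txt are placeholders:

$ java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git    # use-case 1: drop huge blobs from history
$ java -jar bfg.jar --replace-text passwords.txt my-repo.git      # use-case 2: scrub listed secrets from history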

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Answer by Sitaram Chamarty

Ouch... that first script (by Aristotle) is pretty slow. On the git.git repo, looking for files > 100k, it chews up the CPU for about 6 minutes.

It also appears to have several wrong SHAs printed -- often a SHA will be printed that has nothing to do with the filename mentioned in the next line.

Here's a faster version. The output format is different, but it is very fast, and it is also -- as far as I can tell -- correct.

The program is a bit longer, but a lot of it is verbiage.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs = qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs ) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained.  But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";
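
For reference, a possible invocation (the script name is hypothetical; the first argument is a size threshold in bytes, and any further arguments are refs, defaulting to HEAD):

$ perl git-large-files.pl 102400              # blobs of 100 KB or more, checked against HEAD
$ perl git-large-files.pl 102400 master next  # same threshold, checked on two branches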

Answer by Paul

Aristotle's script will show you what you want. You also need to know that deleted files will still take up space in the repo.

By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:

$ git reflog expire --expire=1.minute refs/heads/master
     # all deletions up to 1 minute ago available to be garbage-collected
$ git fsck --unreachable
     # lists all the blobs (file contents) that will be garbage-collected
$ git prune
$ git gc

A side comment: While I am a big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes. And it doesn't track changes in binaries, so any revision stores another full copy in the repo.

Of course this use of Git is a perfectly good way to get familiar with how it works.

Answer by Steven Penny

#!/usr/bin/env python
import os, sys

bigfiles = []
for revision in os.popen('git rev-list HEAD'):
    for f in os.popen('git ls-tree -zrl %s' % revision).read().split('\0'):
        if f:
            mode, type, commit, size, path = f.split(None, 4)
            if int(size) > int(sys.argv[1]):
                bigfiles.append((int(size), commit, path))

for f in sorted(set(bigfiles)):
    print f
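
Again a hypothetical invocation; the one positional argument is the size threshold in bytes:

$ python big_files.py 1000000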

Source

Answer by raphinesse

This bash "one-liner" displays all blob objects in the repository that are larger than 10 MiB and are not present in HEAD, sorted from smallest to largest.

It's very fast, easy to copy & paste and only requires standard GNU utilities.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk -v min_mb=10 '/^blob/ && $3 >= min_mb*2^20 {print substr($0,6)}' \
| grep -vFf <(git ls-tree -r HEAD | awk '{print $3}') \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

This will generate output like this:

2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

For more information, including an output format more suitable for further script processing, see my original answer on a similar question.

macOS users: since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.

Answer by Collin Anderson

#!/bin/bash
if [ "$#" != 1 ]
then
  echo 'git large.sh [size]'
  exit
fi

declare -A big_files
big_files=()
echo printing results

while read commit
do
  while read bits type sha size path
  do
    if [ "$size" -gt "$1" ]
    then
      big_files[$sha]="$sha $size $path"
    fi
  done < <(git ls-tree --abbrev -rl $commit)
done < <(git rev-list HEAD)

for file in "${big_files[@]}"
do
  read sha size path <<< "$file"
  if git ls-tree -r HEAD | grep -q $sha
  then
    echo $file
  fi
done

Answer by Caustic

A little late to the party, but git-fat has this functionality built in.

Just install it with pip and run git fat -a find 100000, where the number at the end is in bytes.
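
A minimal sketch of that workflow, assuming pip can see the git-fat package:

$ pip install git-fat
$ git fat -a find 100000    # list objects over 100000 bytes across all history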