Git: Which commit has this blob?

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/223678/

Time: 2020-09-10 05:53:35  Source: igfitidea

Which commit has this blob?

git, version-control

Asked by Readonly

Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?

Accepted answer by Aristotle Pagaltzis

Both of the following scripts take the blob's SHA1 as the first argument, and after it, optionally, any arguments that git log will understand. E.g. --all to search in all branches instead of just the current one, or -g to search in the reflog, or whatever else you fancy.

Here it is as a shell script – short and sweet, but slow:

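#!/bin/sh
# Usage: git-find-blob <blob> [<git-log arguments ...>]
obj_name="$1"
shift
# tformat: ends every line with a newline, so `read` also sees the last commit
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done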

And an optimised version in Perl, still quite short but much faster:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

my $obj_name;

sub check_tree {
    my ( $tree ) = @_;
    my @subtree;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)/
                or die "unexpected git-ls-tree output";
            return 1 if $2 eq $obj_name;
            push @subtree, $2 if $1 eq 'tree';
        }
    }

    check_tree( $_ ) && return 1 for @subtree;

    return;
}

memoize 'check_tree';

die "usage: git-find-blob <blob> [<git-log arguments ...>]\n"
    if not @ARGV;

my $obj_short = shift @ARGV;
$obj_name = do {
    local $ENV{'OBJ_NAME'} = $obj_short;
     `git rev-parse --verify \$OBJ_NAME`;
} or die "Couldn't parse $obj_short: $!\n";
chomp $obj_name;

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
    or die "Couldn't open pipe to git-log: $!\n";

while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $subject ) = split " ", $_, 3;
    print "$commit $subject\n" if check_tree( $tree );
}

Answer by aragaer

Unfortunately scripts were a bit slow for me, so I had to optimize a bit. Luckily I had not only the hash but also the path of a file.

git log --all --pretty=format:%H -- <path> | xargs -n1 -I% sh -c "git ls-tree % -- <path> | grep -q <hash> && echo %"
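
A minimal sketch of the same idea as a reusable shell function, assuming you know both the full blob ID and its path and that you run it from the top level of the repository. The function name commits_with_blob_at_path is hypothetical. Instead of grepping git ls-tree output, it compares the object ID that git rev-parse resolves for <commit>:<path> against the blob hash. Like the one-liner above, it only examines commits that git log reports for the path, that is, commits that changed it.

```sh
# Hypothetical helper: print every commit (on any ref) that touched <path>
# and in which <path> is exactly the given blob.
# Usage: commits_with_blob_at_path <full-blob-sha> <path>
commits_with_blob_at_path() {
    blob=$1
    path=$2
    # --format=%H uses terminator semantics, so the last line ends in a newline
    git log --all --format=%H -- "$path" |
    while read -r commit; do
        # empty when the path does not exist in this commit (e.g. it was deleted)
        found=$(git rev-parse --verify --quiet "$commit:$path" || true)
        if [ "$found" = "$blob" ]; then
            echo "$commit"
        fi
    done
}
```

For example, commits_with_blob_at_path "$(git rev-parse HEAD:Makefile)" Makefile lists the commits that changed Makefile into its current content.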

Answer by VonC

Given the hash of a blob, is there a way to get a list of commits that have this blob in their tree?

With Git 2.16 (Q1 2018), git describe would be a good solution, since it was taught to dig trees deeper to find a <commit-ish>:<path> that refers to a given blob object.

See commit 644eb60, commit 4dbc59a, commit cdaed0c, commit c87b653, commit ce5b6f9 (16 Nov 2017), and commit 91904f5, commit 2deda00 (02 Nov 2017) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit 556de1a, 28 Dec 2017)

builtin/describe.c: describe a blob

Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or this very SO question "Which commit has this blob?")

When describing commits, we try to anchor them to tags or refs, as these are conceptually on a higher level than the commit. And if there is no ref or tag that matches exactly, we're out of luck.
So we employ a heuristic to make up a name for the commit. These names are ambiguous, there might be different tags or refs to anchor to, and there might be different path in the DAG to travel to arrive at the commit precisely.

When describing a blob, we want to describe the blob from a higher layer as well, which is a tuple of (commit, deep/path) as the tree objects involved are rather uninteresting.
The same blob can be referenced by multiple commits, so how do we decide which commit to use?

This patch implements a rather naive approach on this: As there are no back pointers from blobs to commits in which the blob occurs, we'll start walking from any tips available, listing the blobs in-order of the commit and once we found the blob, we'll take the first commit that listed the blob.

For example:

git describe --tags v0.99:Makefile
conversion-901-g7672db20c2:Makefile

tells us the Makefile as it was in v0.99 was introduced in commit 7672db2.

The walking is performed in reverse order to show the introduction of a blob rather than its last occurrence.

That means the git describe man page adds to the purposes of this command:

Instead of simply describing a commit using the most recent tag reachable from it, git describe will actually give an object a human readable name based on an available ref when used as git describe <blob>.

If the given object refers to a blob, it will be described as <commit-ish>:<path>, such that the blob can be found at <path> in the <commit-ish>, which itself describes the first commit in which this blob occurs in a reverse revision walk from HEAD.

But:

BUGS

Tree objects as well as tag objects not pointing at commits, cannot be described.
When describing blobs, the lightweight tags pointing at blobs are ignored, but the blob is still described as <commit-ish>:<path> despite the lightweight tag being favorable.
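
As a brief illustration of the blob form described above, the sketch below wraps git describe in a hypothetical helper named describe_blob. It assumes Git 2.16 or later. The --always option is added so that a repository without tags yields an abbreviated commit ID instead of an error. The walk starts from HEAD, so a blob that is reachable only from other branches will not be found this way.

```sh
# Hypothetical helper: name a blob as <commit-ish>:<path> (needs Git >= 2.16).
# Usage: describe_blob <blob-sha>
describe_blob() {
    # --always falls back to an abbreviated commit ID when no tag is reachable
    git describe --always "$1"
}
```

Calling describe_blob with the ID of a blob reachable from HEAD prints a single line of the form <commit-ish>:<path>.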

Answer by Greg Hewgill

I thought this would be a generally useful thing to have, so I wrote up a little perl script to do it:

#!/usr/bin/perl -w

use strict;

my @commits;
my %trees;
my $blob;

sub blob_in_tree {
    my $tree = $_[0];
    if (defined $trees{$tree}) {
        return $trees{$tree};
    }
    my $r = 0;
    open(my $f, "git cat-file -p $tree|") or die $!;
    while (<$f>) {
        if (/^\d+ blob (\w+)/ && $1 eq $blob) {
            $r = 1;
        } elsif (/^\d+ tree (\w+)/) {
            $r = blob_in_tree($1);
        }
        last if $r;
    }
    close($f);
    $trees{$tree} = $r;
    return $r;
}

sub handle_commit {
    my $commit = $_[0];
    open(my $f, "git cat-file commit $commit|") or die $!;
    my $tree = <$f>;
    die unless $tree =~ /^tree (\w+)$/;
    if (blob_in_tree($1)) {
        print "$commit\n";
    }
    while (1) {
        my $parent = <$f>;
        last unless $parent =~ /^parent (\w+)$/;
        push @commits, $1;
    }
    close($f);
}

if (!@ARGV) {
    print STDERR "Usage: git-find-blob blob [head ...]\n";
    exit 1;
}

$blob = shift @ARGV;    # remove the blob so it is not treated as a head below
if (@ARGV) {
    foreach (@ARGV) {
        handle_commit($_);
    }
} else {
    handle_commit("HEAD");
}
while (@commits) {
    handle_commit(pop @commits);
}

I'll put this up on github when I get home this evening.

Update: It looks like somebody already did this. That one uses the same general idea but the details are different and the implementation is much shorter. I don't know which would be faster but performance is probably not a concern here!

Update 2: For what it's worth, my implementation is orders of magnitude faster, especially for a large repository. That git ls-tree -r really hurts.

Update 3: I should note that my performance comments above apply to the implementation I linked above in the first Update. Aristotle's implementation performs comparably to mine. More details in the comments for those who are curious.

Answer by Mario

While the original question does not ask for it, I think it is useful to also check the staging area to see if a blob is referenced. I modified the original bash script to do this and found what was referencing a corrupt blob in my repository:

#!/bin/sh
obj_name="$1"
shift
git ls-files --stage \
| if grep -q "$obj_name"; then
    echo Found in staging area. Run git ls-files --stage to see.
fi

git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done
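
A small sketch of the index check on its own, assuming it is useful to see which staged paths reference the blob. The helper name blob_in_index is hypothetical. It compares the object-ID column of git ls-files --stage exactly, so a hash that happens to appear inside a file name is not reported.

```sh
# Hypothetical helper: list index entries whose object ID is the given blob.
# `git ls-files --stage` prints: <mode> <object> <stage><TAB><path>
# Usage: blob_in_index <full-blob-sha>
blob_in_index() {
    git ls-files --stage |
    awk -v blob="$1" -F '\t' '{
        split($1, meta, " ")            # meta[1]=mode, meta[2]=object, meta[3]=stage
        if (meta[2] == blob) print $2   # $2 is the path after the tab
    }'
}
```

The function prints nothing when the blob is not staged anywhere.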

Answer by VonC

In addition to git describe, which I mention in my previous answer, git log and git diff now also benefit from the "--find-object=<object-id>" option, which limits the findings to changes that involve the named object.
That is in Git 2.16.x/2.17 (Q1 2018).

See commit 4d8c51a, commit 5e50525, commit 15af58c, commit cf63051, commit c1ddc46, commit 929ed70 (04 Jan 2018) by Stefan Beller (stefanbeller).
(Merged by Junio C Hamano -- gitster -- in commit c0d75f0, 23 Jan 2018)

diffcore: add a pickaxe option to find a specific blob

Sometimes users are given a hash of an object and they want to identify it further (ex.: Use verify-pack to find the largest blobs, but what are these? or this Stack Overflow question "Which commit has this blob?")

One might be tempted to extend git-describe to also work with blobs, such that git describe <blob-id> gives a description as '<commit-ish>:<path>'.
This was implemented here; as seen by the sheer number of responses (>110), it turns out this is tricky to get right.
The hard part to get right is picking the correct 'commit-ish' as that could be the commit that (re-)introduced the blob or the commit that removed the blob; the blob could exist in different branches.

Junio hinted at a different approach of solving this problem, which this patch implements.
Teach the diff machinery another flag for restricting the information to what is shown.
For example:

$ ./git log --oneline --find-object=v2.0.0:Makefile
  b2feb64 Revert the whole "ask curl-config" topic for now
  47fbfde i18n: only extract comments marked with "TRANSLATORS:"

we observe that the Makefile as shipped with 2.0 appeared in v1.9.2-471-g47fbfded53 and in v2.0.0-rc1-5-gb2feb6430b.
The reason why these commits both occur prior to v2.0.0 are evil merges that are not found using this new mechanism.
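
To apply this to the original question, here is a hedged sketch: the hypothetical helper commits_changing_blob passes a blob ID to git log --find-object (Git 2.16 or later) and searches all refs. Note that this lists the commits whose diff adds or removes the blob, not every commit whose tree contains it; commits in between that leave the file untouched are not shown.

```sh
# Hypothetical helper: commits on any ref whose diff introduces or removes the blob.
# Usage: commits_changing_blob <blob-sha> [<git-log arguments ...>]
commits_changing_blob() {
    blob=$1
    shift
    git log --all --format='%H %s' --find-object="$blob" "$@"
}
```

Extra arguments are handed to git log unchanged, so something like commits_changing_blob <blob-sha> --reverse shows the oldest match first.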

Answer by cmyers

So... I needed to find all files over a given limit in a repo over 8GB in size, with over 108,000 revisions. I adapted Aristotle's perl script along with a ruby script I wrote to reach this complete solution.

First, run git gc to ensure all objects are in packfiles; we don't scan objects that are not in pack files.

Next, run this script to locate all blobs over CUTOFF_SIZE bytes. Capture the output to a file like "large-blobs.log".

#!/usr/bin/env ruby

require 'log4r'

# The output of git verify-pack -v is:
# SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
#
#
GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack')

# 10MB cutoff
CUTOFF_SIZE=1024*1024*10
#CUTOFF_SIZE=1024

begin

  include Log4r
  log = Logger.new 'git-find-large-objects'
  log.level = INFO
  log.outputters = Outputter.stdout

  git_dir = %x[ git rev-parse --show-toplevel ].chomp

  if git_dir.empty?
    log.fatal "ERROR: must be run in a git repository"
    exit 1
  end

  log.debug "Git Dir: '#{git_dir}'"

  pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)]
  log.debug "Git Packs: #{pack_files.to_s}"

  # For details on this IO, see http://stackoverflow.com/questions/1154846/continuously-read-from-stdout-of-external-process-in-ruby
  #
  # Short version is, git verify-pack flushes buffers only on line endings, so
  # this works, if it didn't, then we could get partial lines and be sad.

  types = {
    :blob => 1,
    :tree => 1,
    :commit => 1,
  }


  total_count = 0
  counted_objects = 0
  large_objects = []

  IO.popen("git verify-pack -v -- #{pack_files.join(" ")}") do |pipe|
    pipe.each do |line|
      # The output of git verify-pack -v is:
      # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
      data = line.chomp.split(' ')
      # types are blob, tree, or commit
      # we ignore other lines by looking for that
      next unless types[data[1].to_sym] == 1
      log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}"
      hash = {
        :sha1 => data[0],
        :type => data[1],
        :size => data[2].to_i,
      }
      total_count += hash[:size]
      counted_objects += 1
      if hash[:size] > CUTOFF_SIZE
        large_objects.push hash
      end
    end
  end

  log.info "Input complete"

  log.info "Counted #{counted_objects} totalling #{total_count} bytes."

  log.info "Sorting"

  large_objects.sort! { |a,b| b[:size] <=> a[:size] }

  log.info "Sorting complete"

  large_objects.each do |obj|
    log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}"
  end

  exit 0
end
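
For comparison, a commonly used shell alternative to the Ruby script is sketched below. It swaps in a different technique: instead of parsing git verify-pack -v, it feeds git rev-list --objects --all into git cat-file --batch-check, which also reports a path for each blob and does not require the objects to be packed. The function name large_blobs and its argument are assumptions for illustration, and its output is not in the log format that the following steps expect from the Ruby script.

```sh
# Hypothetical helper: list blobs larger than <cutoff> bytes, biggest first.
# Output columns: <size> <blob-sha> <path>
# Usage: large_blobs <cutoff-bytes>
large_blobs() {
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v cutoff="$1" '$1 == "blob" && $3 + 0 > cutoff + 0 {
        size = $3; sha = $2
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")   # keep only the path (may contain spaces)
        print size, sha, $0
    }' |
    sort -k1,1nr
}
```

For the 10MB cutoff used above, the call would be large_blobs 10485760.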

Next, edit the file to remove any blobs you don't want and the INPUT_THREAD bits at the top. Once you have only lines for the sha1s you want to find, run the following script like this:

cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log

Where the git-find-blob script is below.

#!/usr/bin/perl

# taken from: http://stackoverflow.com/questions/223678/which-commit-has-this-blob
# and modified by Carl Myers to scan multiple blobs at once
# Also, modified to keep the discovered filenames
# vi: ft=perl

use 5.008;
use strict;
use Memoize;
use Data::Dumper;


my $BLOBS = {};

MAIN: {

    memoize 'check_tree';

    die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n"
        if not @ARGV;


    while ( @ARGV && $ARGV[0] ne '--' ) {
        my $arg = $ARGV[0];
        #print "Processing argument $arg\n";
        open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n";
        my $obj_name = <$rev_parse>;
        close $rev_parse or die "Couldn't expand passed blob.\n";
        chomp $obj_name;
        #$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n";
        print "($arg expands to $obj_name)\n";
        $BLOBS->{$obj_name} = $arg;
        shift @ARGV;
    }
    shift @ARGV; # drop the -- if present

    #print "BLOBS: " . Dumper($BLOBS) . "\n";

    foreach my $blob ( keys %{$BLOBS} ) {
        #print "Printing results for blob $blob:\n";

        open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s'
            or die "Couldn't open pipe to git-log: $!\n";

        while ( <$log> ) {
            chomp;
            my ( $tree, $commit, $subject ) = split " ", $_, 3;
            #print "Checking tree $tree\n";
            my $results = check_tree( $tree );

            #print "RESULTS: " . Dumper($results);
            if (%{$results}) {
                print "$commit $subject\n";
                foreach my $blob ( keys %{$results} ) {
                    print "\t" . (join ", ", @{$results->{$blob}}) . "\n";
                }
            }
        }
    }

}


sub check_tree {
    my ( $tree ) = @_;
    #print "Calculating hits for tree $tree\n";

    my @subtree;

    # results = { BLOB => [ FILENAME1 ] }
    my $results = {};
    {
        open my $ls_tree, '-|', git => 'ls-tree' => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        # example git ls-tree output:
        # 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424    filaname.txt
        while ( <$ls_tree> ) {
            /\A[0-7]{6} (\S+) (\S+)\s+(.*)/
                or die "unexpected git-ls-tree output";
            #print "Scanning line '$_' tree $2 file $3\n";
            foreach my $blob ( keys %{$BLOBS} ) {
                if ( $2 eq $blob ) {
                    print "Found $blob in $tree:$3\n";
                    push @{$results->{$blob}}, $3;
                }
            }
            push @subtree, [$2, $3] if $1 eq 'tree';
        }
    }

    foreach my $st ( @subtree ) {
        # $st->[0] is tree, $st->[1] is dirname
        my $st_result = check_tree( $st->[0] );
        foreach my $blob ( keys %{$st_result} ) {
            foreach my $filename ( @{$st_result->{$blob}} ) {
                my $path = $st->[1] . '/' . $filename;
                #print "Generating subdir path $path\n";
                push @{$results->{$blob}}, $path;
            }
        }
    }

    #print "Returning results for tree $tree: " . Dumper($results) . "\n\n";
    return $results;
}

The output will look like this:

<hash prefix> <oneline log message>
    path/to/file.txt
    path/to/file2.txt
    ...
<hash prefix2> <oneline log msg...>

And so on. Every commit which contains a large file in its tree will be listed. If you grep out the lines that start with a tab, and uniq that, you will have a list of all paths you can filter-branch to remove, or you can do something more complicated.
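
The grep-and-uniq step can be sketched as follows, assuming the output was saved to a file such as large-file-paths.log and that path lines start with a literal tab. The helper name unique_paths is hypothetical. The script joins several paths for the same blob with ", " on one line, so such lines may need further splitting.

```sh
# Hypothetical helper: print the distinct tab-indented path lines from a log file.
# Usage: unique_paths large-file-paths.log
unique_paths() {
    tab=$(printf '\t')
    # keep only lines that start with a tab, strip that tab, then de-duplicate
    sed -n "s/^${tab}//p" "$1" | sort -u
}
```

sort -u is used rather than uniq alone because uniq only collapses adjacent duplicates.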

Let me reiterate: this process ran successfully on a 10GB repo with 108,000 commits. It took much longer than I predicted when running on a large number of blobs, though: over 10 hours. I will have to see if the memoize bit is working...
