Notice: this page is a translated copy of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original question on Stack Overflow: http://stackoverflow.com/questions/7348698/

Time: 2020-09-10 11:48:57  Source: igfitidea

Git - how to list ALL objects in the database

Tags: git, git-show, git-rev-list

Asked by kbro

Is there a better way of getting a raw list of SHA1s for ALL objects in a repository than doing ls .git/objects/??/\* and cat .git/objects/pack/*.idx | git show-index?

I know about git rev-list --all, but that only lists commit objects that are referenced by .git/refs, and I'm looking for everything, including unreferenced objects that are created by git-hash-object, git-mktree, etc.

Accepted answer by willkil

Edit: Aristotle posted an even better answer, which should be marked as correct.

Edit: the script contained a syntax error: a missing backslash at the end of the grep -v line.

Mark's answer worked for me, after a few modifications:

  • Used --git-dir instead of --show-cdup to support bare repos
  • Avoided error when there are no packs
  • Used perl because OS X Mountain Lion's BSD-style sed doesn't support -r

#!/bin/sh

set -e

cd "$(git rev-parse --git-dir)"

# Find all the objects that are in packs:

find objects/pack -name 'pack-*.idx' | while read -r p ; do
    git show-index < "$p" | cut -f 2 -d ' '
done

# And now find all loose objects
# (directory name + file name together form the full hash):

find objects/ \
    | grep -E '[0-9a-f]{38}' \
    | grep -v /pack/ \
    | perl -pe 's:^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}):$1$2:'

Answer by sehe

Try

 git rev-list --objects --all


Edit: Josh made a good point:

 git rev-list --objects -g --no-walk --all

lists the objects reachable from the reflogs.

To see all objects in unreachable commits as well:

 git rev-list --objects --no-walk \
      $(git fsck --unreachable |
        grep '^unreachable commit' |
        cut -d' ' -f3)


Putting it all together, to really get all objects in the output format of rev-list --objects, you need something like

{
    git rev-list --objects --all
    git rev-list --objects -g --no-walk --all
    git rev-list --objects --no-walk \
        $(git fsck --unreachable |
          grep '^unreachable commit' |
          cut -d' ' -f3)
} | sort | uniq

To sort the output in a slightly more useful way (by path for trees/blobs, commits first), append an additional | sort -k2, which groups all the different blobs (revisions) of the same path together.
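
A minimal sketch of the combined command wrapped in a function, assuming a POSIX shell. The name list_rev_objects is made up for illustration, and the guard around the third rev-list is an addition: without it, rev-list is called with no revisions at all whenever fsck reports no unreachable commits. Note that this approach only finds unreachable objects through commits, so a blob written by git hash-object that no commit refers to is not listed.

```shell
#!/bin/sh
# list_rev_objects is a hypothetical wrapper around the three commands above.
list_rev_objects() {
    {
        git rev-list --objects --all
        git rev-list --objects -g --no-walk --all
        # Collect unreachable commits first, so rev-list is never
        # invoked without any revision arguments.
        unreachable=$(git fsck --unreachable 2>/dev/null |
            grep '^unreachable commit' |
            cut -d' ' -f3)
        if [ -n "$unreachable" ]; then
            # Word splitting is intended here: one commit ID per argument.
            git rev-list --objects --no-walk $unreachable
        fi
    } | sort -u
}
```

Run list_rev_objects inside a repository; add | sort -k2 to group the entries by path.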

Answer by Erki der Loony

I don't know when this option was introduced, but you can run

git cat-file --batch-check --batch-all-objects

This gives you, according to the man page,

all objects in the repository and any alternate object stores (not just reachable objects)

(emphasis mine).

By default this yields the object type and its size together with each hash, but you can easily remove this information, e.g. with

git cat-file --batch-check --batch-all-objects | cut -d' ' -f1

or by giving a custom format to --batch-check.
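
A minimal sketch of the custom-format variant. The function name list_all_oids is made up for illustration; %(objectname) is one of the format atoms that --batch-check accepts.

```shell
#!/bin/sh
# Print one object ID per line for every object in the repository,
# reachable or not. list_all_oids is a hypothetical helper name.
list_all_oids() {
    git cat-file --batch-check='%(objectname)' --batch-all-objects
}
```

This avoids the extra cut process, since git itself prints only the hash.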

Answer by Aristotle Pagaltzis

This is a more correct, simpler, and faster rendition of the script from the answers by Mark and by willkil.

  • It uses rev-parse --git-path to find the objects directory even in a more complex Git repository setup (e.g. in a multi-worktree situation or whatnot).

  • It avoids all unnecessary use of find, grep, perl, sed.

  • It works gracefully even if you have no loose objects or no packs (or neither… if you're inclined to run this on a fresh repository).

  • It does, however, require a Bash from this millennium (2.02 or newer, specifically, for the extglob bit).

Share and enjoy.

#!/bin/bash
set -e
shopt -s nullglob extglob

cd "$(git rev-parse --git-path objects)"

# packed objects
for p in pack/pack-*([0-9a-f]).idx ; do
    git show-index < "$p" | cut -f 2 -d ' '
done

# loose objects: drop the slash so directory + file name form the full hash
for o in [0-9a-f][0-9a-f]/*([0-9a-f]) ; do
    echo "${o/\/}"
done

Answer by VonC

The git cat-file --batch-check --batch-all-objects command, suggested in Erki der Loony's answer, can be made faster with the new Git 2.19 (Q3 2018) option --unordered.

The API to iterate over all objects learned to optionally list objects in the order they appear in packfiles, which helps locality of access if the caller accesses these objects while they are being enumerated.

See commit 0889aae, commit 79ed0a5, commit 54d2f0d, commit ced9fff (14 Aug 2018), and commit 0750bb5, commit b1adb38, commit aa2f5ef, commit 736eb88, commit 8b36155, commit a7ff6f5, commit 202e7f1 (10 Aug 2018) by Jeff King (peff). (Merged by Junio C Hamano -- gitster -- in commit 0c54cda, 20 Aug 2018)

cat-file: support "unordered" output for --batch-all-objects

If you're going to access the contents of every object in a packfile, it's generally much more efficient to do so in pack order, rather than in hash order. That increases the locality of access within the packfile, which in turn is friendlier to the delta base cache, since the packfile puts related deltas next to each other. By contrast, hash order is effectively random, since the sha1 has no discernible relationship to the content.

This patch introduces an "--unordered" option to cat-file which iterates over packs in pack-order under the hood. You can see the results when dumping all of the file content:

$ time ./git cat-file --batch-all-objects --buffer --batch | wc -c
  6883195596

real 0m44.491s
user 0m42.902s
sys  0m5.230s

$ time ./git cat-file --unordered \
                    --batch-all-objects --buffer --batch | wc -c
  6883195596

real 0m6.075s
user 0m4.774s
sys  0m3.548s

Same output, different order, way faster. The same speed-up applies even if you end up accessing the object content in a different process, like:

git cat-file --batch-all-objects --buffer --batch-check |
grep blob |
git cat-file --batch='%(objectname) %(rest)' |
wc -c

Adding "--unordered" to the first command drops the runtime in git.git from 24s to 3.5s.

Side note: there are actually further speedups available for doing it all in-process now. Since we are outputting the object content during the actual pack iteration, we know where to find the object and could skip the extra lookup done by oid_object_info(). This patch stops short of that optimization since the underlying API isn't ready for us to make those sorts of direct requests.

So if --unordered is so much better, why not make it the default? Two reasons:

  1. We've promised in the documentation that --batch-all-objects outputs in hash order. Since cat-file is plumbing, people may be relying on that default, and we can't change it.

  2. It's actually slower for some cases. We have to compute the pack revindex to walk in pack order. And our de-duplication step uses an oidset, rather than a sort-and-dedup, which can end up being more expensive.

If we're just accessing the type and size of each object, for example, like:

git cat-file --batch-all-objects --buffer --batch-check

my best-of-five warm cache timings go from 900ms to 1100ms using --unordered. Though it's possible in a cold-cache or under memory pressure that we could do better, since we'd have better locality within the packfile.

And one final question: why is it "--unordered" and not "--pack-order"? The answer is again two-fold:

  1. "pack order" isn't a well-defined thing across the whole set of objects. We're hitting loose objects, as well as objects in multiple packs, and the only ordering we're promising is within a single pack. The rest is apparently random.

  2. The point here is optimization. So we don't want to promise any particular ordering, but only to say that we will choose an ordering which is likely to be efficient for accessing the object content. That leaves the door open for further changes in the future without having to add another compatibility option.
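
As a rough sketch of how --unordered might be used when only the object names are wanted, assuming Git 2.19 or newer. The function name list_all_oids_unordered is hypothetical; the flags are the ones that appear in the commands quoted in this answer.

```shell
#!/bin/sh
# List every object ID in whatever order git finds cheapest to produce.
# list_all_oids_unordered is a hypothetical helper; requires Git >= 2.19.
list_all_oids_unordered() {
    git cat-file --batch-all-objects --unordered --buffer \
        --batch-check='%(objectname)'
}
```

If a stable order matters downstream, pipe the result through LC_ALL=C sort, which gives the same sequence as the default hash-ordered output. Whether that is faster than omitting --unordered depends on the repository, as the timings quoted above suggest.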

It is even faster in Git 2.20 (Q4 2018) with:

See commit 8c84ae6, commit 8b2f8cb, commit 9249ca2, commit 22a1646, commit bf73282 (04 Oct 2018) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 82d0a8c, 19 Oct 2018)

oidset: use khash

Reimplement oidset using khash.h in order to reduce its memory footprint and make it faster.

Performance of a command that mainly checks for duplicate objects using an oidset, with master and Clang 6.0.1:

$ cmd="./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'"

$ /usr/bin/time $cmd >/dev/null
0.22user 0.03system 0:00.25elapsed 99%CPU (0avgtext+0avgdata 48484maxresident)k
0inputs+0outputs (0major+11204minor)pagefaults 0swaps

$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

Time (mean ± σ):     250.0 ms ±   6.0 ms    [User: 225.9 ms, System: 23.6 ms]

Range (min … max):   242.0 ms … 261.1 ms

And with this patch:

$ /usr/bin/time $cmd >/dev/null
0.14user 0.00system 0:00.15elapsed 100%CPU (0avgtext+0avgdata 41396maxresident)k
0inputs+0outputs (0major+8318minor)pagefaults 0swaps

$ hyperfine "$cmd"
Benchmark #1: ./git-cat-file --batch-all-objects --unordered --buffer --batch-check='%(objectname)'

Time (mean ± σ):     151.9 ms ±   4.9 ms    [User: 130.5 ms, System: 21.2 ms]

Range (min … max):   148.2 ms … 170.4 ms


Git 2.21 (Q1 2019) further optimizes the code path that writes out the commit-graph, by following the usual pattern of visiting objects in in-pack order.

See commit d7574c9 (19 Jan 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit 04d67b6, 05 Feb 2019)

Slightly optimize the "commit-graph write" step by using FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack().
Derrick Stolee did his own tests on Windows showing a 2% improvement with a high degree of accuracy.



Git 2.23 (Q3 2019) improves "git rev-list --objects", which learned the "--no-object-names" option to squelch the path to the object that is used as a grouping hint for pack-objects.

See commit 42357b4 (19 Jun 2019) by Emily Shaffer (nasamuffin).
(Merged by Junio C Hamano -- gitster -- in commit f4f7e75, 09 Jul 2019)

rev-list: teach --no-object-names to enable piping

Allow easier parsing by cat-file by giving rev-list an option to print only the OID of a non-commit object without any additional information.
This is a short-term shim; later on, rev-list should be taught how to print the types of objects it finds in a format similar to cat-file's.

Before this commit, the output from rev-list needed to be massaged before being piped to cat-file, like so:

git rev-list --objects HEAD | cut -f 1 -d ' ' |
    git cat-file --batch-check

This was especially unexpected when dealing with root trees, as an invisible whitespace exists at the end of the OID:

git rev-list --objects --filter=tree:1 --max-count=1 HEAD |
    xargs -I% echo "AA%AA"

Now, it can be piped directly, as in the added test case:

git rev-list --objects --no-object-names HEAD | git cat-file --batch-check

So that is the difference between:

vonc@vonvb:~/gits/src/git$ git rev-list --objects HEAD~1..
9d418600f4d10dcbbfb0b5fdbc71d509e03ba719
590f2375e0f944e3b76a055acd2cb036823d4b44 
55d368920b2bba16689cb6d4aef2a09e8cfac8ef Documentation
9903384d43ab88f5a124bc667f8d6d3a8bce7dff Documentation/RelNotes
a63204ffe8a040479654c3e44db6c170feca2a58 Documentation/RelNotes/2.23.0.txt

And, with --no-object-names:

vonc@vonvb:~/gits/src/git$ git rev-list --objects --no-object-names HEAD~1..
9d418600f4d10dcbbfb0b5fdbc71d509e03ba719
590f2375e0f944e3b76a055acd2cb036823d4b44
55d368920b2bba16689cb6d4aef2a09e8cfac8ef
9903384d43ab88f5a124bc667f8d6d3a8bce7dff
a63204ffe8a040479654c3e44db6c170feca2a58
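
Because --no-object-names prints bare OIDs, its output can be compared line by line with the cat-file listing. The following is a sketch only, assuming Git 2.23 or newer and a POSIX shell: it subtracts the objects reachable from refs and HEAD from the full object list, leaving the unreferenced ones the question asks about. The name list_unreachable_oids is hypothetical, and "reachable" here ignores reflogs and the index.

```shell
#!/bin/sh
# list_unreachable_oids is a hypothetical helper; needs Git >= 2.23
# for --no-object-names.
list_unreachable_oids() {
    all=$(mktemp)
    reachable=$(mktemp)
    # Every object in the object database, one OID per line.
    git cat-file --batch-all-objects --batch-check='%(objectname)' |
        LC_ALL=C sort > "$all"
    # Every object reachable from refs and HEAD, OIDs only.
    git rev-list --objects --no-object-names --all |
        LC_ALL=C sort -u > "$reachable"
    # comm -23 keeps the lines that appear only in the first file.
    LC_ALL=C comm -23 "$all" "$reachable"
    rm -f "$all" "$reachable"
}
```

Objects that are only kept alive by a reflog entry or by the index also show up in this list, so treat the result as "not reachable from any ref" rather than "safe to delete".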

Answer by Mark Longair

I don't know of an obviously better way than just looking at all the loose object files and the indices of all pack files. The format of the git repository is very stable, and with this method you don't have to rely on having exactly the right options to git fsck, which is classed as porcelain. I think this method is faster, as well. The following script shows all the objects in a repository:

#!/bin/sh

set -e

cd "$(git rev-parse --show-cdup)"

# Find all the objects that are in packs:

for p in .git/objects/pack/pack-*.idx
do
    git show-index < "$p" | cut -f 2 -d ' '
done

# And now find all loose objects
# (directory name + file name together form the full hash):

find .git/objects/ | grep -E '[0-9a-f]{38}' | \
  sed -r 's,^.*([0-9a-f][0-9a-f])/([0-9a-f]{38}),\1\2,'

(My original version of this script was based on this useful script to find the largest objects in your pack files, but I switched to using git show-index, as suggested in your question.)

I've made this script into a GitHub gist.

Answer by nimrodm

Another useful option is to use git verify-pack -v <packfile>

git verify-pack -v lists all objects in the given pack, along with their object type. Loose objects are not part of any pack, so they are not covered.
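
Since verify-pack works on one pack at a time, covering every pack means looping over the index files. A minimal sketch, assuming a POSIX shell; the function name list_packed_objects is made up for illustration, and the grep simply keeps the per-object lines while dropping the summary lines that verify-pack prints at the end.

```shell
#!/bin/sh
# Print "<oid> <type>" for every object stored in a pack.
# list_packed_objects is a hypothetical helper name.
list_packed_objects() {
    objdir=$(git rev-parse --git-path objects)
    for idx in "$objdir"/pack/pack-*.idx; do
        # With no packs the glob stays unexpanded, so skip missing files.
        [ -e "$idx" ] || continue
        git verify-pack -v "$idx"
    done | grep -E '^[0-9a-f]{40,64} ' | cut -d' ' -f1,2
}
```

Loose objects still have to be listed separately, for example with one of the scripts from the other answers.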
