git 如何在git存储库中找到N个最大的文件?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/9456550/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-19 06:32:28  来源:igfitidea点击:

How to find the N largest files in a git repository?

git

提问by Sumit

I wanted to find the 10 largest files in my repository. The script I came up with is as follows:

我想在我的存储库中找到 10 个最大的文件。我想出的脚本如下:

REP_HOME_DIR=<top level git directory>
max_huge_files=10

cd ${REP_HOME_DIR}
git verify-pack -v ${REP_HOME_DIR}/.git/objects/pack/pack-*.idx | \
  grep blob | \
  sort -r -k 3 -n | \
  head -${max_huge_files} | \
  awk '{ system("printf \"%-80s \" `git rev-list --objects --all | grep "  " | cut -d\" \" -f2`"); printf "Size:%5d MB Size in pack file:%5d MB\n", /1048576,  /1048576; }'
cd -

Is there a better/more elegant way to do the same?

有没有更好/更优雅的方法来做同样的事情?

By "files" I mean the files that have been checked into the repository.

“文件”是指已签入存储库的文件。

回答by ypid

I found another way to do it:

我找到了另一种方法:

git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10

Quoted from: SO: git find fat commit

引自:SO: git find fat commit

回答by raphinesse

This bash "one-liner" displays the 10 largest blobs in the repository, sorted from smallest to largest. In contrast to the other answers, this includes allfiles tracked by the repository, even those not present in any branch tip.

这个 bash “one-liner” 显示存储库中 10 个最大的 blob,从最小到最大排序。与其他答案相反,这包括存储库跟踪的所有文件,甚至那些不存在于任何分支提示中的文件。

It's very fast, easy to copy & paste and only requires standard GNU utilities.

非常快速,易于复制和粘贴,并且只需要标准的 GNU 实用程序。

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 10 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The first four lines implement the core functionality, the fifth limits the number of results, while the last two lines provide the nice human-readable outputthat looks like this:

前四行实现了核心功能,第五行限制了结果的数量,而最后两行提供了很好的人类可读的输出,如下所示:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

For more information, including further filtering use cases and an output format more suitable for script processing, see my original answerto a similar question.

有关更多信息,包括进一步过滤用例和更适合脚本处理的输出格式,请参阅我对类似问题的原始回答

macOS users: Since numfmtis not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.

macOS 用户:由于numfmt在 macOS 上不可用,您可以省略最后一行并处理原始字节大小或brew install coreutils.

回答by pranithk

How about

怎么样

git ls-files | xargs ls -l | sort -nrk5 | head -n 10
  • git ls-files: List all the files in the repo
  • xargs ls -l: perform ls -lon all the files returned in git ls-files
  • sort -nrk5: Numerically reverse sort the lines based on 5th column
  • head -n 10: Print the top 10 lines
  • git ls-files: 列出 repo 中的所有文件
  • xargs ls -l:ls -l对返回的所有文件执行git ls-files
  • sort -nrk5: 根据第 5 列对行进行数字反向排序
  • head -n 10: 打印前 10 行

回答by studog

An improvement to raphinesse's answer, sort by size with largest first:

对 raphinesse 答案的改进,按大小排序,首先是最大的:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr(
git ls-tree -r -l --abbrev --full-name HEAD | Sort-Object {[int]($_ -split "\s+")[3]} | Select-Object -last 10
,6)}' \ | sort --numeric-sort --key=2 --reverse \ | head \ | cut --complement --characters=13-40 \ | numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

回答by pix64

Cannot comment. ypid's answer modified for powershell

无法评论。ypid 为 powershell 修改的答案

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | Where-Object {$_ -like "blob*"} | Sort-Object {[int]($_ -split "\s+")[2]} | Select-Object -last 10

Edit raphinesse's solution(ish)

编辑 raphinesse 的解决方案(ish)

git rev-list --objects --all |
 git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
 Where-Object {$_ -like "blob*"} |
 % { $tokens = $_ -split "\|"; [pscustomobject]@{ Hash = $tokens[1]; Size = [int]($tokens[2]); Name = $tokens[3] } } |
 Sort-Object -Property Size -Descending |
 Select-Object -First 50

回答by UnionP

On Windows, I started with @pix64's answer (thanks!) and modified it to handle files with spaces in the path, and also to output objects instead of strings:

在 Windows 上,我从 @pix64 的回答开始(谢谢!)并将其修改为处理路径中带有空格的文件,并输出对象而不是字符串:

Format-Table Hash, Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}

Even better, if you want to output the file sizes with nice file size units, you can add the DisplayInBytes function from here to your environment https://stackoverflow.com/a/40887001/892770, and then pipe the above to:

更好的是,如果您想以不错的文件大小单位输出文件大小,您可以从这里将 DisplayInBytes 函数添加到您的环境https://stackoverflow.com/a/40887001/892770,然后将上述内容通过管道传输到:

Hash                                     Name                                        Size
----                                     ----                                        ----
f51371aa843279a1efe45ff14f3dc3ec5f6b2322 types/react-native-snackbar-component/react 95.8 MB
84f3d727f6b8f99ab4698da51f9e507ae4cd8879 .ntvs_analysis.dat                          94.5 MB
17d734397dcd35fdbd715d29ef35860ecade88cd fhir/fhir-tests.ts                          11.5 KB
4c6a027cdbce093fd6ae15e65576cc8d81cec46c fhir/fhir-tests.ts                          11.4 KB

This gives you output like:

这为您提供如下输出:

git rev-list --objects --all |
 git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
 Where-Object {$_ -like "blob*"} |
 % { $tokens = $_ -split "\|"; [pscustomobject]@{ Size = [int]($tokens[2]); Extension = [System.IO.Path]::GetExtension($tokens[3]) } } |
 Group-Object -Property Extension |
 % { [pscustomobject]@{ Name = $_.Name; Size = ($_.Group | Measure-Object Size -Sum).Sum } } |
 Sort-Object -Property Size -Descending |
 select -First 20 -Property Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}

Lastly, if you'd like to get all the largest file types, you can do so with:

最后,如果你想获得所有最大的文件类型,你可以这样做:

ls -lSh `git ls-files` | head

回答by tsvikas

For completion, here's the method I found:

为了完成,这是我找到的方法:

find * -type f -size +100M -print0 | xargs -0 git ls-files

The optional -hprints the size in human-readable format.

可选-h以人类可读的格式打印尺寸。

回答by First Zero

You can also use du- Example: du -ah objects | sort -n -r | head -n 10. du to get the size of the objects, sortthem and then picking the top 10 using head.

您还可以使用du- 示例:du -ah objects | sort -n -r | head -n 10. du 获取对象的大小,sort然后使用head.

回答by Joey Adams

You can use findto find files larger than a given threshold, then pass them to git ls-filesto exclude untracked files (e.g. build output):

您可以使用find查找大于给定阈值的文件,然后将它们传递给以git ls-files排除未跟踪的文件(例如构建输出):

##代码##

Adjust 100M (100 megabytes) as needed until you get results.

根据需要调整 100M(100 兆字节),直到获得结果。

Minor caveat: this won't search top-level "hidden" files and folders (i.e. those whose names start with .). This is because I used find *instead of just findto avoid searching the .gitdatabase.

小警告:这不会搜索顶级“隐藏”文件和文件夹(即名称以 开头的文件和文件夹.)。这是因为我使用find *而不仅仅是find为了避免搜索.git数据库。

I was having trouble getting the sort -nsolutions to work (on Windows under Git Bash). I'm guessing it's due to indentation differences when xargs batches arguments, which xargs -0seems to do automatically to work around Windows' command-line length limit of 32767.

我在获得sort -n解决方案时遇到了麻烦(在 Git Bash 下的 Windows 上)。我猜这是由于 xargs 批处理参数时的缩进差异,这xargs -0似乎可以自动解决 Windows 的命令行长度限制 32767。