git: How to find the N largest files in a git repository?

Question

Asked by Sumit

I wanted to find the 10 largest files in my repository. The script I came up with is as follows:

REP_HOME_DIR=<top level git directory>
max_huge_files=10

cd ${REP_HOME_DIR}
git verify-pack -v ${REP_HOME_DIR}/.git/objects/pack/pack-*.idx | \
  grep blob | \
  sort -r -k 3 -n | \
  head -${max_huge_files} | \
  awk '{ system("printf \"%-80s \" `git rev-list --objects --all | grep " $1 " | cut -d\" \" -f2`"); printf "Size:%5d MB Size in pack file:%5d MB\n", $3/1048576,  $4/1048576; }'
cd -

Is there a better/more elegant way to do the same?

By "files" I mean the files that have been checked into the repository.

Answer 1

Answered by ypid

I found another way to do it:

git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10

Quoted from: SO: git find fat commit

Answer 2

Answered by raphinesse

This bash "one-liner" displays the 10 largest blobs in the repository, sorted from smallest to largest. In contrast to the other answers, this includes all files tracked by the repository, even those not present in any branch tip.

It's very fast, easy to copy & paste and only requires standard GNU utilities.

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 10 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

The first four lines implement the core functionality, the fifth limits the number of results, while the last two lines provide the nice human-readable output that looks like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4
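
To get N results rather than a fixed 10, the filtering and sorting steps can be wrapped in a small function. This is only a sketch: the name top_blobs is made up for illustration, and it assumes its input has the "<type> <sha> <size> <path>" layout produced by the --batch-check format above. It uses POSIX sort options instead of the GNU long options.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of git): keep the N largest blobs.
# Reads lines of the form "<type> <sha> <size> <path>" on stdin,
# i.e. the output of the git cat-file --batch-check format shown above.
top_blobs() {
  local n="${1:-10}"          # number of results, defaults to 10
  sed -n 's/^blob //p' \
    | sort -k2,2n \
    | tail -n "$n"            # smallest first, largest last
}

# Usage inside a repository (not executed here):
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#     | top_blobs 5
```

Because the function only reads stdin, it can be tried on hand-written sample lines before pointing it at a real repository.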

For more information, including further filtering use cases and an output format more suitable for script processing, see my original answer to a similar question.

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.
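
If installing coreutils is not an option, a rough stand-in for the last two lines of the pipeline can be written with awk alone. The sketch below is an assumption-laden substitute, not what numfmt does internally: the function name humanize_sizes is hypothetical, it expects "<sha> <size> <path>" lines separated by single spaces, and it rounds to one decimal place.

```shell
#!/usr/bin/env bash
# Hypothetical fallback for systems without numfmt.
# Reads "<sha> <size-in-bytes> <path>" lines (single spaces) on stdin and
# prints "<first 12 chars of sha> <size in IEC units> <path>".
humanize_sizes() {
  awk '{
    size = $2; unit = 1
    split("B KiB MiB GiB TiB", units, " ")
    while (size >= 1024 && unit < 5) { size /= 1024; unit++ }
    if (unit == 1) human = sprintf("%d%s", size, units[unit])
    else           human = sprintf("%.1f%s", size, units[unit])
    # Take the path from the raw line so embedded spaces are preserved.
    rest = substr($0, length($1) + length($2) + 3)
    printf "%s %7s %s\n", substr($1, 1, 12), human, rest
  }'
}

# Usage (not executed here): replace the cut and numfmt lines with
#   ... | tail -n 10 | humanize_sizes
```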

Answer 3

Answered by pranithk

How about

git ls-files | xargs ls -l | sort -nrk5 | head -n 10

git ls-files: list all the files in the repo
xargs ls -l: run ls -l on all the files returned by git ls-files
sort -nrk5: sort the lines numerically, in reverse, by the 5th column
head -n 10: print the top 10 lines
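
The pipeline above splits file names on whitespace, so paths containing spaces can trip it up. A possible variant, sketched here with a hypothetical helper name (largest_files), reads NUL-delimited paths such as those printed by git ls-files -z and measures each file with wc -c instead of parsing ls -l columns. It assumes bash and paths without embedded newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the N largest files from a NUL-delimited
# list of paths on stdin, as "<bytes><TAB><path>", largest first.
largest_files() {
  local n="${1:-10}"
  while IFS= read -r -d '' f; do
    if [ -f "$f" ]; then
      # tr strips the padding some wc implementations add
      printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    fi
  done | sort -k1,1nr | head -n "$n"
}

# Usage inside a working tree (not executed here):
#   git ls-files -z | largest_files 10
```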

Answer 4

Answered by studog

An improvement to raphinesse's answer, sort by size with largest first:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr($0,6)}' \
| sort --numeric-sort --key=2 --reverse \
| head \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Answer 5

Answered by pix64

Cannot comment. ypid's answer modified for PowerShell:

git ls-tree -r -l --abbrev --full-name HEAD | Sort-Object {[int]($_ -split "\s+")[3]} | Select-Object -last 10

Edit: raphinesse's solution(ish):

git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | Where-Object {$_ -like "blob*"} | Sort-Object {[int]($_ -split "\s+")[2]} | Select-Object -last 10

Answer 6

Answered by UnionP

On Windows, I started with @pix64's answer (thanks!) and modified it to handle files with spaces in the path, and also to output objects instead of strings:

git rev-list --objects --all |
 git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
 Where-Object {$_ -like "blob*"} |
 % { $tokens = $_ -split "\|"; [pscustomobject]@{ Hash = $tokens[1]; Size = [int]($tokens[2]); Name = $tokens[3] } } |
 Sort-Object -Property Size -Descending |
 Select-Object -First 50

Even better, if you want to output the file sizes with nice file size units, you can add the DisplayInBytes function from https://stackoverflow.com/a/40887001/892770 to your environment, and then pipe the above to:

Format-Table Hash, Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}

This gives you output like:

Hash                                     Name                                        Size
----                                     ----                                        ----
f51371aa843279a1efe45ff14f3dc3ec5f6b2322 types/react-native-snackbar-component/react 95.8 MB
84f3d727f6b8f99ab4698da51f9e507ae4cd8879 .ntvs_analysis.dat                          94.5 MB
17d734397dcd35fdbd715d29ef35860ecade88cd fhir/fhir-tests.ts                          11.5 KB
4c6a027cdbce093fd6ae15e65576cc8d81cec46c fhir/fhir-tests.ts                          11.4 KB

Lastly, if you'd like to get the largest file types, you can do so with:

git rev-list --objects --all |
 git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
 Where-Object {$_ -like "blob*"} |
 % { $tokens = $_ -split "\|"; [pscustomobject]@{ Size = [int]($tokens[2]); Extension = [System.IO.Path]::GetExtension($tokens[3]) } } |
 Group-Object -Property Extension |
 % { [pscustomobject]@{ Name = $_.Name; Size = ($_.Group | Measure-Object Size -Sum).Sum } } |
 Sort-Object -Property Size -Descending |
 select -First 20 -Property Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}

Answer 7

Answered by tsvikas

For completeness, here's the method I found:

ls -lSh `git ls-files` | head

The optional -h prints the size in human-readable format.

Answer 8

Answered by First Zero

You can also use du. Example: du -ah objects | sort -n -r | head -n 10. This uses du to get the size of the objects, sort to order them, and head to pick the top 10.

Answer 9

Answered by Joey Adams

You can use find to find files larger than a given threshold, then pass them to git ls-files to exclude untracked files (e.g. build output):

find * -type f -size +100M -print0 | xargs -0 git ls-files

Adjust 100M (100 megabytes) as needed until you get results.

Minor caveat: this won't search top-level "hidden" files and folders (i.e. those whose names start with .). This is because I used find * instead of just find to avoid searching the .git database.
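
One possible way around that caveat, sketched under assumptions rather than taken from the answer, is to start from . and prune the .git directory explicitly. The function name big_files is hypothetical, and the size suffixes (k, M) assume a find that supports them, such as GNU or BSD find.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files above a size threshold as NUL-delimited
# paths, including top-level hidden files, while skipping ./.git entirely.
big_files() {
  local threshold="${1:-+100M}"   # passed straight to find -size
  find . -path ./.git -prune -o -type f -size "$threshold" -print0
}

# Usage inside a working tree (not executed here):
#   big_files +100M | xargs -0 git ls-files --
```

The -prune action stops find from descending into .git, so dotfiles elsewhere in the tree are still examined.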

I was having trouble getting the sort -n solutions to work (on Windows under Git Bash). I'm guessing it's due to indentation differences when xargs batches arguments, which xargs -0 seems to do automatically to work around Windows' command-line length limit of 32767.
