git 如何在git存储库中找到N个最大的文件?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/9456550/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to find the N largest files in a git repository?
提问by Sumit
I wanted to find the 10 largest files in my repository. The script I came up with is as follows:
我想在我的存储库中找到 10 个最大的文件。我想出的脚本如下:
REP_HOME_DIR=<top level git directory>
max_huge_files=10
cd ${REP_HOME_DIR}
git verify-pack -v ${REP_HOME_DIR}/.git/objects/pack/pack-*.idx | \
grep blob | \
sort -r -k 3 -n | \
head -${max_huge_files} | \
awk '{ system("printf \"%-80s \" `git rev-list --objects --all | grep " " | cut -d\" \" -f2`"); printf "Size:%5d MB Size in pack file:%5d MB\n", /1048576, /1048576; }'
cd -
Is there a better/more elegant way to do the same?
有没有更好/更优雅的方法来做同样的事情?
By "files" I mean the files that have been checked into the repository.
“文件”是指已签入存储库的文件。
回答by ypid
I found another way to do it:
我找到了另一种方法:
git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 | tail -n 10
Quoted from: SO: git find fat commit
回答by raphinesse
This bash "one-liner" displays the 10 largest blobs in the repository, sorted from smallest to largest. In contrast to the other answers, this includes allfiles tracked by the repository, even those not present in any branch tip.
这个 bash “one-liner” 显示存储库中 10 个最大的 blob,从最小到最大排序。与其他答案相反,这包括存储库跟踪的所有文件,甚至那些不存在于任何分支提示中的文件。
It's very fast, easy to copy & paste and only requires standard GNU utilities.
它非常快速,易于复制和粘贴,并且只需要标准的 GNU 实用程序。
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| tail -n 10 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
The first four lines implement the core functionality, the fifth limits the number of results, while the last two lines provide the nice human-readable outputthat looks like this:
前四行实现了核心功能,第五行限制了结果的数量,而最后两行提供了很好的人类可读的输出,如下所示:
...
0d99bb931299 530KiB path/to/some-image.jpg
2ba44098e28f 12MiB path/to/hires-image.png
bd1741ddce0d 63MiB path/to/some-video-1080p.mp4
For more information, including further filtering use cases and an output format more suitable for script processing, see my original answerto a similar question.
有关更多信息,包括进一步过滤用例和更适合脚本处理的输出格式,请参阅我对类似问题的原始回答。
macOS users: Since numfmt
is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils
.
macOS 用户:由于numfmt
在 macOS 上不可用,您可以省略最后一行并处理原始字节大小或brew install coreutils
.
回答by pranithk
How about
怎么样
git ls-files | xargs ls -l | sort -nrk5 | head -n 10
git ls-files
: List all the files in the repoxargs ls -l
: performls -l
on all the files returned ingit ls-files
sort -nrk5
: Numerically reverse sort the lines based on 5th columnhead -n 10
: Print the top 10 lines
git ls-files
: 列出 repo 中的所有文件xargs ls -l
:ls -l
对返回的所有文件执行git ls-files
sort -nrk5
: 根据第 5 列对行进行数字反向排序head -n 10
: 打印前 10 行
回答by studog
An improvement to raphinesse's answer, sort by size with largest first:
对 raphinesse 答案的改进,按大小排序,首先是最大的:
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| awk '/^blob/ {print substr(git ls-tree -r -l --abbrev --full-name HEAD | Sort-Object {[int]($_ -split "\s+")[3]} | Select-Object -last 10
,6)}' \
| sort --numeric-sort --key=2 --reverse \
| head \
| cut --complement --characters=13-40 \
| numfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
回答by pix64
Cannot comment. ypid's answer modified for powershell
无法评论。ypid 为 powershell 修改的答案
git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | Where-Object {$_ -like "blob*"} | Sort-Object {[int]($_ -split "\s+")[2]} | Select-Object -last 10
Edit raphinesse's solution(ish)
编辑 raphinesse 的解决方案(ish)
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
Where-Object {$_ -like "blob*"} |
% { $tokens = $_ -split "\|"; [pscustomobject]@{ Hash = $tokens[1]; Size = [int]($tokens[2]); Name = $tokens[3] } } |
Sort-Object -Property Size -Descending |
Select-Object -First 50
回答by UnionP
On Windows, I started with @pix64's answer (thanks!) and modified it to handle files with spaces in the path, and also to output objects instead of strings:
在 Windows 上,我从 @pix64 的回答开始(谢谢!)并将其修改为处理路径中带有空格的文件,并输出对象而不是字符串:
Format-Table Hash, Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}
Even better, if you want to output the file sizes with nice file size units, you can add the DisplayInBytes function from here to your environment https://stackoverflow.com/a/40887001/892770, and then pipe the above to:
更好的是,如果您想以不错的文件大小单位输出文件大小,您可以从这里将 DisplayInBytes 函数添加到您的环境https://stackoverflow.com/a/40887001/892770,然后将上述内容通过管道传输到:
Hash Name Size
---- ---- ----
f51371aa843279a1efe45ff14f3dc3ec5f6b2322 types/react-native-snackbar-component/react 95.8 MB
84f3d727f6b8f99ab4698da51f9e507ae4cd8879 .ntvs_analysis.dat 94.5 MB
17d734397dcd35fdbd715d29ef35860ecade88cd fhir/fhir-tests.ts 11.5 KB
4c6a027cdbce093fd6ae15e65576cc8d81cec46c fhir/fhir-tests.ts 11.4 KB
This gives you output like:
这为您提供如下输出:
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype)|%(objectname)|%(objectsize)|%(rest)' |
Where-Object {$_ -like "blob*"} |
% { $tokens = $_ -split "\|"; [pscustomobject]@{ Size = [int]($tokens[2]); Extension = [System.IO.Path]::GetExtension($tokens[3]) } } |
Group-Object -Property Extension |
% { [pscustomobject]@{ Name = $_.Name; Size = ($_.Group | Measure-Object Size -Sum).Sum } } |
Sort-Object -Property Size -Descending |
select -First 20 -Property Name, @{Name="Size";Expression={ DisplayInBytes($_.Size) }}
Lastly, if you'd like to get all the largest file types, you can do so with:
最后,如果你想获得所有最大的文件类型,你可以这样做:
ls -lSh `git ls-files` | head
回答by tsvikas
For completion, here's the method I found:
为了完成,这是我找到的方法:
find * -type f -size +100M -print0 | xargs -0 git ls-files
The optional -h
prints the size in human-readable format.
可选-h
以人类可读的格式打印尺寸。
回答by First Zero
You can also use du
- Example: du -ah objects | sort -n -r | head -n 10
. du to get the size of the objects, sort
them and then picking the top 10 using head
.
您还可以使用du
- 示例:du -ah objects | sort -n -r | head -n 10
. du 获取对象的大小,sort
然后使用head
.
回答by Joey Adams
You can use find
to find files larger than a given threshold, then pass them to git ls-files
to exclude untracked files (e.g. build output):
您可以使用find
查找大于给定阈值的文件,然后将它们传递给以git ls-files
排除未跟踪的文件(例如构建输出):
Adjust 100M (100 megabytes) as needed until you get results.
根据需要调整 100M(100 兆字节),直到获得结果。
Minor caveat: this won't search top-level "hidden" files and folders (i.e. those whose names start with .
). This is because I used find *
instead of just find
to avoid searching the .git
database.
小警告:这不会搜索顶级“隐藏”文件和文件夹(即名称以 开头的文件和文件夹.
)。这是因为我使用find *
而不仅仅是find
为了避免搜索.git
数据库。
I was having trouble getting the sort -n
solutions to work (on Windows under Git Bash). I'm guessing it's due to indentation differences when xargs batches arguments, which xargs -0
seems to do automatically to work around Windows' command-line length limit of 32767.
我在获得sort -n
解决方案时遇到了麻烦(在 Git Bash 下的 Windows 上)。我猜这是由于 xargs 批处理参数时的缩进差异,这xargs -0
似乎可以自动解决 Windows 的命令行长度限制 32767。