How to find/identify large commits in git history?
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, note the original URL, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/10622179/
Asked by raphinesse
I have a 300 MB git repo. The total size of my currently checked-out files is 2 MB, and the total size of the rest of the git repo is 298 MB. This is basically a code-only repo that should not be more than a few MB.
I suspect someone accidentally committed some large files (video, images, etc.), and then removed them... but not from git, so the history still contains useless large files. How can I find the large files in the git history? There are 400+ commits, so going one by one is not practical.
NOTE: my question is not about how to remove the file, but how to find it in the first place.
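(A quick way to confirm where the space lives, my addition rather than part of the original question, is to ask git itself:

git count-objects -vH

The size-pack figure is the on-disk size of the packed history; if it dwarfs your checkout, the bloat is in history.)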
Accepted answer by Mark Longair
I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:
#!/bin/bash
#set -x
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
# extract the uncompressed size and convert bytes to kB
size=$((`echo $y | cut -f 5 -d ' '`/1024))
# extract the size inside the pack file and convert bytes to kB
compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
# extract the SHA
sha=`echo $y | cut -f 1 -d ' '`
# find the objects location in the repository tree
other=`echo "${allObjects}" | grep $sha`
#lineBreak=`echo -e "\n"`
output="${output}\n${size},${compressedSize},${other}"
done
echo -e $output | column -t -s ', '
That will give you the object name (SHA1sum) of the blob, and then you can use a script like this one:
... to find the commit that points to each of those blobs.
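As a rough sketch of what such a script does (my own approximation, not the linked script): walk all commits and print those whose tree contains the blob in question.

#!/bin/bash
# Usage: ./find-blob-commit.sh <blob-sha>
# Prints every commit whose tree contains the given blob hash.
blob="$1"
git rev-list --all | while read -r commit; do
    if git ls-tree -r "$commit" | grep -q "$blob"; then
        echo "$commit"
    fi
done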
Answered by raphinesse
A blazingly fast shell one-liner
This shell script displays all blob objects in the repository, sorted from smallest to largest.
For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.
The Base Script
git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
When you run the above code, you will get nice human-readable output like this:
...
0d99bb931299 530KiB path/to/some-image.jpg
2ba44098e28f 12MiB path/to/hires-image.png
bd1741ddce0d 63MiB path/to/some-video-1080p.mp4
macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes, or brew install coreutils.
Filtering
To achieve further filtering, insert any of the following lines before the sort line.
To exclude files that are present in HEAD, insert the following line:
| grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') \
To show only files exceeding a given size (e.g. 1 MiB = 2^20 bytes), insert the following line:
| awk '$2 >= 2^20' \
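For example, here is the full pipeline with the size filter inserted before the sort line (a sketch assembled from the snippets above; the 1 MiB threshold is arbitrary):

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| awk '$2 >= 2^20' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest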
Output for Computers
To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:
...
0d99bb93129939b72069df14af0d0dbda7eb6dba 542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4
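One example of such further processing (my own sketch, not part of the answer): pipe the raw output into awk to count the blobs and total the bytes they occupy:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| awk '{ total += $2 } END { printf "%d blobs, %d bytes\n", NR, total }'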
File Removal
For the actual file removal, check out this SO question on the topic.
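As a quick pointer beyond the linked question (my addition): git filter-repo, if installed, can strip all blobs over a size threshold in a single step, for example:

git filter-repo --strip-blobs-bigger-than 10M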
Answered by skolima
I've found a one-liner solution on the ETH Zurich Department of Physics wiki page (close to the end of that page). Just do a git gc to remove stale junk, and then
git rev-list --objects --all \
| grep "$(git verify-pack -v .git/objects/pack/*.idx \
| sort -k 3 -n \
| tail -10 \
| awk '{print $1}')"
will give you the 10 largest files in the repository.
There's also a lazier solution now available: GitExtensions now has a plugin that does this in the UI (and handles history rewrites as well).
Answered by friederbluemle
Step 1: Write all file SHA1s to a text file:
git rev-list --objects --all | sort -k 2 > allfileshas.txt
Step 2: Sort the blobs from biggest to smallest and write the results to a text file:
git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt
Step 3a: Combine both text files to get file name/SHA1/size information:
for SHA in `cut -f 1 -d\ < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done;
Step 3b: If you have file names or path names containing spaces, try this variation of Step 3a. It uses cut instead of awk to get the desired columns, including spaces, from column 7 to the end of the line:
for SHA in `cut -f 1 -d\ < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | cut -d ' ' -f'1,3,7-' >> bigtosmall.txt
done;
Now you can look at the file bigtosmall.txt in order to decide which files you want to remove from your Git history.
Step 4: To perform the removal (note this part is slow, since it's going to examine every commit in your history for data about the file you identified):
git filter-branch --tree-filter 'rm -f myLargeFile.log' HEAD
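After the rewrite, the old objects are still kept alive by reflogs until they expire, so the repository will not shrink immediately; a cleanup along the lines of the BFG answer below is usually needed as well:

git reflog expire --expire=now --all
git gc --prune=now --aggressive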
Source
Steps 1-3a were copied from Finding and Purging Big Files From Git History
EDIT
The article was deleted sometime in the second half of 2017, but an archived copy of it can still be accessed using the Wayback Machine.
Answered by Warren Seine
You should use BFG Repo-Cleaner.
According to the website:
The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:
- Removing Crazy Big Files
- Removing Passwords, Credentials & other Private data
The classic procedure for reducing the size of a repository would be:
git clone --mirror git://example.com/some-big-repo.git
java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git
cd some-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push
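If you would rather cut by size than by count, BFG also accepts a size threshold; the same procedure applies with, for example:

java -jar bfg.jar --strip-blobs-bigger-than 100M some-big-repo.git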
Answered by schmijos
If you only want to have a list of large files, then I'd like to provide you with the following one-liner:
join -o "1.1 1.2 2.3" <(git rev-list --objects --all | sort) <(git verify-pack -v objects/pack/*.idx | sort -k3 -n | tail -5 | sort) | sort -k3 -n
Its output will be:
commit file name size in bytes
72e1e6d20... db/players.sql 818314
ea20b964a... app/assets/images/background_final2.png 6739212
f8344b9b5... data_test/pg_xlog/000000010000000000000001 1625545
1ecc2395c... data_development/pg_xlog/000000010000000000000001 16777216
bc83d216d... app/assets/images/background_1forfinal.psd 95533848
The last entry in the list points to the largest file in your git history.
You can use this output to make sure that you're not deleting stuff with BFG that you would have needed in your history.
Answered by Julia Schwarz
If you are on Windows, here is a PowerShell script that will print the 10 largest files in your repository:
$revision_objects = git rev-list --objects --all;
$files = $revision_objects.Split() | Where-Object {$_.Length -gt 0 -and $(Test-Path -Path $_ -PathType Leaf) };
$files | Get-Item -Force | select fullname, length | sort -Descending -Property Length | select -First 10
Answered by Vojtech Vitek
Try git ls-files | xargs du -hs --threshold=1M.
We use the below command in our CI pipeline, it halts if it finds any big files in the git repo:
test $(git ls-files | xargs du -hs --threshold=1M 2>/dev/null | tee /dev/stderr | wc -l) -gt 0 && { echo; echo "Aborting due to big files in the git repository."; exit 1; } || true
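To unpack what this does: du --threshold=1M prints only files of at least 1 MiB, tee /dev/stderr echoes any offenders into the CI log, and wc -l counts them, so the test fails exactly when at least one big file is present.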
Answered by pdp
I was unable to make use of the most popular answer because the --batch-check command-line switch in Git 1.8.3 (which I have to use) does not accept any arguments. The following steps have been tried on CentOS 6.5 with Bash 4.1.2.
Key Concepts
In Git, the term blob refers to the contents of a file. Note that a commit might change the contents of a file or pathname. Thus, the same file could refer to a different blob depending on the commit. A certain file could be the biggest in the directory hierarchy in one commit, while not in another. Therefore, framing the question as finding large commits instead of large files puts matters in the correct perspective.
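To see this concretely (a hypothetical illustration; path/to/file stands in for any tracked path), the same path can resolve to different blob hashes in different commits:

# Print the blob that path/to/file points to in every commit that touched it.
git rev-list --all -- path/to/file | while read -r commit; do
    git ls-tree "$commit" -- path/to/file
done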
For The Impatient
The command to print the list of blobs in descending order of size is:
git cat-file --batch-check < <(git rev-list --all --objects | \
awk '{print $1}') | grep blob | sort -n -r -k 3
Sample output:
3a51a45e12d4aedcad53d3a0d4cf42079c62958e blob 305971200
7c357f2c2a7b33f939f9b7125b155adbd7890be2 blob 289163620
To remove such blobs, use the BFG Repo Cleaner, as mentioned in other answers. Given a file blobs.txt that just contains the blob hashes, for example:
3a51a45e12d4aedcad53d3a0d4cf42079c62958e
7c357f2c2a7b33f939f9b7125b155adbd7890be2
Do:
java -jar bfg.jar -bi blobs.txt <repo_dir>
The question is about finding the commits, which is more work than finding blobs. To find out how, read on.
Further Work
Given a commit hash, a command that prints hashes of all objects associated with it, including blobs, is:
git ls-tree -r --full-tree <commit_hash>
So, if we have such outputs available for all commits in the repo, then given a blob hash, the matching commits are the ones whose output contains that hash. This idea is encoded in the following script:
#!/bin/bash

DB_DIR='trees-db'

# Print the commits whose cached tree listing contains the blob hash $1.
find_commit() {
    cd ${DB_DIR}
    for f in *; do
        if grep -q ${1} ${f}; then
            echo ${f}
        fi
    done
    cd - > /dev/null
}

# Cache the recursive tree listing of every commit, one file per commit.
create_db() {
    local tfile='/tmp/commits.txt'
    mkdir -p ${DB_DIR} && cd ${DB_DIR}
    git rev-list --all > ${tfile}
    while read commit_hash; do
        if [[ ! -e ${commit_hash} ]]; then
            git ls-tree -r --full-tree ${commit_hash} > ${commit_hash}
        fi
    done < ${tfile}
    cd - > /dev/null
    rm -f ${tfile}
}

create_db

# Read blob hashes from stdin, one per line, and report matching commits.
while read id; do
    find_commit ${id}
done
If the contents are saved in a file named find-commits.sh, then a typical invocation will be as follows:
cat blobs.txt | find-commits.sh
As earlier, the file blobs.txt lists blob hashes, one per line. The create_db() function saves a cache of all commit listings in a sub-directory of the current directory.
Some stats from my experiments on a system with two Intel(R) Xeon(R) CPU E5-2620 2.00GHz processors presented by the OS as 24 virtual cores:
- Total number of commits in the repo = almost 11,000
- File creation speed = 126 files/s. The script creates a single file per commit. This occurs only when the cache is being created for the first time.
- Cache creation overhead = 87 s.
- Average search speed = 522 commits/s. The cache optimization resulted in 80% reduction in running time.
Note that the script is single threaded. Therefore, only one core would be used at any one time.
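A possible simplification on the same idea (my own sketch, not from the answer): since each cached listing is a plain file named after its commit, a single grep -l call can scan all of them at once and print the matching commit hashes:

(cd ${DB_DIR} && grep -l "${id}" *)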
Answered by Aaron
PowerShell solution for Windows git to find the largest files:
git ls-tree -r -t -l --full-name HEAD | Where-Object {
$_ -match '(.+)\s+(.+)\s+(.+)\s+(\d+)\s+(.*)'
} | ForEach-Object {
New-Object -Type PSObject -Property @{
'col1' = $matches[1]
'col2' = $matches[2]
'col3' = $matches[3]
'Size' = [int]$matches[4]
'path' = $matches[5]
}
} | sort -Property Size -Top 10 -Descending