Disclaimer: this page is a mirror of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original: http://stackoverflow.com/questions/3313908/
Git is really slow for 100,000 objects. Any fixes?
Asked by manumoomoo
I have a "fresh" git-svn repo (11.13 GB) that has over 100,000 objects in it.
I have performed
git fsck
git gc
on the repo after the initial checkout.
I then tried to do a
git status
The time it takes to do a git status is anywhere from 2m25.578s to 2m53.901s.
I tested git status by issuing the command
time git status
5 times, and every run fell between the two times listed above.
I am doing this on Mac OS X, locally, not through a VM.
There is no way it should be taking this long.
Any ideas? Help?
Thanks.
Edit
I have a co-worker sitting right next to me with a comparable box. Less RAM and running Debian with a jfs filesystem. His git status runs in 0.3 s on the same repo (it is also a git-svn checkout).
Also, I recently changed my file permissions (to 777) on this folder and it brought the time down considerably (why, I have no clue). I can now get it done anywhere between 3 and 6 seconds. This is manageable, but still a pain.
Accepted answer by manumoomoo
It came down to a couple of items that I can see right now.
- git gc --aggressive
- Opening up file permissions to 777
There has to be something else going on, but these were the things that clearly made the biggest impact.
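In shell terms, a minimal sketch of those two steps (assuming you want to mirror the exact setup described above; note that 777 is very permissive, and something like 755 would normally be a safer choice):

# run inside the repository
git gc --aggressive
# open up permissions on the working tree, as described above; use with care
chmod -R 777 .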
Answered by masonk
git status has to look at every file in the repository every time. You can tell it to stop looking at trees that you aren't working on with
git update-index --assume-unchanged <trees to skip>
From the manpage:
When these flags are specified, the object names recorded for the paths are not updated. Instead, these options set and unset the "assume unchanged" bit for the paths. When the "assume unchanged" bit is on, git stops checking the working tree files for possible modifications, so you need to manually unset the bit to tell git when you change the working tree file. This is sometimes helpful when working with a big project on a filesystem that has very slow lstat(2) system call (e.g. cifs).
This option can be also used as a coarse file-level mechanism to ignore uncommitted changes in tracked files (akin to what .gitignore does for untracked files). Git will fail (gracefully) in case it needs to modify this file in the index e.g. when merging in a commit; thus, in case the assumed-untracked file is changed upstream, you will need to handle the situation manually.
Many operations in git depend on your filesystem to have an efficient lstat(2) implementation, so that st_mtime information for working tree files can be cheaply checked to see if the file contents have changed from the version recorded in the index file. Unfortunately, some filesystems have inefficient lstat(2). If your filesystem is one of them, you can set "assume unchanged" bit to paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the contents of the file to see if it has changed — it makes git to omit any checking and assume it has not changed. When you make changes to working tree files, you have to explicitly tell git about it by dropping "assume unchanged" bit, either before or after you modify them.
...
In order to set "assume unchanged" bit, use --assume-unchanged option. To unset, use --no-assume-unchanged.
The command looks at core.ignorestat configuration variable. When this is true, paths updated with git update-index paths… and paths updated with other git commands that update both index and working tree (e.g. git apply --index, git checkout-index -u, and git read-tree -u) are automatically marked as "assume unchanged". Note that "assume unchanged" bit is not set if git update-index --refresh finds the working tree file matches the index (use git update-index --really-refresh if you want to mark them as "assume unchanged").
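As a hedged illustration of the manpage text above, this is how you might mark a whole subtree you never touch (the vendor/ path is only an example; --assume-unchanged operates on tracked files, so the paths are fed through git ls-files):

# skip lstat() checks for everything under a (hypothetical) vendor/ tree
git ls-files -z vendor/ | xargs -0 git update-index --assume-unchanged

# later, when you want git to watch those files again
git ls-files -z vendor/ | xargs -0 git update-index --no-assume-unchanged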
Now, clearly, this solution is only going to work if there are parts of the repo that you can conveniently ignore. I work on a project of similar size, and there are definitely large trees that I don't need to check on a regular basis. The semantics of git-status make it a generally O(n) problem (n being the number of files). You need domain-specific optimizations to do better than that.
Note that if you work in a stitching pattern, that is, if you integrate changes from upstream by merge instead of rebase, then this solution becomes less convenient, because a change to an --assume-unchanged object merging in from upstream becomes a merge conflict. You can avoid this problem with a rebasing workflow.
Answered by VonC
git status should be quicker in Git 2.13 (Q2 2017), because of:
- an optimization around arrays of strings (see "ways to improve git status performance")
- better "read cache" management.
On that last point, see commit a33fc72 (14 Apr 2017) by Jeff Hostetler (jeffhostetler).
(Merged by Junio C Hamano -- gitster -- in commit cdfe138, 24 Apr 2017)
read-cache: force_verify_index_checksum

Teach git to skip verification of the SHA1-1 checksum at the end of the index file in verify_hdr(), which is called from read_index(), unless the "force_verify_index_checksum" global variable is set.

Teach fsck to force this verification.

The checksum verification is for detecting disk corruption, and for small projects, the time it takes to compute SHA-1 is not that significant, but for gigantic repositories this calculation adds significant time to every command.
Git 2.14 again improves git status performance by better taking into account the "untracked cache", which allows Git to skip reading untracked directories if their stat data have not changed, using the mtime field of the stat structure.
See Documentation/technical/index-format.txt for more on the untracked cache.
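If you want to try the untracked cache yourself, something like the following should work on Git 2.8 or later (core.untrackedCache is the documented setting; the --test-untracked-cache run only checks whether your filesystem's mtime behaviour supports it):

# check that mtime-based caching works on this filesystem
git update-index --test-untracked-cache
# enable the untracked cache for this repository
git config core.untrackedCache true
git update-index --untracked-cache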
See commit edf3b90 (08 May 2017) by David Turner (dturner-tw).
(Merged by Junio C Hamano -- gitster -- in commit fa0624f, 30 May 2017)
When "
git checkout
", "git merge
", etc. manipulates the in-core index, various pieces of information in the index extensions are discarded from the original state, as it is usually not the case that they are kept up-to-date and in-sync with the operation on the main index.The untracked cache extension is copied across these operations now, which would speed up "git status" (as long as the cache is properly invalidated).
More generally, writing to the cache will also be quicker with Git 2.14.x/2.15.
See commit ce012de, commit b50386c, commit 3921a0b (21 Aug 2017) by Kevin Willford.
(Merged by Junio C Hamano -- gitster -- in commit 030faf2, 27 Aug 2017)
We used to spend more than necessary cycles allocating and freeing pieces of memory while writing each index entry out. This has been optimized.

[That] would save anywhere between 3-7% when the index had over a million entries, with no performance degradation on small repos.
Update Dec. 2017: Git 2.16 (Q1 2018) will propose an additional enhancement, this time for git log, since the code to iterate over loose object files just got optimized.
See commit 163ee5e (04 Dec 2017) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 97e1f85, 13 Dec 2017)
sha1_file: use strbuf_add() instead of strbuf_addf()

Replace use of strbuf_addf() with strbuf_add() when enumerating loose objects in for_each_file_in_obj_subdir(). Since we already check the length and hex-values of the string before consuming the path, we can prevent extra computation by using the lower-level method.

One consumer of for_each_file_in_obj_subdir() is the abbreviation code. OID (object identifier) abbreviations use a cached list of loose objects (per object subdirectory) to make repeated queries fast, but there is significant cache load time when there are many loose objects.

Most repositories do not have many loose objects before repacking, but in the GVFS case (see "Announcing GVFS (Git Virtual File System)") the repos can grow to have millions of loose objects. Profiling 'git log' performance in Git For Windows on a GVFS-enabled repo with ~2.5 million loose objects revealed 12% of the CPU time was spent in strbuf_addf().

Add a new performance test to p4211-line-log.sh that is more sensitive to this cache-loading. By limiting to 1000 commits, we more closely resemble user wait time when reading history into a pager.

For a copy of the Linux repo with two ~512 MB packfiles and ~572K loose objects, running 'git log --oneline --parents --raw -1000' had the following performance:
HEAD~1 HEAD
----------------------------------------
7.70(7.15+0.54) 7.44(7.09+0.29) -3.4%
Update March 2018: Git 2.17 will improve git status some more: see this answer.
Update: Git 2.20 (Q4 2018) adds the Index Entry Offset Table (IEOT), which allows git status to load the index faster.
See commit 77ff112, commit 3255089, commit abb4bb8, commit c780b9c, commit 3b1d9e0, commit 371ed0d (10 Oct 2018) by Ben Peart (benpeart).
See commit 252d079 (26 Sep 2018) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit e27bfaa, 19 Oct 2018)
read-cache: load cache entries on worker threads

This patch helps address the CPU cost of loading the index by utilizing the Index Entry Offset Table (IEOT) to divide loading and conversion of the cache entries across multiple threads in parallel.

I used p0002-read-cache.sh to generate some performance data:

Test w/100,000 files reduced the time by 32.24%
Test w/1,000,000 files reduced the time by -4.77%

Note that in the 1,000,000 files case, multi-threading the cache entry parsing does not yield a performance win. This is because the cost to parse the index extensions in this repo far outweighs the cost of loading the cache entries.
That allows for:
config: add new index.threads config setting

Add support for a new index.threads config setting which will be used to control the threading code in do_read_index().

- A value of 0 will tell the index code to automatically determine the correct number of threads to use.
- A value of 1 will make the code single threaded.
- A value greater than 1 will set the maximum number of threads to use.

For testing purposes, this setting can be overwritten by setting the GIT_TEST_INDEX_THREADS=<n> environment variable to a value greater than 0.
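If you are on Git 2.20 or later, trying the setting described above is a one-liner (0 is the documented "auto-detect the number of CPUs" value):

# let git pick the number of index-loading threads automatically
git config index.threads 0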
Git 2.21 (Q1 2019) introduces a new improvement, with an update of the loose object cache (used to optimize existence look-ups).
See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)
object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed. It is stored in an oid_array, which has to be resorted after each add operation. So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.

Use one oid_array for each subdirectory. This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.

The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:
$ git count-objects
26733 objects, 68808 kilobytes
Test HEAD^ HEAD
--------------------------------------------------------------------
4205.1: log with %H 0.51(0.47+0.04) 0.51(0.49+0.02) +0.0%
4205.2: log with %h 0.84(0.82+0.02) 0.60(0.57+0.03) -28.6%
4205.3: log with %T 0.53(0.49+0.04) 0.52(0.48+0.03) -1.9%
4205.4: log with %t 0.84(0.80+0.04) 0.60(0.59+0.01) -28.6%
4205.5: log with %P 0.52(0.48+0.03) 0.51(0.50+0.01) -1.9%
4205.6: log with %p 0.85(0.78+0.06) 0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h 0.96(0.92+0.03) 0.69(0.64+0.04) -28.1%
With Git 2.26 (Q1 2020), the object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal.
There are, however, some cases where they can work together, and they were taught about them.
See commit 20a5fd8 (18 Feb 2020) by Junio C Hamano (gitster).
See commit 3ab3185, commit 84243da, commit 4f3bd56, commit cc4aa28, commit 2aaeb9a, commit 6663ae0, commit 4eb707e, commit ea047a8, commit 608d9c9, commit 55cb10f, commit 792f811, commit d90fe06 (14 Feb 2020), and commit e03f928, commit acac50d, commit 551cf8b (13 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0df82d9, 02 Mar 2020)
pack-bitmap: implement BLOB_NONE filtering

Signed-off-by: Jeff King

We can easily support BLOB_NONE filters with bitmaps. Since we know the types of all of the objects, we just need to clear the result bits of any blobs.

Note two subtleties in the implementation (which I also called out in comments):

- we have to include any blobs that were specifically asked for (and not reached through graph traversal) to match the non-bitmap version
- we have to handle in-pack and "ext_index" objects separately. Arguably prepare_bitmap_walk() could be adding these ext_index objects to the type bitmaps. But it doesn't for now, so let's match the rest of the bitmap code here (it probably wouldn't be an efficiency improvement to do so since the cost of extending those bitmaps is about the same as our loop here, but it might make the code a bit simpler).

Here are perf results for the new test on git.git:

Test                                    HEAD^             HEAD
--------------------------------------------------------------------------------
5310.9: rev-list count with blob:none   1.67(1.62+0.05)   0.22(0.21+0.02) -86.8%
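To see the blob:none filter from that commit message in everyday use, a partial clone is the typical entry point (the URL below is a placeholder; the reachability bitmaps themselves are produced on the server side, for example when repacking with -b):

# clone history and trees, but fetch blobs only on demand
git clone --filter=blob:none https://example.com/some/repo.git

# on the server: write reachability bitmaps while repacking
git repack -a -d -b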
To know more about oid_array, consider Git 2.27 (Q2 2020).
See commit 0740d0a, commit c79eddf, commit 7383b25, commit ed4b804, commit fe299ec, commit eccce52, commit 600bee4 (30 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit a768f86, 22 Apr 2020)
oid_array: use size_t for count and allocation

Signed-off-by: Jeff King

The oid_array object uses an "int" to store the number of items and the allocated size.

It's rather unlikely for somebody to have more than 2^31 objects in a repository (the sha1's alone would be 40GB!), but if they do, we'd overflow our alloc variable.

You can reproduce this case with something like:

git init repo
cd repo

# make a pack with 2^24 objects
perl -e '
  my $nr = 2**24;
  for (my $i = 0; $i < $nr; $i++) {
    print "blob\n";
    print "data 4\n";
    print pack("N", $i);
  }
' | git fast-import

# now make 256 copies of it; most of these objects will be duplicates,
# but oid_array doesn't de-dup until all values are read and it can
# sort the result.
cd .git/objects/pack/
pack=$(echo *.pack)
idx=$(echo *.idx)
for i in $(seq 0 255); do
  # no need to waste disk space
  ln "$pack" "pack-extra-$i.pack"
  ln "$idx" "pack-extra-$i.idx"
done

# and now force an oid_array to store all of it
git cat-file --batch-all-objects --batch-check

which results in:

fatal: size_t overflow: 32 * 18446744071562067968

So the good news is that st_mult() sees the problem (the large number is because our int wraps negative, and then that gets cast to a size_t), doing the job it was meant to: bailing in crazy situations rather than causing an undersized buffer.

But we should avoid hitting this case at all, and instead limit ourselves based on what malloc() is willing to give us. We can easily do that by switching to size_t.

The cat-file process above made it to ~120GB virtual set size before the integer overflow (our internal hash storage is 32-bytes now in preparation for sha256, so we'd expect ~128GB total needed, plus potentially more to copy from one realloc'd block to another). After this patch (and about 130GB of RAM+swap), it does eventually read in the whole set. No test for obvious reasons.
Note that this object was defined in sha1-array.c, which has been renamed oid-array.c: a more neutral name, considering Git will eventually transition from SHA1 to SHA2.
Answered by Chris Kline
One longer-term solution is to augment git to cache filesystem status internally.
Karsten Blees has done so for msysgit, which dramatically improves performance on Windows. In my experiments, his change has taken the time for "git status" from 25 seconds to 1-2 seconds on my Win7 machine running in a VM.
Karsten's changes: https://github.com/msysgit/git/pull/94
Discussion of the caching approach: https://groups.google.com/forum/#!topic/msysgit/fL_jykUmUNE/discussion
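As far as I know, that caching work eventually shipped in Git for Windows as the core.fscache setting, so on Windows a hedged first step is simply:

# Git for Windows only; enables the cached filesystem status described above
git config core.fscache true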
Answered by slobobaby
In general my Mac is OK with git, but if there are a lot of loose objects then it gets much slower. It seems HFS is not so good with lots of files in a single directory.
git repack -ad
Followed by
git gc --prune=now
Will make a single pack file and remove any loose objects left over. It can take some time to run these.
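A quick, read-only way to see whether loose objects are the problem before running the commands above:

# how many loose objects (and how much space) are lying around?
git count-objects -v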
Answered by Brendon McLean
For what it's worth, I recently found a large discrepancy in git status times between my master and dev branches.
To cut a long story short, I tracked down the problem to a single 280MB file in the project root directory. It was an accidental checkin of a database dump so it was fine to delete it.
Here's the before and after:
$ time git status
# On branch master
nothing to commit (working directory clean)
git status 1.35s user 0.25s system 98% cpu 1.615 total
$ rm savedev.sql
$ time git status
# On branch master
# Changes not staged for commit:
# (use "git add/rm <file>..." to update what will be committed)
# (use "git checkout -- <file>..." to discard changes in working directory)
#
# deleted: savedev.sql
#
no changes added to commit (use "git add" and/or "git commit -a")
git status 0.07s user 0.08s system 98% cpu 0.157 total
I have 105,000 objects in store, but it appears that large files are more of a menace than many small files.
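If you suspect a similar stray large file, one rough way to list the biggest blobs in a packed repository (this only inspects existing packfiles, so run git gc first if everything is still loose):

# largest objects end up at the bottom; the third column is the object size in bytes
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -10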
Answered by David Underhill
You could try passing the --aggressive switch to git gc and see if that helps:
# this will take a while ...
git gc --aggressive
Also, you could use git filter-branch to delete old commits and/or files if you have things which you don't need in your history (e.g., old binary files).
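A hedged sketch of that filter-branch approach (path/to/big-file.bin is a placeholder; this rewrites history, so coordinate with anyone sharing the repository):

# drop one unwanted file from every commit
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch path/to/big-file.bin' HEAD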
Answered by baudtack
You also might try git repack
Answered by neoneye
Maybe Spotlight is trying to index the files. Perhaps disable Spotlight for your code dir. Check Activity Monitor and see what processes are running.
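If you want a quick, read-only check of whether Spotlight is involved, mdutil reports per-volume indexing status, and mds/mdworker are the indexer processes to look for in Activity Monitor:

# is Spotlight indexing enabled for the root volume?
mdutil -s /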
Answered by srparish
I'd create a partition using a different file system. HFS+ has always been sluggish for me compared to doing similar operations on other file systems.