How does git detect similar files, for its rename detection?
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/7938582/
Asked by mahemoff
Wikipedia explains the automatic rename detection:
Briefly, given a file in revision N, a file of the same name in revision N − 1 is its default ancestor. However, when there is no like-named file in revision N − 1, Git searches for a file that existed only in revision N − 1 and is very similar to the new file.
Rename detection apparently boils down to similar file detection. Is that algorithm documented anywhere? It would be nice to know what kinds of transformations are detected automatically.
Answered by manojlds
Git tracks file contents, not filenames. So renaming a file without changing its content is easy for git to detect. (Git does not track renames, it detects them; using git mv or git rm followed by git add is effectively the same.)
When a file is added to the repository, the filename is recorded in the tree object. The actual file contents are added as a binary large object (blob) in the repository. Git will not add another blob for additional files that contain the same content. In fact, Git cannot, because a blob is stored in the filesystem under its content hash: the first two characters of the hash are the directory name and the rest is the name of the file within it. So detecting a rename with unchanged content is a matter of comparing hashes.
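To make the hashing point concrete, here is a minimal Python sketch of how a blob ID is derived in a SHA-1 repository: the hash covers a short "blob <size>" header plus the raw content, and never the filename. The function names and the two example paths are hypothetical, chosen only for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID git gives a blob with this content (SHA-1 repositories)."""
    # The hashed data is "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Path of a loose object, relative to the .git directory."""
    # First two hex characters name the directory, the rest names the file.
    return f"objects/{object_id[:2]}/{object_id[2:]}"


# Two hypothetical paths holding identical content.
old_tree = {"README.txt": b"hello\n"}
new_tree = {"docs/README.md": b"hello\n"}

old_id = git_blob_id(old_tree["README.txt"])
new_id = git_blob_id(new_tree["docs/README.md"])

print(old_id)                     # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object_path(old_id))  # objects/ce/013625030ba8dba906f756967f9e9ca394464a
print(old_id == new_id)           # True: same content, same blob, whatever the name
```

Because the filename is not part of the hashed data, an unchanged file that moves to a new path still points at the same blob, which is why an exact rename can be spotted by comparing IDs alone.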
To detect small changes to a renamed file, Git uses certain algorithms and a threshold limit to see if this is a rename. For example, have a look at the -M flag for git diff. There are also configuration values such as merge.renameLimit (the number of files to consider when performing rename detection during a merge).
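The idea of a similarity threshold can be sketched in a few lines of Python. This is not git's scoring algorithm: it swaps in the standard library's difflib.SequenceMatcher on lines as a stand-in, and only the 50% default mirrors what git diff -M documents. The names similarity, looks_like_rename and DEFAULT_THRESHOLD are made up for this example.

```python
from difflib import SequenceMatcher

# Assumption: mirrors the documented 50% default of `git diff -M`.
DEFAULT_THRESHOLD = 0.5


def similarity(old: str, new: str) -> float:
    """Line-based similarity in [0, 1]; a stand-in, not git's own score."""
    old_lines = old.splitlines(keepends=True)
    new_lines = new.splitlines(keepends=True)
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    return matcher.ratio()


def looks_like_rename(old: str, new: str,
                      threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Treat the pair as a rename when the score reaches the threshold."""
    return similarity(old, new) >= threshold


original = "".join(f"line {i}\n" for i in range(10))
edited = original.replace("line 3\n", "line three\n")   # one line changed
unrelated = "".join(f"other {i}\n" for i in range(10))  # nothing in common

print(similarity(original, edited))            # 0.9
print(looks_like_rename(original, edited))     # True
print(looks_like_rename(original, unrelated))  # False
```

Raising the threshold, which is what passing a higher percentage to -M does in git, makes the check stricter: with threshold=0.95 the edited file above would no longer count as a rename in this sketch.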
To understand how git treats similar files (i.e., what file transformations are considered renames), explore the configuration options and flags mentioned above. For that you need not be concerned with the how. To understand how git actually accomplishes these tasks, look at the algorithms for finding differences in text, and read the git source code.
These algorithms are applied only for diff, merge, and log purposes; they do not affect how git stores files. Any small change in file content means a new object is added for it. There is no delta or diff happening at that level. Of course, the objects might later be packed, with deltas stored in packfiles, but that is not related to rename detection.
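Since nothing about a rename is stored, it has to be recomputed from two snapshots whenever a diff is requested. The sketch below illustrates that in Python under stated assumptions: snapshots are plain dicts mapping path to content, exact matches are found by hash first, and the remaining deleted and added paths are paired by a difflib-based score that stands in for git's real scoring. All names here are hypothetical, and git's actual implementation differs in its scoring and its limits.

```python
import hashlib
from difflib import SequenceMatcher


def blob_id(content: bytes) -> str:
    """Blob ID as in a SHA-1 repository: hash of header plus content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def similarity(a: bytes, b: bytes) -> float:
    """Line-based stand-in score in [0, 1]; not git's own algorithm."""
    return SequenceMatcher(None, a.splitlines(keepends=True),
                           b.splitlines(keepends=True), autojunk=False).ratio()


def detect_renames(old: dict, new: dict, threshold: float = 0.5) -> dict:
    """Pair paths deleted from `old` with paths added in `new`.

    Returns {new_path: (old_path, score)}.
    """
    deleted = {p: c for p, c in old.items() if p not in new}
    added = {p: c for p, c in new.items() if p not in old}
    renames = {}

    # Pass 1: exact renames, found by comparing content hashes only.
    by_hash = {}
    for path in sorted(deleted):
        by_hash.setdefault(blob_id(deleted[path]), path)
    for path in sorted(added):
        source = by_hash.pop(blob_id(added[path]), None)
        if source is not None:
            renames[path] = (source, 1.0)
            del deleted[source]

    # Pass 2: inexact renames, best-scoring pairs first.
    candidates = []
    for new_path, new_content in added.items():
        if new_path in renames:
            continue
        for old_path, old_content in deleted.items():
            score = similarity(old_content, new_content)
            if score >= threshold:
                candidates.append((score, old_path, new_path))
    used_sources = set()
    for score, old_path, new_path in sorted(
            candidates, key=lambda c: (-c[0], c[1], c[2])):
        if old_path in used_sources or new_path in renames:
            continue
        renames[new_path] = (old_path, score)
        used_sources.add(old_path)
    return renames


before = {"a.txt": b"one\ntwo\nthree\nfour\n",
          "b.txt": b"alpha\nbeta\n",
          "keep.txt": b"same\n"}
after = {"docs/a.txt": b"one\ntwo\nthree\nfour\n",  # moved, unchanged
         "c.txt": b"alpha\nbeta\ngamma\n",          # moved and edited
         "keep.txt": b"same\n",
         "new.txt": b"brand new\n"}                 # genuinely new

for new_path, (old_path, score) in sorted(detect_renames(before, after).items()):
    print(f"{old_path} -> {new_path} ({score:.0%})")
```

Only paths that disappeared are considered as sources and only paths that appeared are considered as targets, which matches the "existed only in the previous revision" condition quoted in the question.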
Answered by GolezTrol
There are many algorithms that detect similarities between texts, and version control systems often use these already to store only the difference between two versions. Tools like WinMerge are smart enough to detect differences, even within lines, so I don't see a reason why these algorithms would not be used for this rename detection.
Here is a discussion about algorithms to detect similar texts. Some of these algorithms might be optimized for natural languages, while others may work better for source code, but in essence they are very much alike.