Original question: http://stackoverflow.com/questions/468370/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
A Regex to match a SHA1
Asked by git-noob
I'm trying to match SHA1s in generic text with a regular expression.
Ideally I want to avoid matching words.
It's safe to say that full SHA1s have a distinctive pattern (they're long and a consistent length), so I can match these reliably. But what about abbreviated SHA1s?
Can I rely on the presence of numbers?
Looking at the SHA1s in my commit log, numbers always appear in the first 3 characters. But is this too short? How many characters of a SHA1 do I need to consider before I can assume a number would have appeared?
This does not have to be 100% accurate. I just need to match an abbreviated SHA1 99% of the time.
Answer by Greg Hewgill
You can consider SHA1 hashes to be completely random, so this reduces to a matter of probabilities. The probability that a given hex digit is a letter (a through f) rather than a numeral is 6/16, or 0.375. The probability that three SHA1 digits are all letters is 0.375 ** 3, or 0.0527 (about 5%). At five digits, the probability of all letters drops below 1% (you said you wanted to match 99% of the time). At six digits, it falls further to 0.00278 (about 0.28%).
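The arithmetic above can be reproduced with a few lines of Python. This is only a sketch under the same assumption the answer makes (hex digits are independent and uniformly distributed); the function name prob_all_letters is made up for illustration.

```python
# Probability that the first n hex digits of a uniformly random hash
# are all letters (a-f), i.e. contain no numeral at all.
P_LETTER = 6 / 16  # six letters among the sixteen hex digits


def prob_all_letters(n: int) -> float:
    """Chance that n independent hex digits are all in a-f."""
    return P_LETTER ** n


# Print the probability for prefix lengths 3 through 7.
for n in range(3, 8):
    print(f"{n} digits: {prob_all_letters(n):.5f}")
```

The first length at which the value falls under 0.01 is the shortest prefix where requiring a numeral still matches about 99% of hashes.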
It's easy to craft a regular expression that always matches SHA1 values:
\b[0-9a-f]{5,40}\b
However, this may also match perfectly good five-letter words, like "added" or "faded". In my /usr/share/dict/words file, there are several six-letter words that would match: "accede", "beaded", "bedded", "decade", "deface", "efface", and "facade" are the most likely. At seven letters, there is only "deedeed", which is unlikely to appear in prose. It all depends on how many false positives you can tolerate, and what words you are actually likely to encounter.
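To make the false-positive problem concrete, here is a small Python sketch comparing the pattern above with a stricter variant that also requires at least one numeral, which is the idea raised in the question. The stricter pattern and the sample sentence are assumptions for illustration, not something this answer proposes.

```python
import re

# The pattern from the answer: 5 to 40 lower-case hex digits as a whole word.
LOOSE = re.compile(r"\b[0-9a-f]{5,40}\b")

# Illustrative variant: the lookahead demands a numeral somewhere in the
# run of hex digits, so all-letter words are rejected.
WITH_DIGIT = re.compile(r"\b(?=[a-f]*[0-9])[0-9a-f]{5,40}\b")

sample = "Fixed in e78fd98; the facade was added a decade ago."

loose_hits = LOOSE.findall(sample)
strict_hits = WITH_DIGIT.findall(sample)

print(loose_hits)   # includes the dictionary words
print(strict_hits)  # only the candidate that contains a numeral
```

The trade-off is the one computed earlier in this answer: the stricter pattern silently skips any abbreviated hash that happens to consist of letters only.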
Answer by jrockway
What exactly are you trying to do? You shouldn't need to parse anything git outputs with heuristics; you can always request exactly the data you need.
If you want to match a full hex representation of a SHA1 sum, try:
/\b([a-f0-9]{40})\b/
That is, a word consisting of 40 characters which are either digits or the letters a through f.
If you only have a few characters and don't know where they came from, then you are pretty much out of luck. Is "e78fd98" an abbreviated commit ID? Maybe, but what about "1234567"? Is that a commit ID? A problem ticket number? A number that makes a test fail?
Without context, you can't really know what the data means.
To answer your direct question, there is no property of SHA1 that would make the first three characters (in hex form) digits. You are just lucky, or perhaps unlucky, depending on how you look at it.
Answer by bendin
I'm going to assume you want to match against the hexadecimal printed representation of a SHA1, and not against the equivalent 20 raw bytes. Furthermore, I'm going to assume that the SHA1s in question use only lower-case letters to represent hex digits. You'll have to adjust the regular expression if your requirements differ.
grep -o -E -e "[0-9a-f]{40}"
This will match such a SHA1. You'll need to translate the above regular expression from egrep's dialect to whatever tool you happen to be using. Since the match must be exactly 40 characters long, I don't think you're in danger of accidentally matching words. I don't know of any 40-character words that consist only of the letters a through f.
Edit:
Better yet: use jrockway's regex, as his solution includes checking for word boundaries at both ends. I overlooked that above.
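The difference the word boundaries make can be seen in a short Python sketch. The sample values are made up for illustration: sha1 is just a 40-digit hex string, and longer stands in for something like a 64-digit hash that is not a SHA1 at all.

```python
import re

# Without boundaries, any 40-digit slice of a longer hex string matches.
UNANCHORED = re.compile(r"[0-9a-f]{40}")
# With \b on both sides, the hex run must be exactly 40 digits long.
BOUNDED = re.compile(r"\b[0-9a-f]{40}\b")

sha1 = "43a402b7b61d8986c5cead5296d92e7b6498456a"  # 40 hex digits
longer = sha1 + "0123456789abcdef01234567"          # 64 hex digits

unanchored_hit = UNANCHORED.search(longer)  # matches the first 40 digits
bounded_hit = BOUNDED.search(longer)        # no match: the run is too long
in_prose = BOUNDED.search("commit " + sha1 + " fixed it")

print(unanchored_hit is not None, bounded_hit is None, in_prose is not None)
```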
Answer by Neil Mayhew
If you have access to the repo, you can use git cat-file -e to check for sure that a candidate represents an object in the repo. This is very fast, too. If you further want to restrict this to just commits and tags, you can use git cat-file -t to find out the type of the object.
This could be used, for example, to search human-generated text for mentions of git commits and generate hyperlinks to a git web interface.
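A rough Python sketch of that hyperlinking idea follows. It assumes git is on the PATH and that the candidate pattern is the 5 to 40 digit one from an earlier answer. The names git_object_type and link_commits, the lookup parameter, and the URL layout are all hypothetical.

```python
import re
import subprocess

# Candidate pattern borrowed from an earlier answer (an assumption here).
CANDIDATE = re.compile(r"\b[0-9a-f]{5,40}\b")


def git_object_type(name, repo="."):
    """Return the type printed by `git cat-file -t`, or None if git rejects the name."""
    result = subprocess.run(
        ["git", "-C", repo, "cat-file", "-t", name],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def link_commits(text, base_url, lookup=git_object_type):
    """Wrap candidates that resolve to a commit or tag in an HTML link."""
    def replace(match):
        name = match.group(0)
        if lookup(name) in ("commit", "tag"):
            # base_url layout is hypothetical; adapt to the web interface in use.
            return f'<a href="{base_url}/{name}">{name}</a>'
        return name  # anything git does not recognise is left unchanged
    return CANDIDATE.sub(replace, text)
```

Passing the lookup as a parameter lets the text-processing part be tried without a repository, for example with a stub that recognises a single known ID. One caveat: git resolves ref names as well as hashes, so a branch that happens to be called "facade" would also be linked.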
Answer by JeffCharter
I use this in Ruby. It allows for a short version of the SHA (6 to 8 characters, in case of clashes) and for the full SHA at 40 characters long.
\A(([0-9a-f]{40})|([0-9a-f]{6,8}))\z
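For readers not using Ruby, a roughly equivalent check in Python might look like the sketch below. The function name looks_like_sha is made up; re.fullmatch is used so that the whole string has to match, which plays the role of the \A and \z anchors.

```python
import re

# Either a full 40-digit SHA or a 6 to 8 digit abbreviation.
SHA_FULL_OR_SHORT = re.compile(r"[0-9a-f]{40}|[0-9a-f]{6,8}")


def looks_like_sha(value: str) -> bool:
    """True if the entire string is a full or abbreviated lower-case hex SHA."""
    # fullmatch rejects trailing characters, including a trailing newline.
    return SHA_FULL_OR_SHORT.fullmatch(value) is not None


print(looks_like_sha("e78fd98"))    # 7 digits: accepted
print(looks_like_sha("e78fd98ab"))  # 9 digits: rejected
```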
Answer by Dededede4
For this type of hash: 43:A4:02:B7:B6:1D:89:86:C5:CE:AD:52:96:D9:2E:7B:64:98:45:6A
/^[0-9A-F]{2}(:[0-9A-F]{2}){19}$/
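As an illustration, the same pattern can be used in Python to validate a colon-separated fingerprint and convert it to the plain 40-digit form matched by the other answers. The function name fingerprint_to_hex is hypothetical, and the conversion step is an addition not found in this answer.

```python
import re

# Twenty upper-case hex byte pairs separated by colons.
FINGERPRINT = re.compile(r"[0-9A-F]{2}(?::[0-9A-F]{2}){19}")


def fingerprint_to_hex(value: str):
    """Return the 40-digit lower-case hex form, or None if the format is wrong."""
    if FINGERPRINT.fullmatch(value) is None:
        return None
    return value.replace(":", "").lower()


print(fingerprint_to_hex(
    "43:A4:02:B7:B6:1D:89:86:C5:CE:AD:52:96:D9:2E:7B:64:98:45:6A"
))
```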