.NET: Best HashTag Regex

Notice: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license and attribute it to the original authors (not me), including the original URL and author information. Original StackOverflow question: http://stackoverflow.com/questions/1563844/

Date: 2020-09-03 13:26:32  Source: igfitidea

Best HashTag Regex

Tags: .net, regex, twitter

Asked by Kevin Mark

I'm trying to find all the hashtags in a string. The hashtags come from a stream like Twitter, and they could be anywhere in the text, like:

this is a #awesome event, lets use the tag #fun

I'm using the .NET Framework (C#), and I was thinking this would be a suitable regex pattern to use:

#\w+

Is this the best regex for this purpose?

Accepted answer by bobbymcr

It depends on whether you want to match hashtags inside other strings ("Some#Word") or things that probably aren't hashtags ("We're #1"). The regex you gave, #\w+, will match in both of these cases. If you slightly modify your regex to \B#\w\w+, you can eliminate these cases and only match hashtags of at least two characters that are not directly preceded by a word character.
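
As a rough illustration of the difference, here is a minimal sketch using Python's `re` module rather than .NET (an assumption made purely for the example; for these ASCII inputs both engines treat `\B` and `\w` the same way). The sample strings come from the question and the paragraph above; the variable names are made up.

```python
import re

# Sample inputs: the question's example plus the two edge cases discussed above.
samples = [
    "this is a #awesome event, lets use the tag #fun",
    "Some#Word",
    "We're #1",
]

loose = re.compile(r"#\w+")       # pattern from the question
strict = re.compile(r"\B#\w\w+")  # modified pattern from this answer

# One list of matches per sample string.
loose_hits = [loose.findall(s) for s in samples]
strict_hits = [strict.findall(s) for s in samples]

print(loose_hits)   # [['#awesome', '#fun'], ['#Word'], ['#1']]
print(strict_hits)  # [['#awesome', '#fun'], [], []]
```

The stricter pattern drops "#Word" because the "#" directly follows a word character, and drops "#1" because it requires at least two word characters after the "#".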

Answer by arcain

If you are pulling statuses containing hashtags from Twitter, you no longer need to find them yourself. You can now specify the include_entities parameter to have Twitter automatically call out mentions, links, and hashtags.

For example, take the following call to statuses/show:

http://api.twitter.com/1/statuses/show/60183527282577408.json?include_entities=true

In the resultant JSON, notice the entities object.

"entities":{"urls":[{"expanded_url":null,"indices":[68,88],"url":"http:\/\/bit.ly\/gWZmaJ"}],"user_mentions":[],"hashtags":[{"text":"wordpress","indices":[89,99]}]}

You can use the above to locate the specific entities in the tweet (which occur between the string positions denoted by the indices property) and transform them appropriately.
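
A minimal sketch of that transformation, written in Python purely for illustration. The tweet text, the indices, the link URL format, and the helper name `link_hashtags` are all hypothetical; only the shape of the entities object follows the JSON above, and the sketch assumes the indices are character offsets into the text.

```python
import json

# Hypothetical tweet text and matching entities (not the real tweet from the call above).
text = "Trying out a new #wordpress theme"
entities = json.loads(
    '{"urls": [], "user_mentions": [], '
    '"hashtags": [{"text": "wordpress", "indices": [17, 27]}]}'
)

def link_hashtags(text, entities):
    """Wrap each hashtag span in an HTML link.

    Spans are processed right to left so that earlier indices stay valid
    after each replacement.
    """
    tags = sorted(entities["hashtags"], key=lambda t: t["indices"][0], reverse=True)
    for tag in tags:
        start, end = tag["indices"]
        # The URL format here is an assumption made for the example.
        link = '<a href="https://twitter.com/hashtag/{0}">#{0}</a>'.format(tag["text"])
        text = text[:start] + link + text[end:]
    return text

linked = link_hashtags(text, entities)
print(linked)
```

Working from the last entity to the first avoids having to adjust offsets as the string grows.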

If you just need the regular expression to locate the hashtags, Twitter provides these in an open source library.

Hashtag Match Pattern

(^|[^&\p{L}\p{M}\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7])(#|\uFF03)(?!\uFE0F|\u20E3)([\p{L}\p{M}\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*[\p{L}\p{M}][\p{L}\p{M}\p{Nd}_\u200c\u200d\ua67e\u05be\u05f3\u05f4\u309b\u309c\u30a0\u30fb\u3003\u0f0b\u0f0c\u00b7]*)

The above pattern can be pieced together from this Java file (retrieved 2015-11-23). Validation tests for this pattern are located in this file, around line 128.

Answer by Kevin Mark

After looking at the previous answers here and making some test tweets to see what Twitter liked, I think I've come up with a solid regular expression that should do the trick. It requires lookaround functionality in the regular expression engine, so it might not work with all engines out there. It should still work fine for .NET and PCRE.

(?:(?<=\s)|^)#(\w*[A-Za-z_]+\w*)

According to RegexBuddy, this does the following: [image: RegexBuddy Create View]

And again, according to RegexBuddy, here is what it matches: [image: RegexBuddy Test View]

Anything highlighted is part of the match. The darker highlighted part indicates what is returned from the capture.

Edit Dec 2014:
Here's a slightly simplified version from zero323 that should be functionally equivalent:

(?<=\s|^)#(\w*[A-Za-z_]+\w*)
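
To make the match versus capture distinction concrete, here is a small sketch in Python (an assumption for illustration; the answer targets .NET and PCRE). It uses the original form of the pattern, because Python's `re` module only accepts fixed-width lookbehinds and therefore rejects the simplified `(?<=\s|^)` form. The sample text is made up.

```python
import re

# Original (pre-edit) pattern; group 1 is the tag without the leading '#'.
HASHTAG = re.compile(r"(?:(?<=\s)|^)#(\w*[A-Za-z_]+\w*)")

# Made-up sample covering a tag at the start of the string, a '#' embedded
# in a word, a digits-only tag, and a tag that starts with a digit.
sample = "#first this is a #awesome event, Some#Word is #1 with #2cool"

# findall returns the contents of the single capture group for each match.
tags = HASHTAG.findall(sample)
print(tags)  # ['first', 'awesome', '2cool']
```

"Some#Word" is skipped because the "#" is not preceded by whitespace, and "#1" is skipped because the tag must contain at least one letter or underscore.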

Answer by go minimal

I tweeted a string with randomly placed hash tags, saw what Twitter did with it, and then tried to match it with a regular expression. Here's what I got:

\B#\w*[a-zA-Z]+\w*

#face#Fa!ce something #iam#1 #1 #919 #jifdosajsomethin#idfsjoa 9#9#98 9#9f9j#9jlasdjl #jklfdsajl34#34239 #jkf#a*#1j3rj3

#face #Fa!ce something #iam#1 #1 #919 #jifdosajsomethin#idfsjoa 9#9#98 9#9f9j#9jlasdjl #jklfdsajl34#34239 #jkf#a *#1j3rj3
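
As a quick check of how this pattern behaves, here is a minimal sketch in Python (chosen only for illustration). The sample is a shortened, hand-picked subset of the tokens in the test tweet above, not the full string, and the expected result reflects the regex only, not what Twitter actually linked.

```python
import re

PATTERN = re.compile(r"\B#\w*[a-zA-Z]+\w*")

# Shortened sample built from tokens that appear in the test tweet above.
sample = "#face #iam#1 #1 #919 9#9f9j #jkf#a"

# No capture groups, so findall returns the full matches including '#'.
matches = PATTERN.findall(sample)
print(matches)  # ['#face', '#iam', '#jkf']
```

Tags made only of digits ("#1", "#919") are rejected because at least one letter is required, and any "#" that directly follows a word character ("iam#1", "9#9f9j", "jkf#a") is rejected by `\B`.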

Answer by Homer6

As far as I can tell, this pattern works the best. The others posted here don't take into account that a hashtag starting with numbers is invalid. Please ensure that you only use the second capturing group when you extract the hashtag.

(^|\s)#([A-Za-z_][A-Za-z0-9_]*)

Note, I've also explicitly limited lookaheads and lookbehinds because of their performance penalties.
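
A short sketch of reading the second capturing group, in Python for illustration only (the sample text and variable names are made up). Group 1 holds the leading whitespace or the empty match at the start of the string, so only group 2 is the hashtag itself.

```python
import re

HASHTAG = re.compile(r"(^|\s)#([A-Za-z_][A-Za-z0-9_]*)")

# Made-up sample: a tag at the start, one starting with a digit,
# one with digits later on, and a '#' embedded in a word.
sample = "#first tag, #2cool and #fun_1 but not Some#Word"

# Take group 2 of every match; group 1 is just the preceding delimiter.
tags = [m.group(2) for m in HASHTAG.finditer(sample)]
print(tags)  # ['first', 'fun_1']
```

"#2cool" is skipped because the first character after "#" must be a letter or underscore.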

Answer by Leo Cavalcante

This is what I use:

/#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])/g

link of the hashtag Regex to test

Answer by Yifan

/#((\w|[\u00C0-\uFFDF])+)/g

Reference: Unicode Table

Answer by Carter Cole

This is the one I wrote. It looks for word boundaries and matches only the hashtag text: (?<=#)\w*?(?=\W)
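
For reference, a minimal sketch of how this pattern behaves, using Python's `re` module as an assumption for illustration (the sample text is made up). Because the lookahead requires a following non-word character, a hashtag at the very end of the input is not returned.

```python
import re

PATTERN = re.compile(r"(?<=#)\w*?(?=\W)")

# Made-up sample with one tag in the middle and one at the end of the string.
sample = "#fun and #end"

# The match excludes the '#' itself because it sits in the lookbehind.
matches = PATTERN.findall(sample)
print(matches)  # ['fun']
```

Appending a trailing space or punctuation mark to the input is one way to have the final tag picked up as well.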

Answer by Gabriel Magno

I've tested some tweets, and realized that hashtags:

  • Are composed of alphanumeric characters plus underscore.
  • Must have at least 1 letter or underscore.
  • May have the dot character, but the hashtag will then be interpreted as a link to an external site. (I do not consider this case.)

So, that's what I've got:

\B#(\w*[A-Za-z_]+\w*)
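
The rules listed above can be checked with a short sketch, written in Python for illustration only (the sample tokens are made up). Group 1 is the tag without the leading "#".

```python
import re

HASHTAG = re.compile(r"\B#(\w*[A-Za-z_]+\w*)")

# Made-up sample: letters only, digits only, digits plus a letter,
# a lone underscore, a tag containing a dot, and a '#' inside a word.
sample = "#abc #123 #12a #_ #a.b x#y"

# findall returns the contents of the single capture group for each match.
tags = HASHTAG.findall(sample)
print(tags)  # ['abc', '12a', '_', 'a']
```

"#123" is rejected because it has no letter or underscore, "#a.b" stops at the dot, and "x#y" is rejected because the "#" directly follows a word character.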