PHP regex to match text between commas

Note: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/19512586/

Date: 2020-08-25 19:37:29  Source: igfitidea

Regex to match text between commas

php regex

Asked by SkarXa

I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.

What I need is a regex to count the number of keywords, so that the text can be flagged for a human to check.

The text is usually like this:

[random text, with commas, dots and all]

keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...

I've tried several regex to count the matches:

-This only gets one out of two keywords

[,-](\w|\s)+[,-]

-This also matches the random text

(?:([^,-]*)(?:[^,-]|$))

Can anyone tell me a regex to do this? Or should I take a different approach?

Thanks!

Accepted answer by Jeroen

I think the difficulty is that the random text can also contain commas.

If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.

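<?php
$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3

";

// Trim once and reuse the result, so the offset below refers to the same string.
$text = trim($string);
// Position of the last newline; false if the text is a single line.
// "\n" is used rather than PHP_EOL because user input does not follow the server's line ending.
$lastEOL = strrpos($text, "\n");
$keywordLine = ($lastEOL === false) ? $text : substr($text, $lastEOL + 1);
$keywords = explode(',', $keywordLine);

echo "Number of keywords: " . count($keywords); // 3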

I know it is not a regex, but I hope it helps nevertheless.

The only way to solve this is to find a separator between the random text and the keywords that never occurs inside the keyword list itself. If the keyword list can contain a single new line, you cannot split on that. But are there 2 consecutive new lines between the two parts? Or some other sequence of characters?

$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9

";

// Normalize line endings and trim once, so the offset below refers to the same string.
$text = str_replace("\r\n", "\n", trim($string));
$lastEOL = strrpos($text, "\n\n"); // 2 end of lines after random text
$keywordLine = ($lastEOL === false) ? $text : substr($text, $lastEOL + 2);
$keywords = explode(',', $keywordLine);

echo "Number of keywords: " . count($keywords); // 9

(edit: added example for more new lines - long shot)

Answer by Taemyr

Per your answer to my question, here is a regexp that matches a string occurring between two commas.

(?<=,)[^,]+(?=,)

This regexp does not match, and hence does not consume, the delimiting commas. It would match " and hence does not consume" in the previous sentence.

The fact that your regexp matched and consumed the commas was the reason why your attempted regexp only matched every other candidate.
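
As a rough illustration of that point, the sketch below compares the first pattern from the question with the lookaround version. The one-line `$input` string is made up for the example and is not taken from the question.

```php
<?php
// Hypothetical single-line input.
$input = "a,b,c,d,e";

// Pattern from the question: both delimiters are consumed, so the comma
// that ends one match can no longer start the next one.
$consuming = preg_match_all('/[,-](\w|\s)+[,-]/', $input, $m1);

// Lookaround version: the commas are only asserted, never consumed.
$lookaround = preg_match_all('/(?<=,)[^,]+(?=,)/', $input, $m2);

echo $consuming . "\n";            // 2 (",b," and ",d,")
echo $lookaround . "\n";           // 3
echo implode(' ', $m2[0]) . "\n";  // b c d
```

Note that neither pattern picks up the first or the last item, since those are not enclosed by commas on both sides.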

Also, if the whole input is a single string, you will want to keep matches from spanning line breaks. In that case, use:

(?<=,)[^,\n]+(?=,)

http://www.phpliveregex.com/p/1DJ
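
To see what excluding `\n` changes, here is a small sketch that runs both patterns with `preg_match_all` on a two-line keyword block modeled on the question. The `$input` value and variable names are assumptions for the example.

```php
<?php
// Hypothetical input: two keyword lines held in one string.
$input = "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
       . "Keyword6, keyword7, keyword8";

// Without \n in the character class, a match may begin with the line break.
preg_match_all('/(?<=,)[^,]+(?=,)/', $input, $loose);

// With \n excluded, no match can contain or cross a line break.
preg_match_all('/(?<=,)[^,\n]+(?=,)/', $input, $strict);

echo count($loose[0]) . "\n";   // 6, including "\nKeyword6"
echo count($strict[0]) . "\n";  // 5
print_r(array_map('trim', $strict[0]));
```

With the stricter pattern, `Keyword6` is skipped because it is preceded by a line break rather than a comma, and `keyword1` and `keyword8` are skipped by both patterns because they lack a comma on one side. Counts from either pattern are therefore a lower bound on the number of keywords.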

Answer by Steven

As others have said, this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...

Solution

Anyway, assuming that the keywords are on separate lines from the rest of the input and separated by commas, you can match the lines containing keywords like this:

Regex

#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m

Input

Taken from your question above:

[random text, with commas, dots and all]

keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8

Output

// preg_match_all('#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);

array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
    [1]=>
    string(28) "Keyword6, keyword7, keyword8"
  }
  [1]=>
  array(2) {
    [0]=>
    string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
    [1]=>
    string(28) "Keyword6, keyword7, keyword8"
  }
}

Explanation

#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
  1. # => Starting delimiter
  2. (?:^) => Matches the start of a line in a non-capturing group (you could just use ^; I was originally using |\n and didn't update it)
  3. ( => Start a capturing group
  4. (?: => Start a non-capturing group
  5. (?:[\w]+) => A non-capturing group to match one or more word characters, a-zA-Z0-9_ (using a character class so that you can add to it if you need to)
  6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
  7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
  8. ) => Close the capture group (3)
  9. # => Ending delimiter
  10. m => Multi-line modifier

Follow up from number 2:

#^((?:(?:[\w]+)(?:, ?|$))+)#m


Counting keywords

Having now returned an array of lines containing only keywords, you can count the number of commas and thus get the number of keywords.

$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ',');      // 8

N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of keywords.
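
One possible way around the off-by-one is to split the matched lines instead of counting commas. The following is only a sketch under that assumption: it reuses the regex from this answer, while the `$string` value and the `preg_split` step are additions for illustration.

```php
<?php
// Input modeled on the question (line breaks written explicitly).
$string = "[random text, with commas, dots and all]\n\n"
        . "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
        . "Keyword6, keyword7, keyword8";

// Regex from the answer: captures each line made up only of comma-separated words.
preg_match_all('#^((?:(?:[\w]+)(?:, ?|$))+)#m', $string, $matches);

$keywords = [];
foreach ($matches[1] as $line) {
    // Split on commas with optional surrounding whitespace; PREG_SPLIT_NO_EMPTY
    // drops the empty piece that a trailing comma would otherwise produce.
    $parts = preg_split('/\s*,\s*/', $line, -1, PREG_SPLIT_NO_EMPTY);
    $keywords = array_merge($keywords, $parts);
}

echo count($keywords); // 8
```

Because empty pieces are discarded, the count no longer depends on whether a line ends with a comma.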

Links

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count

Answer by GordonM

Why not just use explode and trim?

$keywords = array_map ('trim', explode (',', $keywordstring));

Then do a count() on $keywords.
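
A minimal sketch of this, assuming the keyword part has already been isolated in `$keywordstring` (the sample value is made up). The `array_filter` step is an addition, so that a trailing comma does not inflate the count.

```php
<?php
// Hypothetical keyword string with a line break and a trailing comma.
$keywordstring = "keyword1, keyword2,\nkeyword3, keyword4,";

// Split on commas and strip surrounding whitespace, including the line break.
$keywords = array_map('trim', explode(',', $keywordstring));

// Drop empty entries left behind by trailing or doubled commas.
$keywords = array_values(array_filter($keywords, 'strlen'));

echo count($keywords); // 4
```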

If you think keywords with spaces in them are spam, then you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
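
The whitespace check could look roughly like this. The `$keywords` values are invented for the example, and whether a hit should count as spam is left to the caller.

```php
<?php
// Hypothetical, already trimmed keyword list.
$keywords = ['keyword1', 'The Tick', 'keyword2', 'Iron Man'];

// Keep only the keywords that contain at least one whitespace character.
$withSpaces = array_values(array_filter($keywords, function ($keyword) {
    return preg_match('/\s/', $keyword) === 1;
}));

print_r($withSpaces); // The Tick, Iron Man
```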

I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.

Answer by MC ND

Your first regexp doesn't need a preceding comma

[\w\s]+[,-]
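
A quick way to try this pattern is to count its matches with `preg_match_all`. The `$input` string below is a made-up keyword line, not taken from the question.

```php
<?php
// Hypothetical keyword line ending with a delimiter.
$input = "keyword1, keyword2, keyword3, keyword4,";

// Each match is a run of word/space characters followed by one delimiter.
$count = preg_match_all('/[\w\s]+[,-]/', $input, $matches);

echo $count; // 4
```

A final keyword with no trailing comma or hyphen would not be matched, so the count can be one short when the list does not end with a delimiter.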