Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/4519739/

Time: 2020-08-25 13:17:51  Source: igfitidea

Split camelCase word into words with php preg_match (Regular Expression)

Tags: php, regex, string, preg-match

Asked by Good-bye

How would I go about splitting the word:

oneTwoThreeFour

into an array so that I can get:

one Two Three Four

with preg_match?

I tried this, but it just gives the whole word:

$words = preg_match("/[a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/", $string, $matches);

Answered by codaddict

You can also use preg_match_all as:

preg_match_all('/((?:^|[A-Z])[a-z]+)/',$str,$matches);

Explanation:

(        - Start of capturing parenthesis.
 (?:     - Start of non-capturing parenthesis.
  ^      - Start anchor.
  |      - Alternation.
  [A-Z]  - Any one capital letter.
 )       - End of non-capturing parenthesis.
 [a-z]+  - One or more lowercase letters.
)        - End of capturing parenthesis.
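
To see the pattern explained above in action, here is a minimal sketch. The names $str and $matches follow the answer, $count is a name picked only for illustration, and the input is the sample word from the question.

```php
<?php
$str = 'oneTwoThreeFour';

// preg_match_all() returns the number of full matches and fills $matches.
$count = preg_match_all('/((?:^|[A-Z])[a-z]+)/', $str, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
// The capture group wraps the whole pattern here, so both hold the same words.
print_r($matches[1]); // one, Two, Three, Four
```

Since the outer parentheses capture exactly what the whole pattern matches, reading $matches[0] should give the same list.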

Answered by codaddict

You can use preg_split as:

$arr = preg_split('/(?=[A-Z])/',$str);

I'm basically splitting the input string just before each uppercase letter. The regex used, (?=[A-Z]), matches the point just before an uppercase letter.

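A small sketch of the lookahead split described above. The helper name splitBeforeUppercase is hypothetical, and the PREG_SPLIT_NO_EMPTY flag is my addition, used as a precaution so that no empty piece is returned when the input starts with a capital letter.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function splitBeforeUppercase(string $str): array
{
    // Split at every position that is followed by an uppercase letter.
    // PREG_SPLIT_NO_EMPTY drops any empty piece from the result.
    return preg_split('/(?=[A-Z])/', $str, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitBeforeUppercase('oneTwoThreeFour'));    // one, Two, Three, Four
print_r(splitBeforeUppercase('hasConsecutiveCAPS')); // has, Consecutive, C, A, P, S
```

The second call shows the limit of this pattern: every capital starts a new piece, so runs of capitals are broken into single letters.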

Answered by ridgerunner

I know that this is an old question with an accepted answer, but IMHO there is a better solution:

<?php // test.php Rev:20140412_0800
$ccWord = 'NewNASAModule';
$re = '/(?#! splitCamelCase Rev:20140412)
    # Split camelCase "words". Two global alternatives. Either g1of2:
      (?<=[a-z])      # Position is after a lowercase,
      (?=[A-Z])       # and before an uppercase letter.
    | (?<=[A-Z])      # Or g2of2; Position is after uppercase,
      (?=[A-Z][a-z])  # and before upper-then-lower case.
    /x';
$a = preg_split($re, $ccWord);
$count = count($a);
for ($i = 0; $i < $count; ++$i) {
    printf("Word %d of %d = \"%s\"\n",
        $i + 1, $count, $a[$i]);
}
?>

Note that this regex (like codaddict's '/(?=[A-Z])/' solution, which works like a charm for well-formed camelCase words) matches only a position within the string and consumes no text at all. This solution has the additional benefit that it also works correctly for not-so-well-formed pseudo-camelCase words such as StartsWithCap and hasConsecutiveCAPS.

Input:

oneTwoThreeFour
StartsWithCap
hasConsecutiveCAPS
NewNASAModule

Output:

Word 1 of 4 = "one"
Word 2 of 4 = "Two"
Word 3 of 4 = "Three"
Word 4 of 4 = "Four"

Word 1 of 3 = "Starts"
Word 2 of 3 = "With"
Word 3 of 3 = "Cap"

Word 1 of 3 = "has"
Word 2 of 3 = "Consecutive"
Word 3 of 3 = "CAPS"

Word 1 of 3 = "New"
Word 2 of 3 = "NASA"
Word 3 of 3 = "Module"

Edited 2014-04-12: Modified regex, script and test data to correctly split the "NewNASAModule" case (in response to rr's comment).

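The script above handles one word per run. As a rough sketch, the same pattern (written on one line, without the comments) can be looped over all four inputs listed above. The wrapper name splitCamelCaseWords is hypothetical.

```php
<?php
// Hypothetical wrapper name; the pattern is the one from this answer.
function splitCamelCaseWords(string $word): array
{
    // Alternative 1: between a lowercase and an uppercase letter.
    // Alternative 2: between an uppercase letter and an upper-then-lower pair.
    $re = '/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/';
    return preg_split($re, $word);
}

$inputs = ['oneTwoThreeFour', 'StartsWithCap', 'hasConsecutiveCAPS', 'NewNASAModule'];
foreach ($inputs as $input) {
    echo $input, ' => ', implode(' ', splitCamelCaseWords($input)), "\n";
}
```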

Answered by blak3r

A functionized version of @ridgerunner's answer.

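/**
 * Converts a camelCase string to have spaces between each word.
 * @param string $camelCaseString
 * @return string
 */
function fromCamelCase($camelCaseString) {
    // Split at each position between a lowercase and an uppercase letter.
    $re = '/(?<=[a-z])(?=[A-Z])/';
    $a = preg_split($re, $camelCaseString);
    // implode() takes the separator first; the legacy join($a, " ")
    // argument order is no longer accepted as of PHP 8.
    return implode(' ', $a);
}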

Answered by rr-

While ridgerunner's answer works great, it seems not to work with all-caps substrings that appear in the middle of the string. I use the following, and it seems to deal with these just fine:

function splitCamelCase($input)
{
    return preg_split(
        '/(^[^A-Z]+|[A-Z][^A-Z]+)/',
        $input,
        -1, /* no limit for replacement count */
        PREG_SPLIT_NO_EMPTY /*don't return empty elements*/
            | PREG_SPLIT_DELIM_CAPTURE /*don't strip anything from output array*/
    );
}

Some test cases:

assert(splitCamelCase('lowHigh') == ['low', 'High']);
assert(splitCamelCase('WarriorPrincess') == ['Warrior', 'Princess']);
assert(splitCamelCase('SupportSEELE') == ['Support', 'SEELE']);
assert(splitCamelCase('LaunchFLEIAModule') == ['Launch', 'FLEIA', 'Module']);
assert(splitCamelCase('anotherNASATrip') == ['another', 'NASA', 'Trip']);

Answered by ArtisticPheonix

$string = preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', $string);

The trick is a repeatable pattern: every lowercase letter or digit ($1) that is immediately followed by an uppercase letter ($2) is replaced with "$1 $2", which puts a space at each lower-to-UPPER boundary. For example, in helloWorld, $1 captures the "o" and $2 captures the "W", so you get "hello World"; HelloWorld likewise becomes "Hello World". Then you can lowercase the words, uppercase the first word, explode them on the space, or use a _ or some other character to keep them separate.

Short and simple.

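A minimal sketch of this replace-then-explode idea, assuming the replacement string is '$1 $2' as the explanation above describes. The helper name spaceOutCamelCase is hypothetical.

```php
<?php
// Hypothetical helper name; pattern and replacement follow the answer above.
function spaceOutCamelCase(string $string): string
{
    // $1 is the lowercase letter or digit, $2 the uppercase letter after it.
    return preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', $string);
}

echo spaceOutCamelCase('helloWorld'), "\n";                  // hello World
print_r(explode(' ', spaceOutCamelCase('oneTwoThreeFour'))); // one, Two, Three, Four
echo spaceOutCamelCase('anotherNASATrip'), "\n";             // another NASATrip
```

The last call suggests that acronyms followed by another word are not separated by this pattern, since there is no lowercase letter or digit in front of the second capital run.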

Answered by mickmackusa

When determining the best pattern for your project, you will need to consider the following pattern factors:

  1. Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
  2. Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
  3. Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
  4. Readability -- the pattern should be kept as simple as possible

The above factors also happen to be in the hierarchical order that I strive to obey. In other words, it doesn't make much sense to me to prioritize 2, 3, or 4 when 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.

Capture Groups and Lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability which can be associated with pattern brevity.

Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:

preg_split() patterns:

  • /^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
  • /(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
  • /[^A-Z]+\K(?=[A-Z])/ (43 steps)
  • /(?=[A-Z])/ (50 steps)
  • /(?=[A-Z]+)/ (50 steps)
  • /([a-z]{1})[A-Z]{1}/ (53 steps)
  • /([a-z0-9])([A-Z])/ (68 steps)
  • /(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
  • /(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)

preg_match_all() patterns:

  • /[A-Z]?[a-z]+/ (14 steps)
  • /((?:^|[A-Z])[a-z]+)/ (35 steps)

I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array; in other words, all of the fullstring matches will be in the [0] subarray, and if a capture group is used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.

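The difference in output shape can be sketched as follows, using two of the patterns from the lists above on the OP's sample input. The variable names are chosen only for illustration.

```php
<?php
$input = 'oneTwoThreeFour';

// preg_match_all() fills a 2-dimensional array:
// index 0 = full matches, index 1 = first capture group.
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $input, $matches);

// preg_split() returns the 1-dimensional list directly.
$pieces = preg_split('/(?=[A-Z])/', $input);

var_export(array_keys($matches)); // the keys 0 and 1
echo "\n";
var_export($pieces);              // 'one', 'Two', 'Three', 'Four'
```

With preg_match_all() the words have to be picked out of a subarray such as $matches[0], while the preg_split() result can be used as it is.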

Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.

New Extended Battery of Test Strings:

oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain 

Suitable preg_split() patterns:

  • /[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
  • /(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)

Suitable preg_match_all() pattern:

  • /[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
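
As a rough check, the extended battery of test strings can be run through the lookaround preg_split() pattern and the preg_match_all() pattern listed above. The variable names here are hypothetical, and the step counts quoted above are not measured by this sketch.

```php
<?php
$battery = ['oneTwoThreeFour', 'hasConsecutiveCAPS', 'newNASAModule', 'USAIsGreatAgain'];

$splitPattern = '/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/';
$matchPattern = '/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/';

$results = [];
foreach ($battery as $input) {
    $bySplit = preg_split($splitPattern, $input);
    preg_match_all($matchPattern, $input, $out);
    // Keep both results side by side so they can be compared.
    $results[$input] = ['split' => $bySplit, 'match' => $out[0]];
    echo $input, ' => ', implode(' ', $bySplit), "\n";
}
```

For these four strings both approaches should give the same word lists, with the acronyms NASA, CAPS and USA kept whole.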

Finally, here are my recommendations based on my pattern principles / factor hierarchy. Also, I recommend preg_split() over preg_match_all() (despite the preg_match_all() patterns having fewer steps) as a matter of directness to the desired output structure. (Of course, choose whatever you like.)

Code: (Demo)

$noAcronyms = 'oneTwoThreeFour';
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);

Code: (Demo)

$withAcronyms = 'newNASAModule';
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);

Answered by Jarrod

I took cool guy Ridgerunner's code (above) and made it into a function:

echo deliciousCamelcase('NewNASAModule');

function deliciousCamelcase($str)
{
    $formattedStr = '';
    $re = '/
          (?<=[a-z])
          (?=[A-Z])
        | (?<=[A-Z])
          (?=[A-Z][a-z])
        /x';
    $a = preg_split($re, $str);
    $formattedStr = implode(' ', $a);
    return $formattedStr;
}

This will return: New NASA Module

Answered by Kobi

Another option is matching /[A-Z]?[a-z]+/ - if you know your input is in the right format, it should work nicely.

[A-Z]? would match an uppercase letter (or nothing). [a-z]+ would then match all following lowercase letters, until the next match.

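A short sketch of this matching approach with preg_match_all(). The helper name matchCamelWords is hypothetical, and the second input is there only to show what can happen when the input is not in the expected format.

```php
<?php
// Hypothetical helper name, for illustration only.
function matchCamelWords(string $input): array
{
    // An optional uppercase letter followed by one or more lowercase letters.
    preg_match_all('/[A-Z]?[a-z]+/', $input, $matches);
    return $matches[0];
}

print_r(matchCamelWords('oneTwoThreeFour')); // one, Two, Three, Four
print_r(matchCamelWords('newNASAModule'));   // new, Module (the acronym is skipped)
```

Because every match needs at least one lowercase letter, a run of capitals such as NASA is left out of the result rather than returned as a word.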

Working example: https://regex101.com/r/kNZfEI/1

Answered by Daniel Rhodes

You can split on a "glide" from lowercase to uppercase thus:

$parts = preg_split('/([a-z]{1})[A-Z]{1}/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);        
//PREG_SPLIT_DELIM_CAPTURE to also return bracketed things
var_dump($parts);

Annoyingly, you will then have to rebuild the words from each corresponding pair of items in $parts.

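The rebuilding step could look roughly like the sketch below. As far as I can tell, the pattern as written consumes the uppercase letter without capturing it, so that letter cannot be recovered from $parts. The sketch therefore assumes a variant that captures both letters, /([a-z])([A-Z])/, and the function name splitOnGlide is hypothetical.

```php
<?php
// Hypothetical function; assumes a variant pattern that captures both letters.
function splitOnGlide(string $string): array
{
    $parts = preg_split('/([a-z])([A-Z])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
    // $parts is laid out as: text, lower, upper, text, lower, upper, ..., text
    $words = [];
    $current = '';
    for ($i = 0, $n = count($parts); $i < $n; $i += 3) {
        $current .= $parts[$i];
        if ($i + 2 < $n) {
            // The lowercase letter closes the current word,
            // the uppercase letter opens the next one.
            $words[] = $current . $parts[$i + 1];
            $current = $parts[$i + 2];
        }
    }
    $words[] = $current;
    return $words;
}

print_r(splitOnGlide('oneTwoThreeFour')); // one, Two, Three, Four
```

The amount of bookkeeping needed here is a fair argument for the zero-width lookaround patterns in the other answers, which leave nothing to stitch back together.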

Hope this helps
