Regular Expression to split on spaces unless in quotes
Original question: http://stackoverflow.com/questions/554013/
License: this content is provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Shaun Bowe
I would like to use the .Net Regex.Split method to split this input string into an array. It must split on whitespace unless it is enclosed in a quote.
Input: Here is "my string" it has "six matches"
Expected output:
- Here
- is
- my string
- it
- has
- six matches
What pattern do I need? Also do I need to specify any RegexOptions?
Accepted answer by Bartek Szabat
No options required
Regex:
\w+|"[\w\s]*"
C#:
Regex regex = new Regex(@"\w+|""[\w\s]*""");
Or if you need to exclude " characters:
Regex
.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
.Cast<Match>()
.Select(m => m.Groups["match"].Value)
.ToList()
.ForEach(s => Console.WriteLine(s));
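As a rough sketch, the pattern above could be wrapped in a small reusable helper. The class name `QuoteAwareSplitter` and the method name `Tokenize` are made up for illustration and are not part of the answer. Like the original pattern, it only accepts word characters and whitespace, so punctuation such as a dot will break a token apart.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the answer.
public static class QuoteAwareSplitter
{
    // Same pattern as in the answer: either a bare word, or a quoted run of
    // word/whitespace characters. Both alternatives capture into "match",
    // so the surrounding quotes are not part of the captured value.
    private static readonly Regex TokenPattern =
        new Regex(@"(?<match>\w+)|""(?<match>[\w\s]*)""");

    public static List<string> Tokenize(string input)
    {
        return TokenPattern.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value)
            .ToList();
    }
}
```

Calling `QuoteAwareSplitter.Tokenize("Here is \"my string\" it has \"six matches\"")` should give the six tokens listed in the question, without the quotes.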
Answer by Grzenio
EDIT: Sorry for my previous post, this is obviously possible.
To handle all the non-alphanumeric characters you need something like this:
MatchCollection matchCollection = Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""");
foreach (Match match in matchCollection)
{
    yield return match.Groups["match"].Value;
}
You can make the foreach smarter if you are using a .NET version later than 2.0.
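One way to read "smarter" is to replace the loop with LINQ, which needs .NET 3.5 or later. The following is only a sketch of that idea; the class name `QuotedTokenizer` and method name `Tokens` are hypothetical, and the pattern is copied from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper; only the regex pattern comes from the answer.
public static class QuotedTokenizer
{
    public static IEnumerable<string> Tokens(string input)
    {
        // An unquoted run of non-space, non-quote characters, or anything
        // between a pair of double quotes. Both capture into "match".
        return Regex.Matches(input, @"(?<match>[^""\s]+)|\""(?<match>[^""]*)""")
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value);
    }
}
```

Because the character classes exclude only quotes and whitespace, a token such as `lib.lib` stays in one piece.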
Answer by Lieven Keersmaekers
Shaun,
I believe the following regex should do it
(?<=")\w[\w\s]*(?=")|\w+
Regards,
Lieven
Answer by John Conrad
This regex will split based on the case you have given above, although it does not strip the quotes or extra spaces, so you may want to do some post processing on your strings. This should correctly keep quoted strings together though.
"[^"]+"|\s?\w+?\s
Answer by Liudvikas Bukys
With a little bit of messiness, regular languages can keep track of even/odd counting of quotes, but if your data can include escaped quotes (\") then you're in real trouble producing or comprehending a regular expression that will handle that correctly.
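For what it is worth, here is one possible pattern for backslash-escaped quotes, offered as a sketch rather than a complete solution. It assumes the backslash is the only escape character, that quotes are balanced, and that escapes only matter inside quoted sections. The names `EscapedQuoteTokenizer` and `Tokenize` are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; assumes backslash escapes and balanced quotes.
public static class EscapedQuoteTokenizer
{
    // Quoted token: any mix of "backslash + any char" or "neither quote nor backslash".
    // Unquoted token: a run of characters that are neither whitespace nor a quote.
    private static readonly Regex TokenPattern = new Regex(
        @"""(?<token>(?:\\.|[^""\\])*)""|(?<token>[^\s""]+)");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(input))
        {
            string value = m.Groups["token"].Value;
            // Only quoted tokens are unescaped: drop the backslash and keep
            // the character that follows it.
            if (m.Value[0] == '"')
                value = Regex.Replace(value, @"\\(.)", "$1");
            tokens.Add(value);
        }
        return tokens;
    }
}
```

Unbalanced quotes and other escaping conventions (such as doubled quotes) are not handled here, which is roughly the trouble the answer is warning about.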
Answer by Adam L
Take a look at LSteinle's "Split Function that Supports Text Qualifiers" over at CodeProject.
Here is the snippet from his project that you're interested in.
using System;
using System.Text.RegularExpressions;
public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    // Match the delimiter only when an even number of qualifiers follows it,
    // i.e. when the delimiter sits outside a qualified (quoted) section.
    string _Statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
        Regex.Escape(delimiter), Regex.Escape(qualifier));
    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;
    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}
Just watch out for calling this in a loop, as it's creating and compiling the Regex statement every time you call it. So if you need to call it more than a handful of times, I would look at creating a Regex cache of some kind.
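A minimal sketch of such a cache, assuming .NET 4.0 or later for `ConcurrentDictionary`. The class name `QualifiedSplitter` and the cache-key format are made up for this example; the pattern and options are the ones from the snippet above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical caching wrapper around the Split function shown above.
public static class QualifiedSplitter
{
    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>();

    public static string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
    {
        // The key covers every input that changes the compiled regex.
        // A NUL separator is assumed not to occur in delimiter or qualifier.
        string key = delimiter + "\u0000" + qualifier + "\u0000" + ignoreCase;
        Regex regex = Cache.GetOrAdd(key, _ =>
        {
            string pattern = String.Format(
                "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                Regex.Escape(delimiter), Regex.Escape(qualifier));
            RegexOptions options = RegexOptions.Compiled | RegexOptions.Multiline;
            if (ignoreCase) options |= RegexOptions.IgnoreCase;
            return new Regex(pattern, options);
        });
        return regex.Split(expression);
    }
}
```

With this in place, repeated calls using the same delimiter and qualifier reuse one compiled `Regex`. Note that the qualifiers stay in the output, for example `QualifiedSplitter.Split("Here is \"my string\" it", " ", "\"", false)` keeps the quotes around `my string`.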
Answer by Timothy Walters
Lieven's solution gets most of the way there, and as he states in his comments it's just a matter of changing the ending to Bartek's solution. The end result is the following working regEx:
(?<=")\w[\w\s]*(?=")|\w+|"[\w\s]*"
Input: Here is "my string" it has "six matches"
Output:
- Here
- is
- "my string"
- it
- has
- "six matches"
Unfortunately it's including the quotes. If you instead use the following:
(("((?<token>.*?)(?<!\\)")|(?<token>[\w]+))(\s)*)
And explicitly capture the "token" matches as follows:
RegexOptions options = RegexOptions.None;
// (?<!\\) keeps a quote preceded by a backslash from ending the token.
Regex regex = new Regex( @"((""((?<token>.*?)(?<!\\)"")|(?<token>[\w]+))(\s)*)", options );
string input = @" Here is ""my string"" it has "" six matches"" ";
var result = (from Match m in regex.Matches( input )
              where m.Groups[ "token" ].Success
              select m.Groups[ "token" ].Value).ToList();
for ( int i = 0; i < result.Count; i++ )
{
    Debug.WriteLine( string.Format( "Token[{0}]: '{1}'", i, result[ i ] ) );
}
Debug output:
Token[0]: 'Here'
Token[1]: 'is'
Token[2]: 'my string'
Token[3]: 'it'
Token[4]: 'has'
Token[5]: ' six matches'
Answer by Brian W
If you'd like to take a look at a general solution to this problem in the form of a free, open-source JavaScript object, you can visit http://splitterjsobj.sourceforge.net/ for a live demo (and download). The object has the following features:
- Pairs of user-defined quote characters can be used to escape the delimiter (prevent a split inside quotes). The quotes can be escaped with a user-defined escape char, and/or by "double quote escape." The escape char can be escaped (with itself). In one of the 5 output arrays (properties of the object), output is unescaped. (For example, if the escape char = /, "a///"b" is unescaped as a/"b)
- Split on an array of delimiters; parse a file in one call. (The output arrays will be nested.)
- All escape sequences recognized by JavaScript can be evaluated during the split process and/or in a preprocess.
- Callback functionality
- Cross-browser consistency
The object is also available as a jQuery plugin, but as a new user at this site I can only include one link in this message.
Answer by Boinst
I was using Bartek Szabat's answer, but I needed to capture more than just "\w" characters in my tokens. To solve the problem, I modified his regex slightly, similar to Grzenio's answer:
Regular Expression: (?<match>[^\s"]+)|(?<match>"[^"]*")
C# String: (?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")
Bartek's code (which returns tokens stripped of enclosing quotes) becomes:
Regex
    .Matches(input, "(?<match>[^\\s\"]+)|(?<match>\"[^\"]*\")")
    .Cast<Match>()
    .Select(m => m.Groups["match"].Value)
    .ToList()
    .ForEach(s => Console.WriteLine(s));
Answer by Richard Shepherd
The top answer doesn't quite work for me. I was trying to split this sort of string by spaces, but it looks like it splits on the dots ('.') as well.
"the lib.lib" "another lib".lib
I know the question asks about regexes, but I ended up writing a non-regex function to do this:
using System.Collections.Generic;
using System.Linq;
using System.Text;
/// <summary>
/// Splits the string passed in by the delimiters passed in.
/// Quoted sections are not split, and all tokens have whitespace
/// trimmed from the start and end.
/// </summary>
public static List<string> split(string stringToSplit, params char[] delimiters)
{
    List<string> results = new List<string>();
    bool inQuote = false;
    StringBuilder currentToken = new StringBuilder();
    for (int index = 0; index < stringToSplit.Length; ++index)
    {
        char currentCharacter = stringToSplit[index];
        if (currentCharacter == '"')
        {
            // When we see a ", we need to decide whether we are
            // at the start or end of a quoted section...
            inQuote = !inQuote;
        }
        else if (delimiters.Contains(currentCharacter) && inQuote == false)
        {
            // We've come to the end of a token, so we find the token,
            // trim it and add it to the collection of results...
            string result = currentToken.ToString().Trim();
            if (result != "") results.Add(result);
            // We start a new token...
            currentToken = new StringBuilder();
        }
        else
        {
            // We've got a 'normal' character, so we add it to
            // the current token...
            currentToken.Append(currentCharacter);
        }
    }
    // We've come to the end of the string, so we add the last token...
    string lastResult = currentToken.ToString().Trim();
    if (lastResult != "") results.Add(lastResult);
    return results;
}