Tokenize a string with a space in Java
Original URL: http://stackoverflow.com/questions/1501317/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Asked by kal
I want to tokenize a string like this:
String line = "a=b c='123 456' d=777 e='uij yyy'";
I cannot simply split on spaces like this, because the quoted values contain spaces themselves:
String[] words = line.split(" ");
Any idea how I can split it so that I get tokens like these?
a=b
c='123 456'
d=777
e='uij yyy'
Accepted answer by cletus
The simplest way to do this is to hand-implement a simple finite state machine. In other words, process the string one character at a time:
- When you hit a space, break off a token;
- When you hit a quote, keep consuming characters until you hit another quote.
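The two rules above can be sketched roughly as follows. This is only a minimal illustration, not code from the answer: the class name QuoteAwareTokenizer and the method tokenize are made up for this example, and it assumes single quotes are the only quoting character and that there is no escape character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
class QuoteAwareTokenizer {

    // Splits on spaces, except for spaces that appear between single quotes.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false; // the only state the machine needs

        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '\'') {
                // Entering or leaving a quoted section; keep the quote in the token
                inQuote = !inQuote;
                current.append(ch);
            } else if (ch == ' ' && !inQuote) {
                // A space outside quotes ends the current token (if any)
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        // Flush the last token
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a=b c='123 456' d=777 e='uij yyy'")) {
            System.out.println(token);
        }
    }
}
```

An unterminated quote simply runs to the end of the line in this sketch; whether that should be an error instead depends on your input.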
Answer by Sev
Depending on the formatting of your original string, you should be able to pass a regular expression to the Java "split" method (the original answer links to an example).
The linked example doesn't use the regular expression that you would need for this task, though.
You can also use a related SO thread (linked in the original answer) as a guideline. It is in PHP, but it does something very close to what you need. Manipulating that slightly might do the trick (although whether the quotes should be part of the output may cause some issues). Keep in mind that regex syntax is very similar in most languages.
Edit: going much further into this type of task may be beyond the capabilities of regex, so you may need to write a simple parser.
Answer by rajax
Have you tried splitting on '=' and creating a token out of each pair of elements in the resulting array?
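One possible reading of this suggestion, as a rough sketch only: after splitting on '=', every piece except the first and the last holds the previous value followed by the next key, separated by the last space in that piece. The class name SplitOnEquals is made up for this example, and it assumes well-formed input in which keys contain no spaces and values contain no '=' character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
class SplitOnEquals {

    static List<String> tokenize(String line) {
        // "a=b c='123 456' d=777" -> ["a", "b c", "'123 456' d", "777"]
        String[] parts = line.trim().split("=");
        List<String> tokens = new ArrayList<>();
        String key = parts[0];

        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                // The last piece is just the final value
                tokens.add(key + "=" + part);
            } else {
                // Everything before the last space is the value,
                // everything after it is the next key
                int lastSpace = part.lastIndexOf(' ');
                if (lastSpace < 0) {
                    throw new IllegalArgumentException("Malformed input near: " + part);
                }
                tokens.add(key + "=" + part.substring(0, lastSpace));
                key = part.substring(lastSpace + 1);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("a=b c='123 456' d=777 e='uij yyy'")) {
            System.out.println(token);
        }
    }
}
```

This never looks at the quotes at all, so it breaks as soon as a value contains '='.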
Answer by Stephen Denne
StreamTokenizer can help, although it is easiest to set it up to break on '=', as it will always break at the start of a quoted string:
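// Requires: import java.io.StreamTokenizer; import java.io.StringReader;
// nextToken() can throw java.io.IOException, so call this from a method that declares it.
String s = "Ta=b c='123 456' d=777 e='uij yyy'";
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
// Treat digits as word characters so that 777 comes back as a word, not a number
st.ordinaryChars('0', '9');
st.wordChars('0', '9');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
    switch (st.ttype) {
        case StreamTokenizer.TT_NUMBER:
            System.out.println(st.nval);
            break;
        case StreamTokenizer.TT_WORD:
            System.out.println(st.sval);
            break;
        case '=':
            System.out.println("=");
            break;
        default:
            // Quoted string: ttype is the quote character, sval holds the contents
            System.out.println(st.sval);
    }
}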
This outputs:
Ta
=
b
c
=
123 456
d
=
777
e
=
uij yyy
If you leave out the two lines that convert numeric characters to word characters (the ordinaryChars and wordChars calls), then you get d=777.0, which might be useful to you.
Answer by Raymond Kroeker
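java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(line, " ");
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    int index = token.indexOf('=');
    if (index < 0) {
        // No '=': this piece is the tail of a quoted value that contained a space
        // (e.g. "456'"). This simple approach does not rejoin it, so skip it
        // instead of letting substring(0, -1) throw.
        continue;
    }
    String key = token.substring(0, index);
    String value = token.substring(index + 1);
}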
Answer by hashable
Assumptions:
- Your variable name ('a' in the assignment 'a=b') can be of length 1 or more.
- Your variable name ('a' in the assignment 'a=b') cannot contain the space character; anything else is fine.
- Validation of your input is not required (the input is assumed to be in valid a=b format).
This works fine for me.
Input:
a=b abc='123 456' &=777 #='uij yyy' ABC='slk slk' 123sdkljhSDFjflsakd@*#&=456sldSLKD)#(
Output:
a=b
abc='123 456'
&=777
#='uij yyy'
ABC='slk slk'
123sdkljhSDFjflsakd@*#&=456sldSLKD)#(
Code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {

    // SPACE CHARACTER followed by
    // sequence of non-space characters of 1 or more followed by
    // first occurring EQUALS CHARACTER
    final static String regex = " [^ ]+?=";

    // static pattern defined outside so that you don't have to compile it
    // for each method call
    static final Pattern p = Pattern.compile(regex);

    public static List<String> tokenize(String input, Pattern p) {
        input = input.trim(); // this is important for "last token case"
                              // see end of method
        Matcher m = p.matcher(input);
        ArrayList<String> tokens = new ArrayList<String>();
        int beginIndex = 0;
        while (m.find()) {
            int endIndex = m.start();
            tokens.add(input.substring(beginIndex, endIndex));
            beginIndex = endIndex + 1;
        }
        // LAST TOKEN CASE
        // add last token
        tokens.add(input.substring(beginIndex));
        return tokens;
    }

    private static void println(List<String> tokens) {
        for (String token : tokens) {
            System.out.println(token);
        }
    }

    public static void main(String args[]) {
        String test = "a=b " +
                "abc='123 456' " +
                "&=777 " +
                "#='uij yyy' " +
                "ABC='slk slk' " +
                "123sdkljhSDFjflsakd@*#&=456sldSLKD)#(";
        List<String> tokens = RegexTest.tokenize(test, p);
        println(tokens);
    }
}
Answer by Brett Kail
This solution is both general and compact (it is effectively the regex version of cletus' answer):
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String line = "a=b c='123 456' d=777 e='uij yyy'";
Matcher m = Pattern.compile("('[^']*?'|\\S)+").matcher(line);
while (m.find()) {
    System.out.println(m.group()); // or whatever you want to do
}
In other words, find all runs of characters that are combinations of quoted strings or non-space characters; nested quotes are not supported (there is no escape character).
Answer by cherouvim
line.split(" (?=[a-z]+=)")
correctly gives:
a=b
c='123 456'
d=777
e='uij yyy'
Make sure you adapt the [a-z]+ part in case the structure of your keys changes.
Edit: this solution can fail miserably if there is a "=" character in the value part of the pair.
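To make both the approach and its caveat concrete, here is a small sketch. The class name LookaheadSplit is made up for this example, and the key pattern [a-z]+ (one or more lowercase letters) is an assumption about what your keys look like. The split happens only at a space that is directly followed by a key and an '=' sign; the lookahead means the key itself is not consumed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, named for this example only.
class LookaheadSplit {

    // Split at each space that is immediately followed by "<key>=".
    static List<String> tokenize(String line) {
        return Arrays.asList(line.split(" (?=[a-z]+=)"));
    }

    public static void main(String[] args) {
        // Works for the input from the question:
        // prints [a=b, c='123 456', d=777, e='uij yyy']
        System.out.println(tokenize("a=b c='123 456' d=777 e='uij yyy'"));

        // Fails when a quoted value contains something that looks like "key=":
        // prints [a='x, y=z', b=1]
        System.out.println(tokenize("a='x y=z' b=1"));
    }
}
```

If your values can contain '=', a quote-aware approach (a state machine or a regex that matches quoted runs) is the safer choice.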
Answer by kmkswamy
// Requires: import java.util.HashMap; import java.util.StringTokenizer;
public static void main(String[] args) {
    String token;
    String value = "";
    HashMap<String, String> attributes = new HashMap<String, String>();
    String line = "a=b c='123 456' d=777 e='uij yyy'";
    StringTokenizer tokenizer = new StringTokenizer(line, " ");
    while (tokenizer.hasMoreTokens()) {
        token = tokenizer.nextToken();
        // A non-empty value means we are inside an unfinished quoted value,
        // so keep appending even if this piece has no quote of its own
        value = (token.contains("'") || !value.isEmpty()) ? value + " " + token : token;
        if (!value.contains("'") || value.endsWith("'")) {
            // Split the string and put the key/value pair into the HashMap
            attributes.put(value.split("=")[0].trim(), value.split("=")[1]);
            value = "";
        }
    }
    System.out.println(attributes);
}
Output: {d=777, a=b, e='uij yyy', c='123 456'}
Note that consecutive spaces inside a value are collapsed to a single space. The attributes HashMap contains the resulting key/value pairs.
Answer by asusu
Or, with a regex for tokenizing, and a little state machine that just adds each key/value pair to a map:
// Requires: import java.util.HashMap; import java.util.Map;
//           import java.util.regex.Matcher; import java.util.regex.Pattern;
String line = "a = b c='123 456' d=777 e = 'uij yyy'";
Map<String,String> keyval = new HashMap<String,String>();
String state = "key";
// Tokens are: an '=' sign, a quoted string, or a run of non-space, non-'=' characters
Matcher m = Pattern.compile("(=|'[^']*?'|[^\\s=]+)").matcher(line);
String key = null;
while (m.find()) {
    String found = m.group();
    if (state.equals("key")) {
        if (found.equals("=") || found.startsWith("'"))
            { System.err.println ("ERROR"); }
        else { key = found; state = "equals"; }
    } else if (state.equals("equals")) {
        if (! found.equals("=")) { System.err.println ("ERROR"); }
        else { state = "value"; }
    } else if (state.equals("value")) {
        if (key == null) { System.err.println ("ERROR"); }
        else {
            // Strip the surrounding quotes from quoted values
            if (found.startsWith("'"))
                found = found.substring(1,found.length()-1);
            keyval.put (key, found);
            key = null;
            state = "key";
        }
    }
}
if (! state.equals("key")) { System.err.println ("ERROR"); }
System.out.println ("map: " + keyval);
prints out:
map: {d=777, e=uij yyy, c=123 456, a=b}
It does some basic error checking and strips the quotes from the values.