Source: Stack Overflow, http://stackoverflow.com/questions/1635764/. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.

Time: 2020-08-12 18:03:00 (source: igfitidea)

String parsing in Java with delimiter tab "\t" using split

Tags: java, string, tab-delimited

Asked by lakhaman

I'm processing a string which is tab delimited. I'm accomplishing this using the split function, and it works in most situations. The problem occurs when a field is missing: instead of getting null in that field, I get the next value. I'm storing the parsed values in a string array.

String[] columnDetail = new String[11];
columnDetail = column.split("\t");

Any help would be appreciated. If possible I'd like to store the parsed strings into a string array so that I can easily access the parsed data.

Answered by Filip Ekberg

String.split uses regular expressions. Also, you don't need to allocate an extra array for your split.

The split method returns an array. The problem is that you try to pre-define how many tabs occur, but how would you really know that? Try using the Scanner or StringTokenizer, and learn how splitting strings works.

Let me explain why \t does not work and why you need \\\\ to escape \\.

Okay, so when you use split, it actually takes a regex (regular expression), and in a regular expression you define what character to split by. If you write \t, that actually doesn't mean \t, and what you WANT to split by is \t, right? So by just writing \t you tell your regex processor "Hey, split by the character that is an escaped t", NOT "Hey, split by all characters looking like \t". Notice the difference? Using \ means to escape something, and \ in a regex means something totally different from what you think.

So this is why you need to use this solution:

\\t

This tells the regex processor to look for \t. Okay, so why would you need two of them? Well, the first \ escapes the second, which means it will look like this: \t when you are processing the text!

Now let's say that you are looking to split on \

Well, then you would be left with \\, but see, that doesn't work, because \ will try to escape the next char! That is why you want the output to be \\, and therefore you need to have \\\\.
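
To see the escaping levels in action, here is a minimal sketch; the class name `EscapeDemo`, the helper names, and the sample strings are made up for illustration. One thing worth checking for yourself: in Java the literal "\t" is already a real tab character by the time the regex engine sees it, and that character matches itself, so both "\t" and "\\t" split on tabs here. A literal backslash is the case that really needs "\\\\".

```java
import java.util.Arrays;

class EscapeDemo {
    // Java literal "\\t" reaches the regex engine as the two characters \t,
    // which the regex engine interprets as "tab".
    static String[] splitOnRegexTab(String s) {
        return s.split("\\t");
    }

    // Java literal "\t" is a single raw tab character; the regex matches it literally.
    static String[] splitOnRawTab(String s) {
        return s.split("\t");
    }

    // Java literal "\\\\" reaches the regex engine as \\, i.e. one literal backslash.
    static String[] splitOnBackslash(String s) {
        return s.split("\\\\");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnRegexTab("a\tb")));  // [a, b]
        System.out.println(Arrays.toString(splitOnRawTab("a\tb")));    // [a, b]
        System.out.println(Arrays.toString(splitOnBackslash("a\\b"))); // [a, b]
    }
}
```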

I really hope the examples above help you understand why your solution doesn't work and how to conquer other ones!

Now, I've given you this answer before; maybe you should start looking at them now.

OTHER METHODS

StringTokenizer

You should look into the StringTokenizer, it's a very handy tool for this type of work.

Example

 StringTokenizer st = new StringTokenizer("this is a test");
 while (st.hasMoreTokens()) {
     System.out.println(st.nextToken());
 }

This will output

 this
 is
 a
 test

You use the second constructor of StringTokenizer to set the delimiter:

StringTokenizer(String str, String delim)
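
As a rough sketch of that constructor with a tab delimiter (the class name `TokenizerDemo` and the sample input are invented for illustration): note that StringTokenizer does not return empty tokens, so a missing field between two tabs simply disappears, which looks a lot like the symptom described in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Collects all tokens of a tab-delimited line using the two-argument constructor.
    static List<String> tokens(String line) {
        StringTokenizer st = new StringTokenizer(line, "\t");
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // The empty field between the two tabs is skipped, not reported.
        System.out.println(tokens("a\t\tc")); // [a, c]
    }
}
```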

Scanner

You could also use a Scanner, as one of the commentators said. This could look somewhat like this:

Example

 String input = "1 fish 2 fish red fish blue fish";

 Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");

 System.out.println(s.nextInt());
 System.out.println(s.nextInt());
 System.out.println(s.next());
 System.out.println(s.next());

 s.close(); 

The output would be

 1
 2
 red
 blue 

Meaning that it will cut out the word "fish" and give you the rest, using "fish" as the delimiter.

Examples taken from the Java API documentation.

Answered by Luke Usherwood

String.split implementations will have serious limitations if the data in a tab-delimited field itself contains newline, tab and possibly " characters.

TAB-delimited formats have been around for donkey's years, but format is not standardised and varies. Many implementations don't escape characters (newlines and tabs) appearing within a field. Rather, they follow CSV conventions and wrap any non-trivial fields in "double quotes". Then they escape only double-quotes. So a "line" could extend over multiple lines.
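
A small sketch of the limitation, assuming a made-up line in which one quoted field contains a tab (the class name `QuotedFieldDemo` and the data are hypothetical): a plain split knows nothing about quotes, so it cuts the quoted field in two.

```java
class QuotedFieldDemo {
    // Naive approach: split on every tab, ignoring any quoting convention.
    static String[] naiveSplit(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // Three logical fields: id1, "first<TAB>second", end
        String line = "id1\t\"first\tsecond\"\tend";
        // The tab inside the quotes is treated as a separator too.
        System.out.println(naiveSplit(line).length); // 4
    }
}
```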

Reading around, I heard "just reuse Apache tools", which sounds like good advice.

In the end I personally chose opencsv. I found it lightweight, and since it provides options for escape and quote characters, it should cover most popular comma- and tab-delimited data formats.

Example:

CSVReader tabFormatReader = new CSVReader(new FileReader("yourfile.tsv"), '\t');

Answered by Happy3

Try this:

String[] columnDetail = column.split("\t", -1);

Read the Javadoc on String.split(java.lang.String, int) for an explanation of the limit parameter of the split function:

split

public String[] split(String regex, int limit)
Splits this string around matches of the given regular expression.
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The string "boo:and:foo", for example, yields the following results with these parameters:

Regex   Limit   Result
:       2       { "boo", "and:foo" }
:       5       { "boo", "and", "foo" }
:       -2      { "boo", "and", "foo" }
o       5       { "b", "", ":and:f", "", "" }
o       -2      { "b", "", ":and:f", "", "" }
o       0       { "b", "", ":and:f" }

When the last few fields are missing (I guess that's your situation), you will get a line like this:

field1\tfield2\tfield3\t\t

If no limit is passed to split(), the limit is 0, which means that "trailing empty strings will be discarded". So you get just 3 fields: {"field1", "field2", "field3"}.

When limit is set to -1, a non-positive value, trailing empty strings are not discarded. So you get 5 fields, with the last two being empty strings: {"field1", "field2", "field3", "", ""}.
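
The difference can be checked with a short sketch using the sample row from this answer (the class and method names `LimitDemo`, `dropTrailing`, and `keepTrailing` are made up for illustration):

```java
class LimitDemo {
    // No limit argument behaves like limit 0: trailing empty strings are discarded.
    static String[] dropTrailing(String row) {
        return row.split("\t");
    }

    // A negative limit keeps trailing empty strings.
    static String[] keepTrailing(String row) {
        return row.split("\t", -1);
    }

    public static void main(String[] args) {
        String row = "field1\tfield2\tfield3\t\t";
        System.out.println(dropTrailing(row).length); // 3
        System.out.println(keepTrailing(row).length); // 5
    }
}
```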

Answered by Mr_and_Mrs_D

Well, nobody answered, which is in part the fault of the question: the input string contains eleven fields (this much can be inferred), but how many tabs? Most likely exactly 10. Then the answer is:

String s = "\t2\t\t4\t5\t6\t\t8\t\t10\t";
String[] fields = s.split("\t", -1);  // in your case s.split("\t", 11) might also do
for (int i = 0; i < fields.length; ++i) {
    if ("".equals(fields[i])) fields[i] = null;
}
System.out.println(Arrays.asList(fields));
// [null, 2, null, 4, 5, 6, null, 8, null, 10, null]
// with s.split("\t") : [null, 2, null, 4, 5, 6, null, 8, null, 10]

If the fields happen to contain tabs, this won't work as expected, of course.
The -1 means: apply the pattern as many times as needed, so trailing fields (the 11th) will be preserved (as empty strings ("") if absent, which need to be turned into null explicitly).

If, on the other hand, there are no tabs for the missing fields, so that "5\t6" is a valid input string containing only the fields 5 and 6, there is no way to get fields[] via split.

Answered by Ivan Marinov

I just had the same question and noticed the answer in some kind of tutorial. In general you need to use the second form of the split method:

split(regex, limit)

Here is the full tutorial: http://www.rgagnon.com/javadetails/java-0438.html

If you set a negative number for the limit parameter, you will get empty strings in the array where the actual values are missing. For this to work, your initial string needs two consecutive delimiters, i.e. \t\t, wherever a value is missing.

Hope this helps :)

Answered by RickeyShao

You can use yourstring.split("\\x09"); I tested it, and it works.
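
A minimal sketch of this variant, with a made-up class name `HexTabDemo` and sample input: the regex escape \x09 is the hexadecimal code of the tab character, and it has to be written as "\\x09" in Java source, since Java string literals have no \x escape of their own.

```java
import java.util.Arrays;

class HexTabDemo {
    // "\\x09" reaches the regex engine as \x09, the hex escape for a tab.
    static String[] splitOnHexTab(String s) {
        return s.split("\\x09");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnHexTab("a\tb"))); // [a, b]
    }
}
```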
