Java: How to remove hard spaces with Jsoup?

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/21137892/

Date: 2020-08-13 06:51:39  Source: igfitidea

How to remove hard spaces with Jsoup?

java jsoup

Asked by Carlos Goce

I'm trying to remove hard spaces (from &nbsp; entities in the HTML). I can't remove them with .trim() or .replace(" ", ""), etc.! I don't get it.

I even found a suggestion on Stack Overflow to try \\u00a0, but that didn't work either.

I tried this (since text() returns actual hard space characters, U+00A0):

System.out.println( "'"+fields.get(6).text().replace("\\u00a0", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().replace(" ", "")+"'" ); //'94,00 '
System.out.println( "'"+fields.get(6).text().trim()+"'"); //'94,00 '
System.out.println( "'"+fields.get(6).html().replace("&nbsp;", "")+"'"); //'94,00' works

But I can't figure out why I can't remove the whitespace with .text().

Accepted answer by T.J. Crowder

Your first attempt was very nearly it; you're quite right that Jsoup maps &nbsp; to U+00A0. You just don't want the double backslash in your string:

System.out.println( "'"+fields.get(6).text().replace("\u00a0", "")+"'" ); //'94,00'
// Just one ------------------------------------------^

replace doesn't use regular expressions, so you aren't trying to pass a literal backslash through to the regex level. You just want to specify character U+00A0 in the string.
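To make the difference concrete, here is a minimal sketch using only the standard library. It assumes the value returned by text() ends with a single U+00A0; the class name NbspDemo, the helper stripNbsp, and the sample value are made up for illustration and are not part of Jsoup.

```java
// Illustrative sketch only: NbspDemo and stripNbsp are hypothetical names.
final class NbspDemo {
    // Removes every NO-BREAK SPACE (U+00A0) from the given text.
    static String stripNbsp(String text) {
        // With one backslash the compiler turns the escape into a single
        // character, and replace() matches that character literally.
        return text.replace("\u00a0", "");
    }

    public static void main(String[] args) {
        // Assumed stand-in for fields.get(6).text(): ends with U+00A0.
        String value = "94,00\u00a0";

        // Double backslash: the search string is six ordinary characters
        // (backslash, u, 0, 0, a, 0), which never occur in the value.
        System.out.println("'" + value.replace("\\u00a0", "") + "'");

        // trim() only strips characters up to U+0020, so U+00A0 survives.
        System.out.println("'" + value.trim() + "'");

        // Single backslash: the hard space is removed.
        System.out.println("'" + stripNbsp(value) + "'");
    }
}
```

Only the last call should print the value without the trailing hard space; the first two leave it in place, which matches what the question observed.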

Answer by Ovokerie Ogbeta

The question has been edited to reflect the true problem.

New answer: the hard space, i.e. the entity &nbsp; (Unicode character NO-BREAK SPACE, U+00A0), can be represented in Java by the character \u00a0. The code thus becomes the following, where str is the string returned by the text() method:

str = str.replaceAll("\u00a0", ""); // strings are immutable, so keep the result
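Because replaceAll takes a regular expression, the escaping rules differ slightly from replace. The sketch below is only an illustration under that assumption: the class NbspRegexDemo and its method names are hypothetical. It shows that with replaceAll both spellings of the escape work, since java.util.regex decodes the backslash-u form itself, plus one possible pattern for trimming hard spaces together with ordinary whitespace.

```java
// Illustrative sketch only: NbspRegexDemo and its methods are hypothetical.
final class NbspRegexDemo {
    // The pattern is the literal U+00A0 character, decoded by the compiler.
    static String stripWithLiteral(String text) {
        return text.replaceAll("\u00a0", "");
    }

    // Here the compiler passes the escape through unchanged and the regex
    // engine decodes it, so the result is the same as above.
    static String stripWithRegexEscape(String text) {
        return text.replaceAll("\\u00a0", "");
    }

    // Strips ordinary whitespace and hard spaces from both ends only,
    // leaving any hard spaces inside the text untouched.
    static String trimIncludingNbsp(String text) {
        return text.replaceAll("^[\\s\\u00a0]+|[\\s\\u00a0]+$", "");
    }

    public static void main(String[] args) {
        // Assumed sample value with a leading space and a trailing U+00A0.
        String value = " 94,00\u00a0";
        System.out.println("'" + stripWithLiteral(value) + "'");
        System.out.println("'" + stripWithRegexEscape(value) + "'");
        System.out.println("'" + trimIncludingNbsp(value) + "'");
    }
}
```

The first two methods remove hard spaces anywhere in the string; the third is closer to what trim() would do if it treated U+00A0 as whitespace.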

Old answer: using the Jsoup library,

import org.jsoup.parser.Parser;

String str1 = Parser.unescapeEntities("last week, Ovokerie Ogbeta", false);
String str2 = Parser.unescapeEntities("Entered &raquo; Here", false);
System.out.println(str1 + " " + str2);

Prints out:

last week, Ovokerie Ogbeta Entered ? Here