Remove HTML tags from a string using Java
Notice: this content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow.
Original question: http://stackoverflow.com/questions/4432560/
Asked by Maverick
I am writing a program that reads emails and separates spam from ham. I am currently reading them with Java's BufferedReader class. I can remove unwanted characters such as '(' or '.' using the replaceAll() method. I also want to remove HTML tags, including &. How can I achieve this?
thanks
EDIT: Thanks for the responses, but I already have a regex. How can I combine both needs into one? Here's the regex I am using now:
lines.replaceAll("[^a-zA-Z]", " ")
Note: I am getting the lines from a txt file. Any other suggestions, please?
Answered by vanneto
Maybe this will work:
String noHTMLString = htmlString.replaceAll("\\<.*?>", "");
It uses regular expressions to remove all HTML tags in a string.
More specifically, it removes all XML-like tags from a string. So <1234> will be removed even though it is not a valid HTML tag. But it's good enough for most intents and purposes.
Hope this helps.
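The over-matching described in this answer is easy to see on a few inputs. The sketch below is only an illustration: the class name TagRegexDemo, the helper stripTags, and the sample strings are made up for this example and do not come from the answer.

```java
public class TagRegexDemo {
    // Same idea as the answer's pattern: '<', then as few characters
    // as possible, then the next '>'.
    static String stripTags(String input) {
        return input.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        // A real tag pair is removed as intended.
        System.out.println(stripTags("<b>Hello</b> world")); // prints "Hello world"
        // <1234> is not HTML, but it is removed as well.
        System.out.println(stripTags("id <1234> here"));
        // Plain comparison signs are treated as one long "tag".
        System.out.println(stripTags("1 < 2 and 3 > 1"));
    }
}
```

In the last call everything between the '<' and the '>' disappears, so text that merely contains both characters can lose words. Whether that matters depends on how often such text appears in the emails being processed.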
Answered by Kurt Kaylor
You will want to do some lightweight parsing to strip the HTML:
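import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

String extractText(String html) throws IOException {
    // Collects every run of text the parser reports; tags themselves are ignored.
    final List<String> list = new ArrayList<String>();
    ParserDelegator parserDelegator = new ParserDelegator();
    ParserCallback parserCallback = new ParserCallback() {
        public void handleText(final char[] data, final int pos) {
            list.add(new String(data));
        }
        public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
        public void handleEndTag(Tag t, final int pos) { }
        public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
        public void handleComment(final char[] data, final int pos) { }
        public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    // The third argument (true) tells the parser to ignore any charset declaration.
    parserDelegator.parse(new StringReader(html), parserCallback, true);
    // Join the text runs, each preceded by a space.
    StringBuilder text = new StringBuilder();
    for (String s : list) {
        text.append(" ").append(s);
    }
    return text.toString();
}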
Answered by Program-Me-Rev
Answered by Jitendra
import java.io.*;

public class Html2TextWithRegExp {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
                // or, to keep the line breaks:
                // sb.append(line).append(System.getProperty("line.separator"));
            }
        }
        // Remove everything from each '<' up to the next '>'
        String nohtml = sb.toString().replaceAll("\\<.*?>", "");
        System.out.println(nohtml);
    }
}
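To tie this back to the asker's edit, the tag-stripping step and the existing [^a-zA-Z] cleanup can be chained on each line. The following is a minimal sketch, not the method from any of the answers: the class name EmailLineCleaner, the method cleanLine, and the entity pattern are assumptions introduced for illustration.

```java
import java.util.regex.Pattern;

public class EmailLineCleaner {
    // Anything that looks like a tag: '<', any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Character references such as &amp; or &#160; (an assumed shape,
    // not a check against the real list of HTML entities).
    private static final Pattern ENTITY = Pattern.compile("&#?[a-zA-Z0-9]+;");
    // The asker's original cleanup: every non-letter becomes a space.
    private static final Pattern NON_LETTER = Pattern.compile("[^a-zA-Z]");

    static String cleanLine(String line) {
        // Tags and entities go first; otherwise the non-letter step would
        // strip '<', '>' and '&' and leave tag names behind as words.
        String noTags = TAG.matcher(line).replaceAll(" ");
        String noEntities = ENTITY.matcher(noTags).replaceAll(" ");
        String lettersOnly = NON_LETTER.matcher(noEntities).replaceAll(" ");
        // Collapse the runs of spaces left behind by the replacements.
        return lettersOnly.trim().replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        // prints "Buy cheap pills more"
        System.out.println(cleanLine("<p>Buy <b>cheap</b> pills &amp; more!</p>"));
    }
}
```

The order of the three steps is the design choice here. A tag that spans a line break is not handled when lines are cleaned one at a time, so for such input it may be safer to join the lines first, as the program above does, or to use a parser.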