Remove HTML tags from a string using Java

Question

Asked by Maverick

I am writing a program that reads emails and separates spam from ham. I read the files with Java's BufferedReader class, and I can already remove unwanted characters such as '(' or '.' with the replaceAll() method. I also want to remove HTML tags, including entities such as &amp;. How can I achieve this?

Thanks.

EDIT: Thanks for the responses, but I already have a regex. How can I combine both requirements into one? Here is the regex I am using now:

lines.replaceAll("[^a-zA-Z]", " ")

Note: I am reading the lines from a .txt file. Any other suggestions, please?

Answer 1

Answered by vanneto

Maybe this will work:


String noHTMLString = htmlString.replaceAll("<.*?>", "");

It uses a regular expression to remove all HTML tags in a string.

More specifically, it removes all XML-like tags from a string, so <1234> will be removed even though it is not a valid HTML tag. It is good enough for most purposes.
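
For the combined requirement in the question's edit, one possible sketch is to strip tags first, then entity references such as `&amp;`, and only then apply the existing `[^a-zA-Z]` filter. The order matters: running the letter filter first would erase the `<` and `>` that the tag pattern relies on. The class name `SpamTextCleaner`, the method `clean`, the entity pattern, and the sample input are illustrative assumptions, not part of the original answer. The entity pattern requires a trailing semicolon, so a bare `&amp` would only lose its `&` in the final filter.

```java
public class SpamTextCleaner {

    // Hypothetical helper: strips tags, then entities, then non-letters.
    public static String clean(String line) {
        return line
                .replaceAll("<[^>]*>", " ")          // anything that looks like <...>
                .replaceAll("&#?[a-zA-Z0-9]+;", " ") // entities such as &amp; or &#39; (assumed pattern)
                .replaceAll("[^a-zA-Z]", " ")        // the asker's original filter
                .replaceAll("\\s+", " ")             // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        // prints "Win claim FREE cash now"
        System.out.println(clean("<p>Win &amp; claim <b>FREE</b> cash (now)!</p>"));
    }
}
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being glued together, which matters when the result is later split into tokens.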


Hope this helps.


Answer 2

Answered by Kurt Kaylor

You will want to do some lightweight parsing to strip the HTML:


import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

String extractText(String html) throws IOException {
    final List<String> list = new ArrayList<>();

    ParserDelegator parserDelegator = new ParserDelegator();
    // ParserCallback's other handlers (tags, comments, errors) are no-ops by default,
    // so only handleText needs to be overridden to collect the text content.
    ParserCallback parserCallback = new ParserCallback() {
        @Override
        public void handleText(final char[] data, final int pos) {
            list.add(new String(data));
        }
    };
    parserDelegator.parse(new StringReader(html), parserCallback, true);

    // Join the collected fragments, each preceded by a space as in the original loop.
    StringBuilder text = new StringBuilder();
    for (String s : list) {
        text.append(' ').append(s);
    }
    return text.toString();
}

Answer 3

Answered by Program-Me-Rev

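Use jsoup, a third-party HTML parser library:

import org.jsoup.Jsoup;

public static String html2text(String html) {
    // Parse the HTML and return only its text content.
    return Jsoup.parse(html).text();
}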

Answer 4

Answered by Jitendra

import java.io.*;

public class Html2TextWithRegExp {

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
                // or
                // sb.append(line).append(System.getProperty("line.separator"));
            }
        }
        // "<" needs no escaping in a Java regex, and "\<" is not a valid string escape
        String nohtml = sb.toString().replaceAll("<.*?>", "");
        System.out.println(nohtml);
    }
}
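
One caveat when the line separator is kept (the commented-out variant in the loop): a tag that spans two lines is no longer matched by `<.*?>`, because `.` does not match line terminators by default. A minimal sketch of two ways around this follows. The class name `MultiLineTagDemo`, its method names, and the sample input are made up for illustration.

```java
public class MultiLineTagDemo {

    // Negated character class: [^>] also matches line terminators.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // The original pattern with the DOTALL flag (?s), so '.' matches line terminators too.
    public static String stripTagsDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "<a\n href=\"x.html\">link</a> text";
        // Without DOTALL the opening tag survives because it contains a newline.
        System.out.println(html.replaceAll("<.*?>", ""));
        System.out.println(stripTags(html));       // prints "link text"
        System.out.println(stripTagsDotAll(html)); // prints "link text"
    }
}
```

Either variant works on text joined with separators. Appending lines without a separator also avoids the problem, at the cost of fusing the last word of one line with the first word of the next.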
