Java: Remove HTML tags from a String

Question

Asked by Mason

Is there a good way to remove HTML from a Java string? A simple regex like

replaceAll("\\<.*?>", "")

will work, but entities such as &amp; won't be converted correctly, and non-HTML text between two angle brackets will be removed (i.e. whatever the .*? in the regex matches will disappear).

Answer 1

Accepted answer by BalusC

Use an HTML parser instead of regex. This is dead simple with Jsoup.

public static String html2text(String html) {
    return Jsoup.parse(html).text();
}

Jsoup also supports removing HTML tags against a customizable whitelist, which is very useful if you want to allow only a few specific tags.

Answer 2

Answered by Chris Marasti-Georg

If the user enters <b>hey!</b>, do you want to display <b>hey!</b> or hey!? If the first, escape less-than signs and HTML-encode ampersands (and optionally quotes), and you're fine. A modification to your code to implement the second option would be:

replaceAll("\\<[^>]*>", "")

but you will run into issues if the user enters something malformed, like <bhey!.

You can also check out JTidy, which will parse "dirty" HTML input and should give you a way to remove the tags while keeping the text.

The problem with trying to strip HTML is that browsers have very lenient parsers, more lenient than any library you can find, so even if you do your best to strip all tags (using the replace method above, a DOM library, or JTidy), you will still need to encode any remaining HTML special characters to keep your output safe.
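As a rough illustration of the "encode any remaining HTML special characters" step, here is a minimal sketch that uses only the standard library. The class name HtmlEscapeSketch and the method escapeHtml are hypothetical, and the set of five characters is an assumption about what counts as "special"; a maintained library is likely the safer choice for production code.

```java
public class HtmlEscapeSketch {

    // Hypothetical helper: replaces the five characters that are special
    // in HTML text and attribute values with their entity forms.
    public static String escapeHtml(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: &lt;b&gt;hey!&lt;/b&gt; &amp; co
        System.out.println(escapeHtml("<b>hey!</b> & co"));
    }
}
```

Escaping in a single pass over the characters avoids double-encoding the ampersands that the other replacements introduce.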

Answer 3

Answered by Tim Howland

HTML escaping is really hard to do right. I'd definitely suggest using library code for this, as it's a lot more subtle than you'd think. Check out Apache's StringEscapeUtils for a pretty good library for handling this in Java.

Answer 4

Answered by foxy

You might want to replace line-break and paragraph tags with newlines before stripping the HTML, to prevent it from becoming an illegible mess, as Tim suggests.

The only way I can think of to remove HTML tags while leaving non-HTML text between angle brackets would be to check against a list of HTML tags. Something along these lines...

replaceAll("\\<[\\s]*tag[^>]*>", "")

Then HTML-decode special characters such as &amp;. The result should not be considered to be sanitized.
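The steps described in this answer could be sketched roughly as follows. The class name KnownTagStripper, the short tag list, and the three decoded entities are all assumptions chosen for illustration; a real tag list would be much longer, and the output is still not sanitized.

```java
import java.util.regex.Pattern;

public class KnownTagStripper {

    // Assumed, deliberately short list of tag names; extend it for real input.
    private static final String TAGS = "p|br|b|i|u|a|div|span";

    // Line-breaking tags that become newlines: <br>, <br/>, </p>.
    private static final Pattern BREAKS =
            Pattern.compile("<\\s*(?:br\\s*/?|/\\s*p)\\s*>", Pattern.CASE_INSENSITIVE);

    // An opening or closing tag whose name is in the list, with optional attributes.
    private static final Pattern KNOWN_TAGS =
            Pattern.compile("<\\s*/?\\s*(?:" + TAGS + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        String text = BREAKS.matcher(html).replaceAll("\n");
        text = KNOWN_TAGS.matcher(text).replaceAll("");
        // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
        return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // <foo> is not in the list, so it survives. Prints: hey <foo> x
        System.out.println(strip("<b>hey</b> <foo> x"));
    }
}
```

Because only listed names are matched, text such as <foo> is left alone, which is the behaviour the question asked for, at the cost of maintaining the list.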

Answer 5

Answered by foxy

It sounds like you want to go from HTML to plain text.
If that is the case, look at www.htmlparser.org. Here is an example that strips all the tags out of the HTML file found at a URL.
It makes use of org.htmlparser.beans.StringBean.

import org.htmlparser.beans.StringBean;

public static String getUrlContentsAsText(String url) {
    StringBean stringBean = new StringBean();
    stringBean.setURL(url);
    // getStrings() returns the page text with the tags removed
    return stringBean.getStrings();
}

Answer 6

Answered by RealHowTo

Another way is to use javax.swing.text.html.HTMLEditorKit to extract the text.

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {
    }

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Ref: Remove HTML tags from a file to extract only the TEXT

Answer 7

Answered by Mike

Here's a slightly more fleshed-out update that tries to handle some formatting for breaks and lists. I used Amaya's output as a guide.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Stack;
import java.util.logging.Logger;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML2Text extends HTMLEditorKit.ParserCallback {
    private static final Logger log = Logger
            .getLogger(Logger.GLOBAL_LOGGER_NAME);

    private StringBuffer stringBuffer;

    private Stack<IndexType> indentStack;

    public static class IndexType {
        public String type;
        public int counter; // used for ordered lists

        public IndexType(String type) {
            this.type = type;
            counter = 0;
        }
    }

    public HTML2Text() {
        stringBuffer = new StringBuffer();
        indentStack = new Stack<IndexType>();
    }

    public static String convert(String html) {
        HTML2Text parser = new HTML2Text();
        Reader in = new StringReader(html);
        try {
            // the HTML to convert
            parser.parse(in);
        } catch (Exception e) {
            log.severe(e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException ioe) {
                // this should never happen
            }
        }
        return parser.getText();
    }

    public void parse(Reader in) throws IOException {
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("StartTag:" + t.toString());
        if (t.toString().equals("p")) {
            if (stringBuffer.length() > 0
                    && !stringBuffer.substring(stringBuffer.length() - 1)
                            .equals("\n")) {
                newLine();
            }
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.push(new IndexType("ol"));
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.push(new IndexType("ul"));
            newLine();
        } else if (t.toString().equals("li")) {
            IndexType parent = indentStack.peek();
            if (parent.type.equals("ol")) {
                String numberString = "" + (++parent.counter) + ".";
                stringBuffer.append(numberString);
                for (int i = 0; i < (4 - numberString.length()); i++) {
                    stringBuffer.append(" ");
                }
            } else {
                stringBuffer.append("*   ");
            }
            indentStack.push(new IndexType("li"));
        } else if (t.toString().equals("dl")) {
            newLine();
        } else if (t.toString().equals("dt")) {
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.push(new IndexType("dd"));
            newLine();
        }
    }

    private void newLine() {
        stringBuffer.append("\n");
        for (int i = 0; i < indentStack.size(); i++) {
            stringBuffer.append("    ");
        }
    }

    public void handleEndTag(HTML.Tag t, int pos) {
        log.info("EndTag:" + t.toString());
        if (t.toString().equals("p")) {
            newLine();
        } else if (t.toString().equals("ol")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("ul")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("li")) {
            indentStack.pop();
            newLine();
        } else if (t.toString().equals("dd")) {
            indentStack.pop();
        }
    }

    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        log.info("SimpleTag:" + t.toString());
        if (t.toString().equals("br")) {
            newLine();
        }
    }

    public void handleText(char[] text, int pos) {
        log.info("Text:" + new String(text));
        stringBuffer.append(text);
    }

    public String getText() {
        return stringBuffer.toString();
    }

    public static void main(String args[]) {
        String html = "<html><body><p>paragraph at start</p>hello<br />What is happening?<p>this is a<br />mutiline paragraph</p><ol>  <li>This</li>  <li>is</li>  <li>an</li>  <li>ordered</li>  <li>list    <p>with</p>    <ul>      <li>another</li>      <li>list        <dl>          <dt>This</dt>          <dt>is</dt>            <dd>sdasd</dd>            <dd>sdasda</dd>            <dd>asda              <p>aasdas</p>            </dd>            <dd>sdada</dd>          <dt>fsdfsdfsd</dt>        </dl>        <dl>          <dt>vbcvcvbcvb</dt>          <dt>cvbcvbc</dt>            <dd>vbcbcvbcvb</dd>          <dt>cvbcv</dt>          <dt></dt>        </dl>        <dl>          <dt></dt>        </dl></li>      <li>cool</li>    </ul>    <p>stuff</p>  </li>  <li>cool</li></ol><p></p></body></html>";
        System.out.println(convert(html));
    }
}

Answer 8

Answered by rjha94

One more way is to use the com.google.gdata.util.common.html.HtmlToText class, like this:

MyWriter.toConsole(HtmlToText.htmlToPlainText(htmlResponse));

This is not bulletproof code, though: when I run it on Wikipedia entries, I also get style information in the output. However, I believe it would be effective for small, simple jobs.

Answer 9

Answered by dfrankow

The accepted answer did not work for me for the test case I indicated: the result of "a < b or b > c" is "a b or b > c".

So I used TagSoup instead. Here's an attempt that worked for my test case (and a couple of others):

import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Logger;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/**
 * Take HTML and give back the text part while dropping the HTML tags.
 *
 * There is some risk that using TagSoup means we'll permute non-HTML text.
 * However, it seems to work the best so far in test cases.
 *
 * @author dan
 * @see <a href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> 
 */
public class Html2Text2 implements ContentHandler {
private StringBuffer sb;

public Html2Text2() {
}

public void parse(String str) throws IOException, SAXException {
    XMLReader reader = new Parser();
    reader.setContentHandler(this);
    sb = new StringBuffer();
    reader.parse(new InputSource(new StringReader(str)));
}

public String getText() {
    return sb.toString();
}

@Override
public void characters(char[] ch, int start, int length)
    throws SAXException {
    // Append only the slice of the buffer that belongs to this event.
    sb.append(ch, start, length);
}

@Override
public void ignorableWhitespace(char[] ch, int start, int length)
    throws SAXException {
    // Respect start/length here too; the rest of the array may hold unrelated data.
    sb.append(ch, start, length);
}

// The methods below do not contribute to the text
@Override
public void endDocument() throws SAXException {
}

@Override
public void endElement(String uri, String localName, String qName)
    throws SAXException {
}

@Override
public void endPrefixMapping(String prefix) throws SAXException {
}


@Override
public void processingInstruction(String target, String data)
    throws SAXException {
}

@Override
public void setDocumentLocator(Locator locator) {
}

@Override
public void skippedEntity(String name) throws SAXException {
}

@Override
public void startDocument() throws SAXException {
}

@Override
public void startElement(String uri, String localName, String qName,
    Attributes atts) throws SAXException {
}

@Override
public void startPrefixMapping(String prefix, String uri)
    throws SAXException {
}
}

Answer 10

Answered by Serge

I think that the simplest way to filter the HTML tags is:

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

public static String removeTags(String string) {
    if (string == null || string.length() == 0) {
        return string;
    }

    Matcher m = REMOVE_TAGS.matcher(string);
    return m.replaceAll("");
}
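To see where this regex helps and where it falls short, here is a small self-contained sketch built around the same pattern. The class name RegexStripDemo and the sample inputs are assumptions made for illustration; the second and third inputs reproduce the two problems raised in the question.

```java
import java.util.regex.Pattern;

public class RegexStripDemo {

    // Same pattern as in the answer above.
    private static final Pattern REMOVE_TAGS = Pattern.compile("<.+?>");

    public static String removeTags(String s) {
        return s == null ? null : REMOVE_TAGS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Well-formed markup is stripped as expected: Hello world
        System.out.println(removeTags("<p>Hello <b>world</b></p>"));
        // Non-HTML text between angle brackets is lost: "a  c"
        System.out.println(removeTags("a < b or b > c"));
        // Entities are left undecoded: Tom &amp; Jerry
        System.out.println(removeTags("Tom &amp; Jerry"));
    }
}
```

So the pattern is fine for quick jobs on trusted, well-formed input, but it neither decodes entities nor distinguishes real tags from other text in angle brackets.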
