Java: How to find if a String contains HTML data?

Notice: this page is a Chinese-English parallel translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverFlow. Original URL: http://stackoverflow.com/questions/3052052/

Time: 2020-08-13 16:00:48  Source: igfitidea

How to find if String contains html data?

Tags: java, html

Asked by Joe

How do I find out whether a string contains HTML data or not? The user provides input via a web interface, and it is quite possible that they used either plain text or HTML formatting.

Accepted answer by Tom Gullen

You can use regular expressions to search for HTML tags.
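
As a minimal sketch of this suggestion: the class name, the pattern, and the sample inputs below are my own assumptions, not part of the answer. The pattern only looks for something shaped like an opening, closing, or self-closing tag, so it is a heuristic and can still report false positives.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the original answer.
final class SimpleTagCheck {
    // Matches something shaped like <b>, </i>, <br/> or <a href="x">.
    private static final Pattern TAG =
        Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    // Returns true when the input contains at least one tag-like token.
    static boolean containsTag(String input) {
        return input != null && TAG.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTag("Hello <b>world</b>")); // true
        System.out.println(containsTag("1 < 2 and 3 > 2"));    // false
    }
}
```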

Answer by pakore

In your backing bean, you can try to find HTML tags such as <b> or <i>. You can use regular expressions (slow) or just look for the "<" and ">" characters. It depends on how sure you want to be that the user actually used HTML.

Keep in mind that the user could write <asdf>. If you want to be 100% sure that the html used is valid you will need to use a complex html parser from some library (TidyHTML maybe?)
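
The cheap character check could be sketched as follows. The class and method names are hypothetical, and the rule "a '<' followed later by a '>'" is my own reading of the suggestion. It makes the trade-off visible: it is fast, but it accepts <asdf> and even plain comparisons.

```java
// Hypothetical helper illustrating the cheap "<>" check.
final class AngleBracketCheck {
    // True when a '<' is followed somewhere later by a '>'.
    static boolean mayContainHtml(String input) {
        if (input == null) {
            return false;
        }
        int open = input.indexOf('<');
        return open >= 0 && input.indexOf('>', open + 1) > open;
    }

    public static void main(String[] args) {
        System.out.println(mayContainHtml("<asdf>"));          // true, although it is not real HTML
        System.out.println(mayContainHtml("3 > 2"));           // false
        System.out.println(mayContainHtml("1 < 2 and 3 > 2")); // true, a false positive
    }
}
```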

Answer by 1s2a3n4j5e6e7v

Regular expressions are the tool to use here. They help you find potential HTML tags. You can then check whether the name inside each candidate tag matches a known HTML keyword. If one is found, show an alert telling the user not to use HTML, or simply strip the tag if you prefer.
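
One possible reading of this two-step idea is sketched below. The class name, the candidate pattern, and the short keyword list are all my own assumptions (a real list of HTML element names would be much longer). It first collects tag-like tokens with a regular expression and then compares each tag name against the keyword set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: find tag-like tokens, then compare names with known HTML keywords.
final class KnownTagCheck {
    // Group 1 captures the tag name of an opening or closing tag.
    private static final Pattern CANDIDATE =
        Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>");

    // Illustrative subset only, not a complete list of HTML elements.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
        "a", "b", "i", "u", "p", "br", "div", "span", "em", "strong",
        "ul", "ol", "li", "img", "table", "tr", "td", "h1", "h2", "h3"));

    static boolean containsKnownTag(String input) {
        if (input == null) {
            return false;
        }
        Matcher m = CANDIDATE.matcher(input);
        while (m.find()) {
            if (KNOWN_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsKnownTag("<B>bold</B>"));     // true
        System.out.println(containsKnownTag("<asdf>hi</asdf>")); // false
    }
}
```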

Answer by Tom Gullen

If you don't want the user to have HTML in their input, you can replace all '<' characters with their HTML entity equivalent, '&lt;', and all '>' characters with '&gt;'.
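
A minimal sketch of that replacement is below, with hypothetical class and method names. Escaping '&' as well is my own addition, not part of the answer. It is done first so that entities already present in the input are not left half-escaped.

```java
// Hypothetical helper that neutralises HTML by escaping its special characters.
final class HtmlEscape {
    static String escape(String input) {
        if (input == null) {
            return null;
        }
        return input
            .replace("&", "&amp;")  // assumption: also escape '&', and do it first
            .replace("<", "&lt;")
            .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
        System.out.println(escape("<b>Tom & Jerry</b>"));
    }
}
```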

Answer by David H. Bennett

I know this is an old question but I ran into it and was looking for something more comprehensive that could detect things like HTML entities and would ignore other uses of < and > symbols. I came up with the following class that works well.

You can play with it live at http://ideone.com/HakdHo

I also uploaded this to GitHub with a bunch of JUnit tests.

package org.github;

import java.util.regex.Pattern;

/**
 * Detect HTML markup in a string.
 * This will detect tags or entities.
 *
 * @author [email protected] - David H. Bennett
 */
public class DetectHtml
{
    // adapted from post by Phil Haack and modified to match better
    // an opening tag, optionally with attributes, e.g. <a href="x">
    public final static String tagStart=
        "\\<\\w+((\\s+\\w+(\\s*\\=\\s*(?:\".*?\"|'.*?'|[^'\"\\>\\s]+))?)+\\s*|\\s*)\\>";
    // a closing tag, e.g. </a>
    public final static String tagEnd=
        "\\</\\w+\\>";
    // a self-closing tag, e.g. <br/>
    public final static String tagSelfClosing=
        "\\<\\w+((\\s+\\w+(\\s*\\=\\s*(?:\".*?\"|'.*?'|[^'\"\\>\\s]+))?)+\\s*|\\s*)/\\>";
    // a named entity, e.g. &amp;
    public final static String htmlEntity=
        "&[a-zA-Z][a-zA-Z0-9]+;";
    // an open/close tag pair, a self-closing tag, or an entity
    public final static Pattern htmlPattern=Pattern.compile(
      "("+tagStart+".*"+tagEnd+")|("+tagSelfClosing+")|("+htmlEntity+")",
      Pattern.DOTALL
    );

    /**
     * Will return true if s contains HTML markup tags or entities.
     *
     * @param s String to test
     * @return true if string contains HTML
     */
    public static boolean isHtml(String s) {
        boolean ret=false;
        if (s != null) {
            ret=htmlPattern.matcher(s).find();
        }
        return ret;
    }

}
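
A short usage sketch follows. So that it compiles on its own, it repeats the answer's patterns inside a hypothetical class named DetectHtmlDemo. The sample inputs are my own and are not taken from the author's JUnit tests.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone copy of the patterns above, for demonstration only.
final class DetectHtmlDemo {
    private static final String TAG_START =
        "\\<\\w+((\\s+\\w+(\\s*\\=\\s*(?:\".*?\"|'.*?'|[^'\"\\>\\s]+))?)+\\s*|\\s*)\\>";
    private static final String TAG_END = "\\</\\w+\\>";
    private static final String TAG_SELF_CLOSING =
        "\\<\\w+((\\s+\\w+(\\s*\\=\\s*(?:\".*?\"|'.*?'|[^'\"\\>\\s]+))?)+\\s*|\\s*)/\\>";
    private static final String HTML_ENTITY = "&[a-zA-Z][a-zA-Z0-9]+;";
    private static final Pattern HTML_PATTERN = Pattern.compile(
        "(" + TAG_START + ".*" + TAG_END + ")|(" + TAG_SELF_CLOSING + ")|(" + HTML_ENTITY + ")",
        Pattern.DOTALL);

    static boolean isHtml(String s) {
        return s != null && HTML_PATTERN.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isHtml("<a href=\"x\">link</a>")); // true: open/close tag pair
        System.out.println(isHtml("line<br/>break"));         // true: self-closing tag
        System.out.println(isHtml("Fish &amp; chips"));       // true: HTML entity
        System.out.println(isHtml("5 < 6 && 7 > 3"));         // false: bare comparison symbols
    }
}
```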

Answer by Paweł Skorupiński

I'm using regex:

[\S\s]*\<html[\S\s]*\>[\S\s]*\<\/html[\S\s]*\>[\S\s]*

So in Java it looks like:

text.matches("[\\S\\s]*\\<html[\\S\\s]*\\>[\\S\\s]*\\<\\/html[\\S\\s]*\\>[\\S\\s]*");

It should match any correct (as well as some incorrect) XML file that contains an "html" element somewhere. So there might be false positives.

Edit:

Since posting this, I have removed the last part, which matches the closing html element, as I found that some websites don't use it. (?!) So if you prefer false positives to false negatives, I encourage you to do the same!
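
To make the difference concrete, here is a small sketch comparing the two variants. The class and method names are hypothetical, and the relaxed pattern is my reconstruction of "the regex without the closing part", since the edited expression itself is not shown.

```java
// Hypothetical comparison of the strict and relaxed <html> element checks.
final class HtmlElementCheck {
    // Requires both an opening <html ...> and a closing </html ...>.
    static boolean strict(String text) {
        return text != null && text.matches(
            "[\\S\\s]*\\<html[\\S\\s]*\\>[\\S\\s]*\\<\\/html[\\S\\s]*\\>[\\S\\s]*");
    }

    // Assumed relaxed form: only the opening <html ...> is required.
    static boolean relaxed(String text) {
        return text != null && text.matches("[\\S\\s]*\\<html[\\S\\s]*\\>[\\S\\s]*");
    }

    public static void main(String[] args) {
        String unclosed = "<html><body>hi</body>";
        System.out.println(strict(unclosed));  // false
        System.out.println(relaxed(unclosed)); // true
    }
}
```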

Answer by Gorky

The pattern below will match any pair of opening and closing tags. You can also use its capture groups to extract the tag name, attributes, and value.

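    // requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
    Pattern pattern = Pattern.compile("<(\\w+)( +.+)*>((.*))</\\1>");
    Matcher matcher = pattern.matcher("<as testAttr='5'> TEST</as>");
    if (matcher.find()) {
        // group 0 is the whole match; groups 1..groupCount() are the captures
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(i + ":" + matcher.group(i));
        }
    }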