Java: How to extract HTML tags to get only certain information?

Notice: this page is a translated copy of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/15077801/

Time: 2020-10-31 18:24:35  Source: igfitidea

How to extract HTML tags to get only certain information?

Tags: java, html, string, extraction

Asked by art3m1sm00n

I need to extract the webpage's title from between the <title> and </title> tags.

I also need to display all of the text located between the <body...> and </body> tags, but nothing outside the body.

I don't want any angle brackets or any of the HTML markup returned.

Answered by Igor Rodriguez

You can use something like:

String html = "<html>My page</html>";
String title = html.substring(html.indexOf("<html>") + 6, html.indexOf("</html>"));
System.out.println(title);

The String.indexOf(String) method returns the start index of a string (in the example, "<html>" and "</html>") within the given string (the variable html).

The String.substring(int, int) method returns the string between two indexes.

With this, you can start your browser.
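
As a hedged illustration of the two methods described above, the sketch below wraps them in a small helper. The class name SubstringBetween and the method between are hypothetical and not part of the original answer. The only addition is a check for indexOf returning -1 when a marker is missing, since the one-liner would otherwise return the wrong text or throw.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the original answer. */
final class SubstringBetween {

    /**
     * Returns the text between the first occurrence of {@code open} and the
     * next occurrence of {@code close}, or empty if either marker is missing.
     */
    static Optional<String> between(String text, String open, String close) {
        int openAt = text.indexOf(open);
        if (openAt < 0) {
            return Optional.empty();            // indexOf returns -1 when nothing is found
        }
        int start = openAt + open.length();     // first character after the opening marker
        int end = text.indexOf(close, start);   // only look for the close after the open
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end));
    }

    public static void main(String[] args) {
        String html = "<html>My page</html>";
        System.out.println(between(html, "<html>", "</html>").orElse("(not found)"));
    }
}
```

Returning an Optional is just one way to signal a missing marker; returning null or throwing would work as well.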

Answered by T.J. Crowder

To simplify my question, how do I search through a giant string to find another string and record its location?

String#indexOf(String). For instance:

int index = bigString.indexOf("<body");

...finds the first occurrence of <body in bigString and returns its index (which you could use with substring). But if you're not sure how to do that, the assignment is nuts. The course should have properly prepared you for this task, and it seems like it hasn't.

Parsing HTML is complicated. You can do a half-complete, incorrect job using indexOf and substring, but it will be... half-complete and incorrect.
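
To make the indexOf plus substring idea concrete, here is a minimal sketch under stated assumptions. The class name BodyLocator and the method rawBody are hypothetical. It searches for "<body" rather than "<body>" so that an opening tag with attributes is still found, then skips to the next '>' to reach the content. It is exactly the half-complete kind of job described above: it would also match a tag such as <bodyguard>, it breaks if an attribute value contains '>', and it returns the inner markup with its tags still in place.

```java
/** Hypothetical helper for illustration; a naive, deliberately incomplete approach. */
final class BodyLocator {

    /** Returns the raw markup between <body ...> and </body>, or null if not found. */
    static String rawBody(String bigString) {
        int open = bigString.indexOf("<body");
        if (open < 0) {
            return null;                                   // no opening tag at all
        }
        int tagEnd = bigString.indexOf('>', open);         // end of the opening tag, past any attributes
        if (tagEnd < 0) {
            return null;
        }
        int close = bigString.indexOf("</body>", tagEnd);  // closing tag must come after the opening one
        if (close < 0) {
            return null;
        }
        return bigString.substring(tagEnd + 1, close);
    }

    public static void main(String[] args) {
        System.out.println(rawBody("<html><body class=\"x\"><p>Hi</p></body></html>"));
    }
}
```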

Answered by Jason Sperske

There are a lot of ways to approach this problem, but given the constraints you have presented, let's take a low-level approach. First, assume you have read the entire HTML document into a string called html. The first task will be to search for '<title>'. There is a lot of error checking that this answer will not cover, but then we can't do all of your homework for you :P, so we will assume that the title elements are in lowercase and well formed:

First we need to determine where in the HTML the title element is (here I am using indexOf()):

int start = html.indexOf("<title>")+"<title>".length();
int end = html.indexOf("</title>", start);

Then extract it into a string (using substring()):

String title = html.substring(start, end);
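
The answer above deliberately leaves out error checking and assumes lowercase tags. As a hedged sketch of what relaxing those two assumptions might look like, the hypothetical TitleExtractor below matches the tags case-insensitively with String.regionMatches and returns null when either tag is missing. It still assumes the opening tag is written exactly as <title> with no attributes or extra whitespace.

```java
/** Hypothetical helper for illustration; not part of the original answer. */
final class TitleExtractor {

    /** Case-insensitive indexOf built on String.regionMatches; returns -1 if not found. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the trimmed title text, or null if <title> or </title> is missing. */
    static String extractTitle(String html) {
        int open = indexOfIgnoreCase(html, "<title>", 0);
        if (open < 0) {
            return null;
        }
        int start = open + "<title>".length();
        int end = indexOfIgnoreCase(html, "</title>", start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<HTML><HEAD><Title> Hello </TITLE></HEAD></HTML>"));
    }
}
```

Searching with regionMatches rather than lowercasing the whole document keeps the indexes valid for the original string, because lowercasing can change the length of some characters.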

Answered by Javier

From your description you don't need to parse the complete HTML document, but only extract some information from it. An approach based on a Finite State Machine will work.

Scan until you find a <title> element. From that point anything is data, until you find a closing </title>. Then scan until you find an opening <body>. From that point you will need to read the "content", skipping anything that is between < and >, which may be done as follows:

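// the input stream "in" (a Reader) is positioned just after <body>
StringBuilder body = new StringBuilder();
StringBuilder element = new StringBuilder();
boolean ignore = false;            // true while we are between '<' and '>'
int c;                             // int, so that -1 (end of stream) can be detected
while ((c = in.read()) != -1) {
  if (ignore) {
    if (c == '>') {
      if (element.toString().equals("/body")) break; // closing </body>
      ignore = false;
    } else {
      element.append((char) c);
    }
  } else {
    // not in ignore mode
    if (c == '<') { element.setLength(0); ignore = true; }
    else body.append((char) c);
  }
}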
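
The fragment above covers only the last phase. As a rough sketch of the whole state machine described in this answer, the hypothetical class TitleBodyScanner below walks through the states SEEK_TITLE, IN_TITLE, SEEK_BODY and IN_BODY in that order. The state names, the class name and the scanString helper are assumptions made for illustration. It assumes the title appears before the body, treats everything between < and > as a tag, and does not handle comments, scripts or character entities.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Locale;

/** Hypothetical class for illustration; a sketch of the state machine, not the answer's code. */
final class TitleBodyScanner {
    enum State { SEEK_TITLE, IN_TITLE, SEEK_BODY, IN_BODY, DONE }

    final StringBuilder title = new StringBuilder();
    final StringBuilder body = new StringBuilder();

    /** Convenience wrapper for scanning an in-memory document. */
    static TitleBodyScanner scanString(String html) {
        TitleBodyScanner scanner = new TitleBodyScanner();
        try (Reader in = new StringReader(html)) {
            scanner.scan(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return scanner;
    }

    void scan(Reader in) throws IOException {
        State state = State.SEEK_TITLE;
        StringBuilder tag = null;   // non-null while we are between '<' and '>'
        int c;
        while (state != State.DONE && (c = in.read()) != -1) {
            if (tag != null) {
                if (c == '>') {                 // tag complete: maybe change state
                    state = next(state, tagName(tag));
                    tag = null;
                } else {
                    tag.append((char) c);
                }
            } else if (c == '<') {
                tag = new StringBuilder();      // start collecting a tag
            } else if (state == State.IN_TITLE) {
                title.append((char) c);
            } else if (state == State.IN_BODY) {
                body.append((char) c);
            }                                   // text in the SEEK_* states is dropped
        }
    }

    /** Lowercased tag name without attributes, e.g. "body" or "/body". */
    private static String tagName(CharSequence tag) {
        String s = tag.toString().trim();
        int i = 0;
        while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return s.substring(0, i).toLowerCase(Locale.ROOT);
    }

    private static State next(State state, String name) {
        switch (state) {
            case SEEK_TITLE: return name.equals("title") ? State.IN_TITLE : state;
            case IN_TITLE:   return name.equals("/title") ? State.SEEK_BODY : state;
            case SEEK_BODY:  return name.equals("body") ? State.IN_BODY : state;
            case IN_BODY:    return name.equals("/body") ? State.DONE : state;
            default:         return state;
        }
    }
}
```

A call such as TitleBodyScanner.scanString(html) returns the scanner, whose title and body fields then hold the collected text.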

Answered by christopher

There are two developmental phases programmers use to solve these sorts of problems:

1. Parse out the data yourself:

In HTML (good HTML) most tags are followed by closing tags. A <title> tag is one of them. If you're trying to find what is in between them, find the index of <title>. You probably want the index of its last character, >, just for ease.

Then, while the current character is not <, add that character to a string.

When you hit a <, you should check whether it's </title>. If not, continue reading. Essentially you keep looping: each time you hit a <, check whether it's a closing title tag.

When you realize that this is super hard and re-inventing the wheel, advance to step 2:

2. Use a DOM parser library.

After you have hurt yourself trying to do step 1, you discover why programmers strongly advise you never to parse HTML by hand or use regex on HTML. Realize the battle has already been fought and won with battle-tested HTML parsers: see "What are the pros and cons of the leading Java HTML parsers?"
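
As a hedged illustration of step 2, the sketch below hands the tag handling to an existing parser instead of scanning characters by hand. To stay self-contained it uses the callback-based HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), which is a swapped-in choice and not a DOM parser library of the kind the answer points to. The class name ParserBasedExtractor is hypothetical, and joining the body's text chunks with a single space is an assumption about the desired output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical class for illustration; collects title and body text via parser callbacks. */
final class ParserBasedExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = true;
        if (tag == HTML.Tag.BODY) inBody = true;
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = false;
        if (tag == HTML.Tag.BODY) inBody = false;
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title.append(data);
        } else if (inBody) {
            if (body.length() > 0) body.append(' ');   // separate text chunks with a space
            body.append(data);
        }
    }

    String title() { return title.toString().trim(); }

    String body() { return body.toString().trim(); }

    /** Parses an in-memory document and returns the populated extractor. */
    static ParserBasedExtractor parse(String html) {
        ParserBasedExtractor extractor = new ParserBasedExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), extractor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return extractor;
    }
}
```

After ParserBasedExtractor.parse(html), the title() and body() methods return the collected text with no angle brackets. A dedicated third-party parser would usually offer a more convenient API than these callbacks.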