Text Extraction from HTML in Java

Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/1386107/

Date: 2020-08-12 11:28:43  Source: igfitidea

Text Extraction from HTML Java

Tags: java, html, screen-scraping, html-content-extraction, text-extraction

Question

I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.

I want to extract the information that is between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:

FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;

// Only lines that contain "<p>" are written, so the rest of a
// multi-line paragraph is skipped.
while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

I tried adding another while loop that would tell the program to keep writing to the file until the line contains the </p> tag:

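while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        // s is never updated inside this inner loop, so the condition
        // never changes and the same line is written over and over.
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}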

But this doesn't work. Could someone please help?

Answer by Niall

Try this (if you don't want to use an HTML parser library):


        FileReader fileReader = new FileReader(file);
        BufferedReader buffRd = new BufferedReader(fileReader);
        BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
        String s;
        // true while the current line lies between <p> and </p>
        boolean inParagraph = false;
        while ((s = buffRd.readLine()) != null)
        {
                if (s.contains("<p>"))
                {
                        inParagraph = true;
                }
                if (inParagraph)
                {
                        try
                        {
                                // each line is written exactly once
                                out.write(s);
                                out.newLine();
                        }
                        catch (IOException e)
                        {
                                e.printStackTrace();
                        }
                }
                if (s.contains("</p>"))
                {
                        inParagraph = false;
                }
        }
        out.close();
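
As a rough, self-contained sketch of the same flag idea, the version below reads from any Reader and writes to any Writer so it can be tried on an in-memory string. The class name ParagraphLineCopier and the method copyParagraphLines are hypothetical names chosen for illustration. It assumes the tags appear literally as "<p>" and "</p>" (no attributes, no upper case) and that paragraphs are not nested; a real HTML parser is more robust.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper class; the names are illustrative only.
public class ParagraphLineCopier {

    // Copies every line from the one containing "<p>" through the one
    // containing "</p>", inclusive. Assumes literal "<p>" / "</p>" tags.
    public static void copyParagraphLines(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean inParagraph = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains("<p>")) {
                inParagraph = true;   // paragraph starts on this line
            }
            if (inParagraph) {
                out.write(line);      // each line is written exactly once
                out.write("\n");
            }
            if (line.contains("</p>")) {
                inParagraph = false;  // paragraph ends on this line
            }
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        String html = "<h1>Title</h1>\n<p>first line\nsecond line</p>\n<div>skipped</div>\n";
        StringWriter result = new StringWriter();
        copyParagraphLines(new StringReader(html), result);
        System.out.print(result);
    }
}
```

Setting the flag before the write and clearing it after means a paragraph that opens and closes on one line is still written once.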

Answer by Gareth Davis

jericho is one of several possible HTML parsers that could make this task both easy and safe.

Answer by skaffman

JTidy can represent an HTML document (even a malformed one) as a document model, which makes extracting the contents of a <p> tag rather more elegant than manually working through the raw text.

Answer by brianary

You may just be using the wrong tool for the job:

perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt

Answer by Billy Bob Bain

I've had success using TagSoup & XPath to parse HTML.

http://home.ccil.org/~cowan/XML/tagsoup/

Answer by camickr

Use a ParserCallback. It's a simple class that's included with the JDK. It notifies you every time a new tag is found, and then you can extract the text of the tag. Simple example:

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class ParserCallbackTest extends HTMLEditorKit.ParserCallback
{
    private int tabLevel = 1;
    private int line = 1;

    public void handleComment(char[] data, int pos)
    {
        displayData(new String(data));
    }

    public void handleEndOfLineString(String eol)
    {
        System.out.println( line++ );
    }

    public void handleEndTag(HTML.Tag tag, int pos)
    {
        tabLevel--;
        displayData("/" + tag);
    }

    public void handleError(String errorMsg, int pos)
    {
        displayData(pos + ":" + errorMsg);
    }

    public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData("mutable:" + tag + ": " + pos + ": " + a);
    }

    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData( tag + "::" + a );
//      tabLevel++;
    }

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData( tag + ":" + a );
        tabLevel++;
    }

    public void handleText(char[] data, int pos)
    {
        displayData( new String(data) );
    }

    private void displayData(String text)
    {
        for (int i = 0; i < tabLevel; i++)
            System.out.print("\t");

        System.out.println(text);
    }

    public static void main(String[] args)
    throws IOException
    {
        ParserCallbackTest parser = new ParserCallbackTest();

        // args[0] is the file to parse

        Reader reader = new FileReader(args[0]);
//      URLConnection conn = new URL(args[0]).openConnection();
//      Reader reader = new InputStreamReader(conn.getInputStream());

        try
        {
            new ParserDelegator().parse(reader, parser, true);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}

So all you need to do is set a boolean flag when the paragraph tag is found. Then in the handleText() method you extract the text.
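
A minimal sketch of that flag idea, using only JDK classes, might look like the following. The class name ParagraphTextCallback and the extract helper are hypothetical names introduced here for illustration, and skipping empty paragraphs and trimming whitespace are choices made for this sketch rather than part of the answer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class name; collects the text found inside <p> elements.
public class ParagraphTextCallback extends HTMLEditorKit.ParserCallback {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false; // the boolean flag

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        if (tag == HTML.Tag.P) {
            inParagraph = true;      // paragraph opened: start collecting
            current.setLength(0);
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inParagraph) {
            current.append(data);    // text inside the paragraph
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && inParagraph) {
            String text = current.toString().trim();
            if (!text.isEmpty()) {   // sketch-level choice: skip empty paragraphs
                paragraphs.add(text);
            }
            inParagraph = false;
        }
    }

    // Hypothetical convenience method: parse a Reader, return paragraph texts.
    public static List<String> extract(Reader reader) throws IOException {
        ParagraphTextCallback callback = new ParagraphTextCallback();
        new ParserDelegator().parse(reader, callback, true);
        return callback.paragraphs;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>";
        for (String p : extract(new StringReader(html))) {
            System.out.println(p);
        }
    }
}
```

To write the result to a file instead of printing it, the loop in main could pass each string to a BufferedWriter.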

Answer by Danny

jsoup

Another HTML parser I really liked using was jsoup. You could get all the <p> elements in 2 lines of code.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");

Then write it out to a file in one more line

out.write(ps.text());  //it will append all of the p elements together in one long string

Or, if you want them on separate lines, you can iterate through the elements and write them out separately.

Answer by Consultant

Try this.

public static void main(String[] args)
{
    String url = "http://en.wikipedia.org/wiki/Big_data";

    try {
        Document document = Jsoup.connect(url).get();
        Elements paragraphs = document.select("p");

        // Print the text of every paragraph, one per line.
        // Iterating directly also handles a page with no <p> elements.
        for (Element p : paragraphs) {
            System.out.println("*  " + p.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}