Text Extraction from HTML Java
Original question: http://stackoverflow.com/questions/1386107/
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address, and attribute the content to the original authors on Stack Overflow (not to this page).
Question
I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file.
I want to extract the information between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:
FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;
while ((s = buffRd.readLine()) != null) {
    // Only lines that contain the opening tag are written
    if (s.contains("<p>")) {
        try {
            out.write(s);
        } catch (IOException e) {
        }
    }
}
I tried adding another while loop, which would tell the program to keep writing to the file until the line contains the </p> tag:
while ((s = buffRd.readLine()) != null) {
    if (s.contains("<p>")) {
        while (!s.contains("</p>")) {
            try {
                out.write(s);
            } catch (IOException e) {
            }
        }
    }
}
But this doesn't work. Could someone please help?
Answer by Niall
Try this (if you don't want to use an HTML parser library):
// Assumes the enclosing method declares "throws IOException"
FileReader fileReader = new FileReader(file);
BufferedReader buffRd = new BufferedReader(fileReader);
BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
String s;
boolean writeTo = false; // true while we are between <p> and </p>
while ((s = buffRd.readLine()) != null)
{
    if (s.contains("<p>"))
    {
        writeTo = true;
    }
    if (writeTo)
    {
        // Write each line exactly once and restore the line break
        // that readLine() strips
        out.write(s);
        out.newLine();
    }
    if (s.contains("</p>"))
    {
        writeTo = false;
    }
}
buffRd.close();
out.close();
Answer by Gareth Davis
Answer by skaffman
Answer by brianary
You may just be using the wrong tool for the job:
perl -ne "print if m|<p>| .. m|</p>|" infile.txt >outfile.txt
Answer by Billy Bob Bain
I've had success using TagSoup and XPath to parse HTML.
Answer by camickr
Use a ParserCallback. It's a simple class that's included with the JDK. It notifies you every time a new tag is found, and then you can extract the text of the tag. Simple example:
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class ParserCallbackTest extends HTMLEditorKit.ParserCallback
{
    private int tabLevel = 1;
    private int line = 1;

    public void handleComment(char[] data, int pos)
    {
        displayData(new String(data));
    }

    public void handleEndOfLineString(String eol)
    {
        System.out.println( line++ );
    }

    public void handleEndTag(HTML.Tag tag, int pos)
    {
        tabLevel--;
        displayData("/" + tag);
    }

    public void handleError(String errorMsg, int pos)
    {
        displayData(pos + ":" + errorMsg);
    }

    public void handleMutableTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData("mutable:" + tag + ": " + pos + ": " + a);
    }

    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData( tag + "::" + a );
        // tabLevel++;
    }

    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        displayData( tag + ":" + a );
        tabLevel++;
    }

    public void handleText(char[] data, int pos)
    {
        displayData( new String(data) );
    }

    private void displayData(String text)
    {
        for (int i = 0; i < tabLevel; i++)
            System.out.print("\t");
        System.out.println(text);
    }

    public static void main(String[] args)
    throws IOException
    {
        ParserCallbackTest parser = new ParserCallbackTest();
        // args[0] is the file to parse
        Reader reader = new FileReader(args[0]);
        // URLConnection conn = new URL(args[0]).openConnection();
        // Reader reader = new InputStreamReader(conn.getInputStream());
        try
        {
            new ParserDelegator().parse(reader, parser, true);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}
So all you need to do is set a boolean flag when the paragraph tag is found. Then, in the handleText() method, you extract the text while that flag is set.
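The flag idea could look roughly like the sketch below. This is not camickr's code: the class name ParagraphTextExtractor, the extract() helper, and the sample HTML are made up for illustration. It assumes reasonably well-formed input; on messy pages the Swing parser may report implied tags of its own, so the results can differ from what you expect.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical class name; collects the text found inside each <p> element.
public class ParagraphTextExtractor extends HTMLEditorKit.ParserCallback
{
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph = false; // the boolean flag described above

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
    {
        if (tag == HTML.Tag.P)
        {
            inParagraph = true;
            current.setLength(0); // start collecting a fresh paragraph
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos)
    {
        if (tag == HTML.Tag.P && inParagraph)
        {
            inParagraph = false;
            paragraphs.add(current.toString().trim());
        }
    }

    @Override
    public void handleText(char[] data, int pos)
    {
        // Text outside <p>...</p> is ignored
        if (inParagraph)
        {
            current.append(data);
        }
    }

    // Hypothetical helper: parses the input and returns one string per paragraph.
    public static List<String> extract(Reader reader) throws IOException
    {
        ParagraphTextExtractor callback = new ParagraphTextExtractor();
        new ParserDelegator().parse(reader, callback, true);
        return callback.paragraphs;
    }

    public static void main(String[] args) throws IOException
    {
        // Sample input for illustration; swap in a FileReader for a real file
        String html = "<html><body><h1>Title</h1><p>Hello</p><p>World</p></body></html>";
        for (String p : extract(new StringReader(html)))
        {
            System.out.println(p);
        }
    }
}
```

Because the parser works on the whole stream rather than line by line, a paragraph that spans several lines of the file still ends up in a single string, which is the part the readLine() approach struggles with.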
Answer by Danny
jsoup
Another HTML parser I really liked using was jsoup. You could get all the <p> elements in 2 lines of code.
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements ps = doc.select("p");
Then write it out to a file in one more line:
out.write(ps.text()); //it will append all of the p elements together in one long string
Or, if you want them on separate lines, you can iterate through the elements and write them out separately.
Answer by Consultant
Try this.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Class name added so the snippet compiles; it is not part of the original answer
public class ParagraphPrinter
{
    public static void main( String[] args )
    {
        String url = "http://en.wikipedia.org/wiki/Big_data";
        try {
            Document document = Jsoup.connect(url).get();
            Elements paragraphs = document.select("p");
            // Elements is iterable, so no manual index bookkeeping is needed,
            // and a page without any <p> elements simply prints nothing
            for (Element p : paragraphs) {
                System.out.println("* " + p.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}