Java: parsing HTML with JSoup
Note: this page reproduces a popular StackOverflow question and its answer, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL, and attribute it to the original authors (not me): StackOverflow
Original source: http://stackoverflow.com/questions/7830972/
JSoup parsing HTML
Asked by Lars
I am trying to use JSoup to parse an HTML file that is not well formed against its DTD, which I retrieve through an InputStream, and to get all the data in the TD fields. How can I do that with JSoup? I have already looked at http://jsoup.org/cookbook/ but I need an example to get started.
Thank you in advance.
I already tried a SAX parser, but I can't get the DTD to work.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl" lang="nl">
<TABLE class=personaltable cellSpacing=0 cellPadding=0>
<TBODY>
<TR class=alternativerow>
<TD>Nieuw beltegoed:</TD>
<TD> 1,00</TD></TR>
<TR>
<TD>Tegoed vorige periode:
<TD> 2,00</TD></TD></TR>
<TR class=alternativerow>
<TD>Tegoed tot 09-11-2011:
<TD> 10,00</TD></TD></TR>
<TR>
<TD>
<TD height=25></TD>
<TR class=alternativerow>
<TD>Verbruik sinds nieuw tegoed:</TD>
<TD> 0,33</TD></TR>
<TR>
<TD>Ongebruikt tegoed:</TD>
<TD> 12,00</TD></TR>
<TR class=alternativerow>
<TD class=f-Orange>Verbruik boven bundel:</TD>
<TD class=f-Orange> 0,00</TD></TR>
<TR>
<TD>Verbruik dat niet in de bundel zit*:</TD>
<TD> 0,00</TD></TR>
</TBODY>
</TABLE>
</html>
Edit: I am getting a force close. I need to run JSoup in my AsyncTask. Here is the Logcat:
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): FATAL EXCEPTION: main
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): java.lang.NullPointerException
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.sencide.AndroidLogin$MyTask.onPostExecute(AndroidLogin.java:276)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.sencide.AndroidLogin$MyTask.onPostExecute(AndroidLogin.java:1)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.AsyncTask.finish(AsyncTask.java:417)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.AsyncTask.access0(AsyncTask.java:127)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.AsyncTask$InternalHandler.handleMessage(AsyncTask.java:429)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.Handler.dispatchMessage(Handler.java:99)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.os.Looper.loop(Looper.java:130)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at android.app.ActivityThread.main(ActivityThread.java:3835)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at java.lang.reflect.Method.invokeNative(Native Method)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at java.lang.reflect.Method.invoke(Method.java:507)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:847)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:605)
10-20 21:07:36.679: ERROR/AndroidRuntime(1396): at dalvik.system.NativeStart.main(Native Method)
Here is the AsyncTask code:
public class MyTask extends AsyncTask<String, Integer, String> {
private Elements tdsFromSecondColumn=null;
}
protected String doInBackground(String... params) {
InputStream inputStreamActivity = response.getEntity().getContent();
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStreamActivity));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line + "\n");
}
/******* CLOSE CONNECTION AND STREAM *******/
System.out.println(sb);
inputStreamActivity.close();
String kpn;
kpn = sb.toString();
Document doc = Jsoup.parse(kpn);
Elements tdsFromSecondColumn = doc.select("table.personaltable td:eq(1)");
}
@Override
protected void onPostExecute(String result) {
//publishProgress(false);
TextView tv = (TextView)findViewById(R.id.lbl_top);
for (Element tdFromSecondColumn : tdsFromSecondColumn) {
//System.out.println(tdFromSecondColumn.text());
tv.setText("");
tv.setText(tdFromSecondColumn.text());
}
}
}
Answered by BalusC
So, you have an InputStream and not a URL? You should then use the Jsoup#parse() method which takes an InputStream:
Document document = Jsoup.parse(inputStream, charsetName, baseUri);
// ...
The charsetName should be the charset the document is originally encoded in. You can leave it null to let Jsoup decide or fall back to UTF-8. The baseUri should be the URL from which the HTML was originally served. You can leave it null (or pass an empty string, in case your Jsoup version rejects a null base URI); you just won't be able to resolve relative links.
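To make the effect of the two extra arguments concrete, here is a minimal sketch. The class name ParseFromStream, the sample HTML, and the base URI http://example.com/ are assumptions made up for illustration; only Jsoup.parse(InputStream, String, String), select(), attr(), and absUrl() come from Jsoup itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseFromStream {

    // Parses HTML from a stream; charsetName and baseUri are handed straight to Jsoup.
    static Document parse(InputStream in, String charsetName, String baseUri) {
        try {
            return Jsoup.parse(in, charsetName, baseUri);
        } catch (IOException e) {
            // Wrapped so that callers do not have to declare a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for the stream received from the server.
        String html = "<html><body><a href=\"/cookbook/\">Cookbook</a></body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

        Document document = parse(in, "UTF-8", "http://example.com/");
        Element link = document.select("a").first();

        System.out.println(link.attr("href"));   // the attribute as written: /cookbook/
        System.out.println(link.absUrl("href")); // resolved: http://example.com/cookbook/
    }
}
```

With an empty base URI, attr("href") would still return /cookbook/, but absUrl("href") would have nothing to resolve against.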
But if you actually have the original URL, then you could also just use Jsoup#connect():
Document document = Jsoup.connect(url).get();
// ...
Regardless of the way you obtained the Document, you can use CSS selectors to select elements of interest in the document. See also the Jsoup cookbook on that subject. Here's an example which extracts all the data from the 2nd column of the <table> with a class name of personaltable:
Elements tdsFromSecondColumn = document.select("table.personaltable td:eq(1)");
for (Element tdFromSecondColumn : tdsFromSecondColumn) {
    System.out.println(tdFromSecondColumn.text());
}
which results in:
1,00
2,00
10,00
0,33
12,00
0,00
0,00
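A note on the AsyncTask from the question: inside doInBackground, the line starting with "Elements tdsFromSecondColumn =" declares a new local variable, so the field of the same name that onPostExecute reads stays null. That is one likely cause of the NullPointerException in the Logcat, though not necessarily the only one. The sketch below avoids the shared field by returning the extracted values from a method. The class name SecondColumnExtractor, the method secondColumn, and the shortened sample table are assumptions for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SecondColumnExtractor {

    // Returns the text of every second-column cell of table.personaltable.
    static List<String> secondColumn(String html) {
        Document document = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element td : document.select("table.personaltable td:eq(1)")) {
            values.add(td.text());
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical stand-in for the table in the question.
        String html = "<table class=personaltable>"
                + "<tr><td>Nieuw beltegoed:</td><td>1,00</td></tr>"
                + "<tr><td>Ongebruikt tegoed:</td><td>12,00</td></tr>"
                + "</table>";
        for (String value : secondColumn(html)) {
            System.out.println(value);
        }
    }
}
```

In an AsyncTask, a list like this could be the return value of doInBackground, which Android then passes to onPostExecute; the task's result type would have to change accordingly. Note also that calling setText in a loop, as the question does, leaves only the last value visible.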