Java: How to parse an HTML table using jsoup?
Original question: http://stackoverflow.com/questions/24772828/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
How to parse HTML table using jsoup?
Asked by john
I am trying to parse HTML using jsoup. This is my first time working with jsoup, and I have read some tutorials on it as well. Below is the HTML table which I am trying to parse.
As you can see, the table below has three tr rows as of now (I have shortened it to three table rows to keep it readable, but in general there will be more). I would like to extract the Cluster Name from the table and its corresponding Host Name. For example, I would extract Titan as the cluster name, together with all of its hostnames whose status is down.
As you can see below, the Titan cluster has two hostnames, machineA.abc.com and machineB.abc.com. The status of machineA is up, but the status of machineB is down.
So I will print out Titan as the cluster name and machineB.abc.com as the hostname, since it is down. Is this possible using jsoup?
<table border=1>
<tr>
<td> </td>
<td> </td>
<td>Alert</td>
<td>Cluster Name</td>
<td>IP addr</td>
<td>Host Name</td>
<td>Type</td>
<td>Status</td>
<td>Free</td>
<td>Version</td>
<td>Restart Time</td>
<td>UpTime(Days)</td>
<td>Last probed</td>
<td>Last up</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Titan</td>
<td>10.100.111.77</td>
<td>machineA.abc.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td></td>
<td>10.200.192.99</td>
<td>machineB.abc.com</td>
<td></td>
<td bgcolor="ffffff">down</td>
<td bgcolor="ffffff" align=right>85%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
<td bgcolor="ffffff" align=right>103</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
</table>
So far, I am able to fetch the whole HTML table using jsoup, but I am not sure how to extract the cluster name and the hostnames which are down:
URL url = new URL("url_name");
Document doc = Jsoup.parse(url, 3000);
Update:
I might have two cluster names in the table, as shown below:
<table border=1>
<tr>
<td> </td>
<td> </td>
<td>Alert</td>
<td>Cluster Name</td>
<td>IP addr</td>
<td>Host Name</td>
<td>Type</td>
<td>Status</td>
<td>Free</td>
<td>Version</td>
<td>Restart Time</td>
<td>UpTime(Days)</td>
<td>Last probed</td>
<td>Last up</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Titan</td>
<td>10.100.111.77</td>
<td>machineA.abc.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td></td>
<td>10.200.192.99</td>
<td>machineB.abc.com</td>
<td></td>
<td bgcolor="ffffff">down</td>
<td bgcolor="ffffff" align=right>85%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:52:20,613</td>
<td bgcolor="ffffff" align=right>103</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
<tr bgcolor="ffffff">
<td><a href=showlog?ip_addr=127.0.0.1>Hist</a></td>
<td><a href=http://127.0.0.1:8080/test?full=y>VI</a></td>
<td bgcolor="ffffff"> </td>
<td>Goldy</td>
<td>10.100.111.77</td>
<td>machineH.pqr.com</td>
<td></td>
<td bgcolor="ffffff">up</td>
<td bgcolor="ffffff" align=right>88%</td>
<td bgcolor="ffffff">2.0.5-SNAPSHOT</td>
<td bgcolor="ffffff">2014-07-04 01:49:08,220</td>
<td bgcolor="ffffff" align=right>381</td>
<td>07-14 20:01:59</td>
<td>07-14 20:01:59</td>
</tr>
</table>
As you can see above, there are now two cluster names, Titan and Goldy. I want to find all the machines which are down for the Titan cluster only.
Accepted answer by user2640782
Yes, it is possible with jsoup. First, select the table. Then select the <tr> tags for the rows. You can start from the second row (index 1), since the first row contains only the column names. Then read the <td> cells of each row at the indexes you need. In your case, indexes 7 and 5 are the important ones (index 7: Status, index 5: Host Name). Check whether the status equals down and, if it does, add the Host Name to a list. That's all.
ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); // select the first table
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { // first row holds the column names, so skip it
    Element row = rows.get(i);
    Elements cols = row.select("td");
    if (cols.get(7).text().equals("down")) { // index 7: Status
        downServers.add(cols.get(5).text()); // index 5: Host Name
    }
}
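The hard-coded indexes 5 and 7 could also be looked up from the header row, so the code keeps working if columns are reordered. The following is only a sketch in plain Java: it assumes each row has already been turned into a list of cell texts (for example by calling text() on every td), and the class and method names (ColumnLookup, indexOfColumn, downHosts) are hypothetical, not part of jsoup or of the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnLookup {
    // Returns the position of the column with the given header text, or -1 if absent.
    static int indexOfColumn(List<String> headers, String name) {
        for (int i = 0; i < headers.size(); i++) {
            if (headers.get(i).trim().equalsIgnoreCase(name)) {
                return i;
            }
        }
        return -1;
    }

    // Collects the host names of all rows whose Status cell equals "down".
    // rows.get(0) is assumed to be the header row.
    static List<String> downHosts(List<List<String>> rows) {
        List<String> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        int hostCol = indexOfColumn(rows.get(0), "Host Name");
        int statusCol = indexOfColumn(rows.get(0), "Status");
        if (hostCol < 0 || statusCol < 0) {
            return result; // required columns not present
        }
        for (List<String> row : rows.subList(1, rows.size())) {
            // Ignore rows that are too short to contain both cells.
            if (row.size() > Math.max(hostCol, statusCol)
                    && row.get(statusCol).trim().equals("down")) {
                result.add(row.get(hostCol).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("", "", "Alert", "Cluster Name", "IP addr", "Host Name", "Type", "Status"),
            Arrays.asList("Hist", "VI", "", "Titan", "10.100.111.77", "machineA.abc.com", "", "up"),
            Arrays.asList("Hist", "VI", "", "", "10.200.192.99", "machineB.abc.com", "", "down"));
        System.out.println(downHosts(rows)); // prints [machineB.abc.com]
    }
}
```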
Update: When you find the word Titan, you can start another loop that keeps going for as long as the cluster name of the following rows is empty.
Edit: I changed the while loop to a do-while loop.
ArrayList<String> downServers = new ArrayList<>();
Element table = doc.select("table").get(0); // select the first table
Elements rows = table.select("tr");
for (int i = 1; i < rows.size(); i++) { // first row holds the column names, so skip it
    Element row = rows.get(i);
    Elements cols = row.select("td");
    if (cols.get(3).text().equals("Titan")) {
        if (cols.get(7).text().equals("down"))
            downServers.add(cols.get(5).text());
        if (i == rows.size() - 1)
            break; // Titan is the last row: nothing follows, and looping on would never terminate
        // Walk the following rows; an empty cluster name means "same cluster as above".
        do {
            i++;
            row = rows.get(i);
            cols = row.select("td");
            if (cols.get(7).text().equals("down") && cols.get(3).text().equals("")) {
                downServers.add(cols.get(5).text());
            }
            if (i == rows.size() - 1)
                break;
        }
        while (cols.get(3).text().equals(""));
        i--; // step back so the outer loop re-checks this row, in case two Titan names appear consecutively
    }
}
The downServers ArrayList will contain the hostnames of the servers that are down.
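An alternative to the nested do-while loop is to remember the most recent non-empty cluster name while walking the rows once. The sketch below is plain Java and makes a few assumptions: each data row is already a list of cell texts (the header row excluded), the column positions 3, 5 and 7 match the table in the question, and the names ClusterFilter, downHostsFor and row are hypothetical helpers introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClusterFilter {
    // Column positions taken from the table in the question.
    static final int CLUSTER_COL = 3;
    static final int HOST_COL = 5;
    static final int STATUS_COL = 7;

    // Returns the host names that are down within the given cluster.
    // A row with an empty cluster cell inherits the cluster of the row above it.
    static List<String> downHostsFor(String cluster, List<List<String>> dataRows) {
        List<String> down = new ArrayList<>();
        String current = "";
        for (List<String> cells : dataRows) {
            if (cells.size() <= STATUS_COL) {
                continue; // malformed row, skip it
            }
            String name = cells.get(CLUSTER_COL).trim();
            if (!name.isEmpty()) {
                current = name; // a new cluster starts here
            }
            if (current.equals(cluster) && cells.get(STATUS_COL).trim().equals("down")) {
                down.add(cells.get(HOST_COL).trim());
            }
        }
        return down;
    }

    // Hypothetical helper: builds one row with only the relevant cells filled in.
    static List<String> row(String cluster, String host, String status) {
        return Arrays.asList("Hist", "VI", "", cluster, "", host, "", status);
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            row("Titan", "machineA.abc.com", "up"),
            row("", "machineB.abc.com", "down"),
            row("Goldy", "machineH.pqr.com", "up"));
        System.out.println(downHostsFor("Titan", rows)); // prints [machineB.abc.com]
    }
}
```

Because the cluster name is carried forward, the loop index never has to be adjusted by hand, and a Titan entry in the last row needs no special case.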
Answer by MaVRoSCy
What I would do in your case is first create an object for your machine with all the appropriate attributes. Then, using Jsoup, I would extract the data into an ArrayList, and then use ordinary logic to get the data out of the ArrayList.
I am skipping the object creation (since it is not the issue here) and will name the class Machine.
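For completeness, here is one way the skipped Machine class might look. This is only a sketch: the answer does not show the class, so the field set is inferred from the setters and getters used in this answer's code (setClusterName, setIp, setStatus, getStatus, getHostName), and the isDown helper is an addition of this sketch.

```java
class Machine {
    private String clusterName;
    private String ip;
    private String hostName;
    private String status;

    public String getClusterName() { return clusterName; }
    public void setClusterName(String clusterName) { this.clusterName = clusterName; }

    public String getIp() { return ip; }
    public void setIp(String ip) { this.ip = ip; }

    public String getHostName() { return hostName; }
    public void setHostName(String hostName) { this.hostName = hostName; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }

    // Convenience check (not in the original answer); safe when status is null.
    public boolean isDown() { return "down".equalsIgnoreCase(status); }

    public static void main(String[] args) {
        Machine m = new Machine();
        m.setClusterName("Titan");
        m.setIp("10.200.192.99");
        m.setHostName("machineB.abc.com");
        m.setStatus("down");
        System.out.println(m.getHostName() + " down? " + m.isDown()); // prints machineB.abc.com down? true
    }
}
```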
Then, using Jsoup, I would get the row data like this:
ArrayList<Machine> list = new ArrayList<>();
Document doc = Jsoup.parse(url, 3000);
for (Element table : doc.select("table")) { // this will work if your doc contains only one table element
    Elements rows = table.select("tr");
    for (int i = 1; i < rows.size(); i++) { // row 0 holds the column names, so skip it
        Elements tds = rows.get(i).select("td");
        Machine tmp = new Machine();
        tmp.setClusterName(tds.get(3).text());
        tmp.setIp(tds.get(4).text());
        tmp.setHostName(tds.get(5).text()); // required by getHostName() when reading the list later
        tmp.setStatus(tds.get(7).text());
        //.... and so on for the rest of attributes
        list.add(tmp);
    }
}
Then use a loop to get the values you need from the list:
for (Machine x : list) {
    if (x.getStatus().equalsIgnoreCase("up")) {
        // machine with UP status found
        System.out.println("The Machine with up status is:" + x.getHostName());
    }
}
That's all. Please also note that this code is not tested and may contain some syntax errors, as it was written directly in this editor and not in an IDE.
Answer by Rohit
Below is a clean, generic function that extracts an HTML table into a simple list-of-maps structure, with one map per row keyed by column header.
Pass the document to this function together with the table order, i.e. the zero-based position of the table you want among the tables in the HTML page.
The function will not return accurate data if the table makes use of rowspan or colspan.
public static List<Map<String,String>> parseTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder);
    Elements rows = table.select("tr");
    // The first row supplies the column names used as map keys.
    Elements first = rows.get(0).select("th,td");
    List<String> headers = new ArrayList<String>();
    for (Element header : first)
        headers.add(header.text());
    List<Map<String,String>> listMap = new ArrayList<Map<String,String>>();
    for (int row = 1; row < rows.size(); row++) {
        Elements colVals = rows.get(row).select("th,td");
        // Skip rows whose cell count does not match the header row.
        if (colVals.size() != headers.size())
            continue;
        int colCount = 0;
        // LinkedHashMap keeps the columns in table order.
        // Note: columns with identical header text share one key, so the later cell wins.
        Map<String,String> tuple = new LinkedHashMap<String,String>();
        for (Element colVal : colVals)
            tuple.put(headers.get(colCount++), colVal.text());
        listMap.add(tuple);
    }
    return listMap;
}
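Once the table is available as a list of maps, picking out the hosts that are down becomes a simple filter over the "Status" and "Host Name" keys. The sketch below is plain Java and does not call jsoup: it builds the maps by hand to stand in for the result of a function like the one above, and the names DownHostsFromMaps, downHosts and row are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DownHostsFromMaps {
    // Returns the "Host Name" value of every row whose "Status" is "down".
    static List<String> downHosts(List<Map<String, String>> rows) {
        return rows.stream()
            .filter(r -> "down".equalsIgnoreCase(r.getOrDefault("Status", "").trim()))
            .map(r -> r.getOrDefault("Host Name", "").trim())
            .collect(Collectors.toList());
    }

    // Hypothetical helper: builds one row map with only the two keys used here.
    static Map<String, String> row(String host, String status) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Host Name", host);
        m.put("Status", status);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = Arrays.asList(
            row("machineA.abc.com", "up"),
            row("machineB.abc.com", "down"));
        System.out.println(downHosts(rows)); // prints [machineB.abc.com]
    }
}
```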