使用 XML 包将 html 表抓取到 R 数据框
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/1395528/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Scraping html tables into R data frames using the XML package
提问by Eduardo Leoni
How do I scrape html tables using the XML package?
如何使用 XML 包抓取 html 表?
Take, for example, this wikipedia page on the Brazilian soccer team. I would like to read it in R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data.frame. How can I do this?
以巴西足球队的这个维基百科页面为例。我想在 R 中阅读它并获得“巴西与国际足联认可的球队进行的所有比赛的列表”表作为数据框。我怎样才能做到这一点?
采纳答案by Eduardo Leoni
…or a shorter try:
……或者更短的尝试:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
the picked table is the longest one on the page
选中的表格是页面上最长的表格
tables[[which.max(n.rows)]]
回答by Richie Cotton
library(RCurl)
library(XML)
# Download page using RCurl
# You may need to set proxy details, etc., in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})
# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]
#In this case, the required table is the only one with class "wikitable sortable"
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable <- tables[which(tableclasses=="wikitable sortable")]$table
#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))
# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
tablerow <- thetable$children[[i]]$children
opponent <- tablerow[[1]]$children[[2]]$children$text$value
others <- unname(sapply(tablerow[-1], function(x) x$children$text$value))
content <- rbind(content, c(opponent, others))
}
# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)
Edited to add:
编辑添加:
Sample output
样本输出
Opponent Played Won Drawn Lost Goals for Goals against ?% Won
1 Argentina 94 36 24 34 148 150 38.3%
2 Paraguay 72 44 17 11 160 61 61.1%
3 Uruguay 72 33 19 20 127 93 45.8%
...
回答by learnr
Another option using Xpath.
使用 Xpath 的另一种选择。
library(RCurl)
library(XML)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)
# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))
# Clean up the results
content[,1] <- gsub("??", "", content[,1])
tablehead <- gsub("??", "", tablehead)
names(content) <- tablehead
Produces this result
产生这个结果
> head(content)
Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina 94 36 24 34 148 150 38.3%
2 Paraguay 72 44 17 11 160 61 61.1%
3 Uruguay 72 33 19 20 127 93 45.8%
4 Chile 64 45 12 7 147 53 70.3%
5 Peru 39 27 9 3 83 27 69.2%
6 Mexico 36 21 6 9 69 34 58.3%
回答by Dave2e
The rvest
along with xml2
is another popular package for parsing html web pages.
The rvest
withxml2
是另一个流行的用于解析 html 网页的包。
library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file<-read_html(theurl)
tables<-html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)
The syntax is easier to use than the xml
package and for most web pages the package provides all of the options ones needs.
语法比xml
包更容易使用,对于大多数网页,包提供了所有需要的选项。