HTML: How to read and parse the contents of a webpage in R

Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/1844829/

Date: 2020-08-29 01:28:53  Source: igfitidea

How can I read and parse the contents of a webpage in R?

Tags: html, r, screen-scraping, html-content-extraction

Asked by Mark

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it.

Answered by Shane

I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions on HTML, so you will definitely want to parse this with the XML package.

Here's an example to get you started:

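library(RCurl)
library(XML)
# download the raw HTML as a single string
webpage <- getURL("http://www.haaretz.com/")
# split the downloaded string into lines
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
# build an internal document, silently ignoring errors from malformed HTML
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
# trim leading and trailing whitespace (the backreference must be written "\\1" in an R string)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
# drop empty strings and lone "|" separators
x <- x[!(x %in% c("", "|"))]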

This results in a character vector of mostly just webpage text (along with some JavaScript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time:??16:48??(EST+7)"           
[4] "????Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()" 
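
Since the result above still contains JavaScript, here is a minimal offline sketch of one way to keep script text out of the vector. It is not part of the original answer: the inline `html` string is a made-up stand-in for the downloaded page, and the XPath filter and `trimws()` are substitutions for the regex cleanup, assuming the XML package is installed.

```r
library(XML)

# Hypothetical inline page standing in for the downloaded one
html <- paste(
  "<html><body>",
  "<table><tr><td>  Subscribe to Print Edition </td></tr>",
  "<tr><td>|</td></tr>",
  "<tr><td><script>function chkSearch() {}</script>Israel Time</td></tr>",
  "</table></body></html>",
  sep = "\n"
)

# Parse the string (asText = TRUE means "this is content, not a file name")
doc <- htmlParse(html, asText = TRUE)

# Select text nodes inside tables, skipping any that sit inside <script>
x <- xpathSApply(doc, "//table//text()[not(ancestor::script)]", xmlValue)

# Trim whitespace, then drop empty strings and lone "|" separators
x <- trimws(x)
x <- x[!(x %in% c("", "|"))]
print(x)
```

Filtering in the XPath expression means the JavaScript never reaches the character vector, so no later pattern matching is needed to remove it.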

Answered by Dirk Eddelbuettel

Your best bet may be the XML package -- see for example this previous question.
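
As a rough illustration of what the XML package offers (a sketch only, not taken from the linked question), the snippet below pulls a heading and the link targets out of a small HTML string with XPath. The `html` content and the variable names are invented for the example.

```r
library(XML)

# Hypothetical snippet; a real page would be downloaded first
html <- '<html><body><h1>Headlines</h1>
<a href="/news">News</a> <a href="/opinion">Opinion</a>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Text content of the heading
title <- xpathSApply(doc, "//h1", xmlValue)

# Link text and the href attribute of every <a> element
link_text <- xpathSApply(doc, "//a", xmlValue)
links <- xpathSApply(doc, "//a", xmlGetAttr, "href")

print(title)
print(data.frame(text = link_text, href = links))
```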

Answered by Andreas

I know you asked for R. But maybe Python + BeautifulSoup is the way forward here? Then do your analysis in R once you have scraped the page with BeautifulSoup?
