Html: How to read and parse the contents of a web page in R

Question

Asked by Mark

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. How can I do that?

Answer 1

Answered by Shane

I'm not really sure how you want to process that page, because it's really messy. As we re-learned in this famous Stack Overflow question, it's not a good idea to run regular expressions on HTML, so you will definitely want to parse this with the XML package.

Here's an example to get you started:

require(RCurl)
require(XML)
webpage <- getURL("http://www.haaretz.com/")
webpage <- readLines(tc <- textConnection(webpage)); close(tc)
pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)
# parse the tree by tables
x <- xpathSApply(pagetree, "//*/table", xmlValue)  
# do some clean up with regular expressions
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t","",x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl=TRUE)
x <- x[!(x %in% c("", "|"))]

This results in a character vector that is mostly just the page's text (along with some JavaScript):

> head(x)
[1] "Subscribe to Print Edition"              "Fri., December 04, 2009 Kislev 17, 5770" "Israel Time:??16:48??(EST+7)"           
[4] "????Make Haaretz your homepage"          "/*check the search form*/"               "function chkSearch()"
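
If the stray JavaScript is a nuisance, one option is to exclude script and style text in the XPath query itself. The following is only a minimal sketch, not part of the original answer: it assumes the XML package is installed, uses a made-up inline HTML string in place of the downloaded page, and trims whitespace with base R's trimws() instead of a regular expression.

```r
library(XML)

# Hypothetical stand-in for the downloaded page (not the real haaretz.com markup)
html <- paste0(
  "<html><body><table><tr>",
  "<td>  Subscribe to Print Edition  </td>",
  "<td><script>function chkSearch() {}</script>Make Haaretz your homepage</td>",
  "</tr></table></body></html>"
)

# Parse the string as HTML into an internal document
doc <- htmlParse(html, asText = TRUE)

# Select text nodes inside tables, skipping anything under <script> or <style>
txt <- xpathSApply(
  doc,
  "//table//text()[not(ancestor::script) and not(ancestor::style)]",
  xmlValue
)

# Trim surrounding whitespace and drop empty strings
txt <- trimws(txt)
txt <- txt[nzchar(txt)]

print(txt)
```

On a real page the same XPath expression could be passed to xpathSApply() with the pagetree object from the example above.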

Answer 2

Answered by Dirk Eddelbuettel

Your best bet may be the XML package -- see for example this previous question.

Answer 3

Answered by Andreas

I know you asked for R, but maybe Python + BeautifulSoup is the way forward here? You could then do your analysis in R once you have scraped the page with BeautifulSoup.
