Scrape password-protected website in R
Original question: http://stackoverflow.com/questions/24723606/
This page reproduces a Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow and keep the same license.
Asked by itpetersen
I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).
The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2
Here are my two attempts (replacing "username" with my username and "password" with my password):
#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))
#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))
I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.
How to use R to download a zipped file from a SSL page that requires cookies
How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?
Reading information from a password protected site
R - RCurl scrape data from a password-protected site
http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold
Accepted answer by Stefan
I don't have an account to test with, but maybe this will work:
library(httr)
library(XML)
handle <- handle("http://subscribers.footballguys.com")
path <- "amember/login.php"
# Fields found in the login form.
login <- list(
  amember_login = "username",
  amember_pass = "password",
  amember_redirect_url =
    "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)
response <- POST(handle = handle, path = path, body = login)
Now the response object might hold what you need. Alternatively, you may be able to query the page of interest directly after the login request; I am not sure the redirect will work, but it is a field in the web form. The handle might be reused for subsequent requests. I can't test it, but this approach works for me in many situations.
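Reusing the handle for a follow-up request could look roughly like the sketch below. It is untested against the real site and rests on several assumptions: that the login POST stores a session cookie on the handle, that the form field names are the ones listed in this answer, and that the page of interest is reachable with a plain GET afterwards. `fetch_projections` is a hypothetical helper name, not part of httr.

```r
library(httr)
library(XML)

# Hypothetical helper: log in once, then reuse the same handle (and the
# cookies it carries) to request the protected page.
fetch_projections <- function(username, password,
                              base_url = "http://subscribers.footballguys.com") {
  h <- handle(base_url)

  # Step 1: submit the login form (field names taken from the answer above).
  login <- POST(
    handle = h,
    path = "amember/login.php",
    body = list(amember_login = username, amember_pass = password)
  )
  stop_for_status(login)

  # Step 2: request the page of interest on the same handle.
  page <- GET(
    handle = h,
    path = "myfbg/myviewprojections.php",
    query = list(projector = 2)
  )
  stop_for_status(page)

  # Parse the body explicitly as HTML text, then extract every table.
  doc <- htmlParse(content(page, as = "text"), asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}
```

Calling `fetch_projections("username", "password")` should return a list of data frames, one per table on the page. If the list only contains the public preview, the login step most likely did not succeed, since a 200 status alone does not prove that the session is authenticated.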
You can output the table using the XML package:
> readHTMLTable(content(response))[[1]][1:5,]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
Answer by jdharrison
You can use RSelenium. I have used the dev version, as it lets you run phantomjs without a Selenium Server.
# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
library(XML) # provides readHTMLTable(), used below
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()
appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem <- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
> res[[1]][1:5, ]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
Finally, when you are finished, close phantomjs:
pJS$stop()
If you would rather use a traditional browser such as Firefox (for example, to stick with the RSelenium version on CRAN), you would use:
RSelenium::startServer()
remDr <- remoteDriver()
........
........
remDr$closeServer()
in place of the related phantomjs calls.

