How to get the HTML source of a webpage in Ruby

Notice: this page is derived from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/4217223/

Date: 2020-08-29 05:20:24  Source: igfitidea

How to get the HTML source of a webpage in Ruby

Tags: html, ruby

Asked by Eric

In browsers such as Firefox or Safari, with a website open, I can right click the page, and select something like: "View Page Source" or "View Source." This shows the HTML source for the page.

In Ruby, is there a function (maybe a library) that allows me to store this HTML source as a variable? Something like this:

source = view_source('http://stackoverflow.com')

where source would be this text:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Stack Overflow</title>
etc

Answered by robbrit

Use Net::HTTP:

require 'net/http'

source = Net::HTTP.get('stackoverflow.com', '/index.html')
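
Net::HTTP.get performs a single request and does not follow redirects, so a redirecting URL returns the redirect response rather than the page. The following is a minimal sketch, not part of the original answer, that follows redirects with Net::HTTP.get_response. The helper name fetch_html and the default limit of 5 are assumptions made for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Net::HTTP): returns the body of +url+,
# following at most +limit+ redirects.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI(url)
  # With a URI argument, get_response enables TLS for https URLs.
  response = Net::HTTP.get_response(uri)

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URI.
    fetch_html(URI.join(uri, response['location']).to_s, limit - 1)
  else
    response.value # raises an exception for other (e.g. 4xx/5xx) responses
  end
end
```

A call such as source = fetch_html('https://stackoverflow.com/') would then store the page in a variable, as the question asks.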

Answered by Nakilon

require 'open-uri'
# Ruby 3.0+ no longer lets Kernel#open fetch URLs; use URI.open instead.
source = URI.open(url) { |f| f.read }

Update: a more modern syntax:

require 'open-uri'
source = URI.open(url, &:read)

Answered by Matt Rose

require 'open-uri'
source = URI.open(url).read

Short, simple, sweet.

Answered by Skilldrick

Yes, like this:

require 'open-uri'

# On Ruby 3.0+, URLs must be opened with URI.open rather than Kernel#open.
URI.open('http://stackoverflow.com') do |file|
  # Use the source, Eric,
  # e.g. file.each_line { |line| puts line }
end
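
As a hedged illustration of what can be done inside that block, the sketch below reads the source together with some response metadata that open-uri attaches to the opened file (status and content_type). The helper name page_info and the shape of the returned hash are assumptions, not something from the original answer.

```ruby
require 'open-uri'

# Hypothetical helper: fetches +url+ and returns the source along with
# metadata that open-uri exposes on the opened file.
def page_info(url)
  URI.open(url) do |file|
    {
      status: file.status.first,       # status code as a string
      content_type: file.content_type, # media type without parameters
      source: file.read
    }
  end
rescue OpenURI::HTTPError => e
  # open-uri raises this for 4xx/5xx responses; e.io holds the failed response.
  { status: e.io.status.first, content_type: nil, source: nil }
end
```

Unlike a bare Net::HTTP.get call, open-uri follows redirects on its own, so the block receives the final page.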

Answered by Josh Lee

You could use the built-in Net::HTTP:

>> require 'net/http'
>> Net::HTTP.get 'stackoverflow.com', '/'

Or one of the several libraries suggested in "Equivalent of cURL for Ruby?".

Answered by Beanish

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://google.com/')

puts page.body

You can then do a lot of other cool stuff with Mechanize as well.

Answered by Topher Fangio

Another thing you might be interested in is Nokogiri. It is a parser for HTML, XML, and similar formats that is very easy to use. Its front page has some example code that should get you started and help you see whether it fits what you need.

Answered by Phrogz

If you have cURL installed, you could simply run:

url = 'http://stackoverflow.com'
html = `curl #{url}`
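
Backticks hand the whole string to a shell, so a URL containing characters such as & or ; could be read as shell syntax. The sketch below is one possible way around that, using Open3.capture2 from the standard library to pass the URL as a separate argument. The helper name curl_source and the curl_bin parameter are hypothetical.

```ruby
require 'open3'

# Hypothetical helper: runs curl without a shell, so the URL is passed
# through as a single argument and never parsed as shell syntax.
# +curl_bin+ is an assumed parameter naming the executable to run.
def curl_source(url, curl_bin: 'curl')
  html, status = Open3.capture2(curl_bin, '--silent', '--location', '--fail', url)
  raise "#{curl_bin} exited with status #{status.exitstatus}" unless status.success?

  html
end
```

Here --location makes curl follow redirects and --fail makes it exit with a non-zero status on HTTP errors, which the helper turns into an exception.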

If you want to use pure Ruby, look at the Net::HTTP library:

require 'net/http'
stack = Net::HTTP.new 'stackoverflow.com'
# ...later...
page = '/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby'
html = stack.get(page).body
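
When several pages are needed from the same site, the same idea can be wrapped in Net::HTTP.start, which opens the connection once and closes it when the block ends. This is a rough sketch under that assumption; the helper name fetch_pages and its arguments are made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetches several paths from one site within a single
# Net::HTTP session and returns a hash of path => response body.
def fetch_pages(base_url, paths)
  uri = URI(base_url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    paths.each_with_object({}) do |path, pages|
      pages[path] = http.get(path).body # no redirect handling or status check here
    end
  end
end
```

Note that http.get returns the response as-is, so a redirect or error page body ends up in the hash unless the response class is checked first.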