Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/3533528/

Date: 2020-08-18 11:36:44  Source: igfitidea

Python Web Crawlers and "getting" html source code

Tags: python, get, web-crawler

Asked by Dan

My brother wants me to write a web crawler in Python (I'm self-taught); I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few problems. 1. httplib.HTTPConnection and the request concept are new to me, and I don't understand whether it downloads an HTML script, like a cookie, or an instance. If you do both of those, do you get the source for a website page? Also, what are some terms I would need to know to modify the page and return the modified page?

Just for background, I need to download a page and replace any img with ones I have.

It would also be nice if you could tell me your opinion of Python 2.7 versus 3.1.

Accepted answer by leoluk

Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.)

I recommend using the stdlib module urllib2; it lets you fetch web resources comfortably. Example:

import urllib2

response = urllib2.urlopen("http://google.de")
page_source = response.read()

To parse the HTML, have a look at BeautifulSoup.
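
For the image-replacement task from the question, here is a minimal sketch, assuming Python 3. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without installing anything. The names ImgSwapper, replace_images, and mine.png are made up for illustration, and the re-emitted markup is normalized (for example, attribute quoting), so treat it as a starting point rather than a full-fidelity HTML serializer.

```python
from html import escape
from html.parser import HTMLParser


class ImgSwapper(HTMLParser):
    """Re-emits the HTML it is fed, pointing every <img src=...> at new_src."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps entities such as &amp; untouched in text
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _tag(self, tag, attrs, closing=""):
        if tag == "img":
            # Only an existing src attribute is rewritten; none is added.
            attrs = [(k, self.new_src if k == "src" else v) for k, v in attrs]
        parts = [tag]
        for k, v in attrs:
            # Attribute values arrive unescaped, so escape them again.
            parts.append(k if v is None else '%s="%s"' % (k, escape(v, quote=True)))
        return "<%s%s>" % (" ".join(parts), closing)

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def replace_images(html, new_src):
    """Return html with the src of every img tag replaced by new_src."""
    parser = ImgSwapper(new_src)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p>Hi &amp; bye <img src="old.png" alt="x"><br/></p>'
print(replace_images(page, "mine.png"))
```

In a crawler, the page string would come from the downloaded page source instead of the hard-coded sample.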

BTW: what exactly do you want to do:

Just for background, I need to download a page and replace any img with ones I have.

Edit: It's 2014 now; most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.

Answered by Jim Garrison

The first thing you need to do is read the HTTP spec, which explains what you can expect to receive over the wire. The data returned in the response body will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, in short just about anything, and you have no access to it. You only get the HTML that the server sent you. For a static HTML page, yes, you will be seeing the "source". For anything else, you see the generated HTML, not the source.
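
To make this concrete, here is a small sketch, assuming Python 3 and only the standard library. A toy local server builds its HTML in code, standing in for a JSP or CGI script, and the client receives only the generated markup, never the handler's source. GeneratedPage and fetch_generated_html are hypothetical names chosen for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class GeneratedPage(BaseHTTPRequestHandler):
    """Stands in for a JSP, servlet, or CGI script: builds HTML per request."""

    def do_GET(self):
        # Server-side logic; the client never sees this code.
        items = "".join("<li>item %d</li>" % n for n in range(3))
        body = ("<html><body><ul>%s</ul></body></html>" % items).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_generated_html():
    """Start the toy server on a free local port, fetch the page, shut down."""
    server = HTTPServer(("127.0.0.1", 0), GeneratedPage)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = request.build_opener(request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


status, html = fetch_generated_html()
print(status)
print(html)
```

The printed body contains the finished list markup, but nothing of the loop that produced it, which is the distinction between generated HTML and source drawn above.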

When you say "modify the page and return the modified page", what do you mean?

Answered by Timo

An example with python3 and the requests library, as mentioned by @leoluk:

pip install requests

Script req.py:

import requests

url = 'http://localhost'

# in case you need a session cookie
cd = {'sessionid': '123..'}

r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)

# r.content holds the raw bytes; r.text is the decoded HTML
print(r.text)

Now execute it and you will get the HTML source of localhost:

python3 req.py

Answered by Caner

If you are using Python 3.x, you don't need to install any libraries; this is built into the standard library. The old urllib2 module was merged into the urllib package (as urllib.request and urllib.error):

from urllib import request

response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
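
The comment in the snippet above says to set the correct charset by hand. As a hedged alternative, the sketch below reads the charset from the response's Content-Type header and falls back to UTF-8 when the server does not declare one. The helper name fetch_text and its fallback parameter are made up for illustration.

```python
from urllib import request


def fetch_text(url, fallback="utf-8"):
    """Download url and decode the body with the charset the server declared."""
    with request.urlopen(url) as response:
        raw = response.read()
        # "Content-Type: text/html; charset=ISO-8859-1" gives "iso-8859-1";
        # the result is None when the header carries no charset parameter.
        charset = response.headers.get_content_charset() or fallback
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(charset, errors="replace")
```

Calling fetch_text("https://www.google.com") would then return the page as a str. A page that declares its encoding only in a meta tag, not in the header, would still be decoded with the fallback in this sketch.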