"SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/ with Python
Disclaimer: this page is a Chinese-English translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/34503206/
"SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
Asked by Bill Jenkins
So I started learning Python recently using "The New Boston's" videos on YouTube, and everything was going great until I got to his tutorial on making a simple web crawler. While I understood it with no problem, when I run the code I get errors that all seem to be based around "SSL: CERTIFICATE_VERIFY_FAILED." I've been searching for an answer since last night trying to figure out how to fix it; it seems no one else in the comments on the video or on his website is having the same problem as me, and even using someone else's code from his website I get the same results. I'll post the code I got from the website, as it's giving me the same error, and the one I coded is a mess right now.
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page)  # this is the page of popular posts
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easily
        soup = BeautifulSoup(plain_text, "html.parser")  # build the soup from the page HTML
        for link in soup.findAll('a', {'class': 'index_singleListingTitles'}):  # all links with class='index_singleListingTitles'
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string  # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1

trade_spider(1)
The full error is: ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
I apologize if this is a dumb question. I'm still new to programming, but I seriously can't figure this out. I was thinking about just skipping this tutorial, but it's bothering me not being able to fix this. Thanks!
Accepted answer by Steffen Ullrich
The problem is not in your code but in the web site you are trying to access. When looking at the analysis by SSLLabs you will note:
This server's certificate chain is incomplete. Grade capped to B.
This means that the server configuration is wrong and that not only Python but several other clients will have problems with this site. Some desktop browsers work around this configuration problem by trying to load the missing certificates from the internet or by filling in with cached certificates. But other browsers or applications will fail too, similar to Python.
To work around the broken server configuration you might explicitly extract the missing certificates and add them to your trust store, or you might give the certificate as trust inside the verify argument. From the documentation:
You can pass verify the path to a CA_BUNDLE file or directory with certificates of trusted CAs:
>>> requests.get('https://github.com', verify='/path/to/certfile')
This list of trusted CAs can also be specified through the REQUESTS_CA_BUNDLE environment variable.
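A minimal sketch of that approach (not from the answer itself; the bundle path is a placeholder, and you would first have to obtain the intermediate certificate the server fails to send, e.g. from the issuing CA, and save it together with the trusted root into a PEM file):

import requests

# Placeholder path: a PEM file containing the full trusted chain for
# www.thenewboston.com, including the intermediate the server does not send.
ca_bundle = "/path/to/thenewboston-chain.pem"

url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1"
response = requests.get(url, verify=ca_bundle)  # verify against the explicit bundle
print(response.status_code)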
Answer by NuclearPeon
I'm posting this as an answer because I've gotten past your issue thus far, but there are still issues in your code (which, when fixed, I can update).
So, long story short: you could be using an old version of requests, or the SSL certificate could be invalid. There's more information in this SO question: Python requests "certificate verify failed"
I've updated the code into my own bsoup.py file:
#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=" + str(page)  # this is the page of popular posts
        source_code = requests.get(url, timeout=5, verify=False)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easily
        for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}):  # all links with class='index_singleListingTitles'
            href = "https://www.thenewboston.com/" + link.get('href')
            title = link.string  # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1

if __name__ == "__main__":
    trade_spider(1)
When I run the script, it gives me this error:
https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1
Traceback (most recent call last):
  File "./bsoup.py", line 26, in <module>
    trade_spider(1)
  File "./bsoup.py", line 16, in trade_spider
    for link in BeautifulSoup.findAll('a', {'class': 'index_singleListingTitles'}): #all links with class='index_singleListingTitles'
  File "/usr/local/lib/python3.4/dist-packages/bs4/element.py", line 1256, in find_all
    generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'
There's an issue somewhere with your findAll method. I've used both python3 and python2, wherein python2 reports this:
TypeError: unbound method find_all() must be called with BeautifulSoup instance as first argument (got str instance instead)
So it looks like you'll need to fix up that method before you can continue.
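A rough sketch of what that fix might look like (my own guess, not part of the answer; it assumes the built-in html.parser): build a BeautifulSoup instance from plain_text and call findAll on that instance instead of on the class itself:

soup = BeautifulSoup(plain_text, "html.parser")  # parse the downloaded page
for link in soup.findAll('a', {'class': 'index_singleListingTitles'}):
    href = "https://www.thenewboston.com/" + link.get('href')
    print(href, link.string)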
Answer by mattexx
You can tell requests not to verify the SSL certificate:
>>> url = "https://www.thenewboston.com/forum/category.php?id=15&orderby=recent&page=1"
>>> response = requests.get(url, verify=False)
>>> response.status_code
200
See more in the requests docs.
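One caveat worth noting (not in the answer itself): with verify=False, urllib3, which requests uses under the hood, emits an InsecureRequestWarning on every request. A small sketch of silencing it, if you knowingly accept the risk of skipping verification:

import urllib3
import requests

# Acknowledge and silence the warning that unverified HTTPS requests trigger
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get("https://www.thenewboston.com/", verify=False)
print(response.status_code)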
Answer by markhor
You are probably missing the stock certificates in your system. E.g. if running on Ubuntu, check that the ca-certificates package is installed.
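If you're unsure which CA bundle your Python is actually using, a quick check (this assumes the certifi package, which requests depends on, is installed):

import ssl
import certifi

# Where the standard-library ssl module looks for system certificates
print(ssl.get_default_verify_paths())
# The CA bundle file that requests/certifi uses by default
print(certifi.where())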
Answer by amitnair92
If you want to use the Python dmg installer, you also have to read Python 3's ReadMe and run the bash command to get new certificates.
Try running
/Applications/Python\ 3.6/Install\ Certificates.command
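As a quick sanity check afterwards (just an illustration; any HTTPS site will do), certificate verification should now succeed without an SSLError:

import requests

# If the certificates were installed correctly, this no longer raises
# requests.exceptions.SSLError.
print(requests.get("https://www.python.org").status_code)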
Answer by Kent
I spent several hours trying to fix some Python and update certs on a VM. In my case I was working against a server that someone else had set up. It turned out that the wrong cert had been uploaded to the server. I found this command on another SO answer.
root@ubuntu:~/cloud-tools# openssl s_client -connect abc.def.com:443
CONNECTED(00000005)
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 OU = Domain Control Validated, CN = abc.def.com
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:OU = Domain Control Validated, CN = abc.def.com
   i:C = US, ST = Arizona, L = Scottsdale, O = "GoDaddy.com, Inc.", OU = http://certs.godaddy.com/repository/, CN = Go Daddy Secure Certificate Authority - G2