Python TypeError: must be str, not NoneType

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/43566543/

Date: 2020-08-19 23:11:30

TypeError: must be str, not NoneType

python

Asked by Dylan Boyd

I'm writing my first "real" project, a web crawler, and I don't know how to fix this error. Here's my code:


import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    page = 1
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)
    page += 1

main_spider(1)

Here's the error:


href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
TypeError: must be str, not NoneType  

Answered by Hymanywathy

The first "a" link on the wikipedia page is


<a id="top"></a>

Therefore, link.get("href") will return None, as there is no href attribute.


To fix this, check for None first:


if link.get('href') is not None:
    href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
    # do stuff here

Answered by MSeifert

Not all anchors (<a> elements) need to have an href attribute (see https://www.w3schools.com/tags/tag_a.asp):


In HTML5, the tag is always a hyperlink, but if it has no href attribute, it is only a placeholder for a hyperlink.

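A tag's .get() in BeautifulSoup follows the same convention as Python's dict.get(): a missing key returns None instead of raising. A minimal sketch of that behavior, using a plain dict to stand in for the tag's attributes:

```python
# A tag's attributes behave like a dict; .get() on a missing key gives None.
attrs = {"id": "top"}            # the attributes of <a id="top"></a>

print(attrs.get("href"))         # None - no href attribute present
print(attrs.get("href", ""))     # "" - supplying a default sidesteps None
```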

Actually, you already got the exception, and Python is great at handling exceptions, so why not catch it? This style is called "Easier to ask for forgiveness than permission" (EAFP) and is actually encouraged:


import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            # The following part is new:
            try:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
            except TypeError:
                pass

main_spider(1)

Also, the page = 1 and page += 1 lines can be omitted. The for page in range(1, max_pages+1): loop is already sufficient here.

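As a quick illustration of why the manual counter is redundant:

```python
# range(1, max_pages + 1) yields 1, 2, ..., max_pages on its own,
# so page = 1 before the loop and page += 1 after it add nothing.
max_pages = 3
pages = list(range(1, max_pages + 1))
print(pages)  # [1, 2, 3]
```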

Answered by E. Ducateme

As noted by @Shiping, your code is not indented properly; I corrected it below. Also, link.get('href') is not returning a string in one of the cases.


import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"): 

            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)

main_spider(1)

For purposes of evaluating what was happening, I added several lines of code between several of your existing lines and removed the offending line (for the time being).


        soup = BeautifulSoup(plain_text, "html.parser")
        print('All anchor tags:', soup.findAll('a'))     ### ADDED
        for link in soup.findAll("a"): 
            print(type(link.get("href")), link.get("href"))  ### ADDED

The result of my additions was this (truncated for brevity). NOTE: the first anchor does NOT have an href attribute, so link.get('href') can't return a value and returns None:


[<a id="top"></a>, <a href="#mw-head">navigation</a>, 
<a href="#p-search">search</a>, 
<a href="/wiki/Special:SiteMatrix" title="Special:SiteMatrix">sister...   
<class 'NoneType'> None
<class 'str'> #mw-head
<class 'str'> #p-search
<class 'str'> /wiki/Special:SiteMatrix
<class 'str'> /wiki/File:Wiktionary-logo-v2.svg      
...

To prevent the error, a possible solution would be to add a conditional or a try/except expression to your code. I'll demo a conditional expression.


        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"): 
            if link.get('href') is None:
                continue
            else:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href) 
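A side note beyond the None fix: even the real href values in that output ("#mw-head", "/wiki/...") are relative, so plain string concatenation builds broken URLs. The standard library's urllib.parse.urljoin resolves them against the page URL; a small sketch:

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Star_Wars"

# Absolute paths replace everything after the host...
print(urljoin(base, "/wiki/Special:SiteMatrix"))
# https://en.wikipedia.org/wiki/Special:SiteMatrix

# ...while fragments attach to the current page.
print(urljoin(base, "#mw-head"))
# https://en.wikipedia.org/wiki/Star_Wars#mw-head
```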

Answered by dhhepting

I had the same error in different code. After adding a conditional inside a function, I thought the return type was not being set properly, but what I realized was that when the condition was False, the return statement was not being called at all; a change to my indentation fixed the problem.

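A minimal sketch of that pitfall (with a hypothetical function): when the return sits inside the conditional, the False branch falls off the end of the function, which implicitly returns None and triggers the same TypeError on concatenation:

```python
def sign_label(x):
    if x > 0:
        return "positive"
    # when x <= 0, execution falls off the end -> implicit return None

print(sign_label(1))    # positive
print(sign_label(-1))   # None
# "x is " + sign_label(-1) would raise:
# TypeError: must be str, not NoneType
```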