Fetch all href links using Selenium in Python

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/34759787/

Tags: python, selenium, selenium-webdriver, web-scraping

Asked by Xonshiz

I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium.

For example, I want all the links in the href attribute of all the <a> tags on http://psychoticelites.com/.

I've written a script and it is working, but it gives me the object address. I've tried using the id tag to get the value, but it doesn't work.

My current script:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys


driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

assert "Psychotic" in driver.title

continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)

Accepted answer by JRodDynamite

Well, you simply have to loop through the list:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

find_elements_by_* returns a list of elements (note the spelling of 'elements'). Loop through the list, take each element and fetch the required attribute value from it (in this case, href).
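
As an aside, newer Selenium releases (4.x) deprecate the find_elements_by_* helpers in favor of find_elements with a By locator. A minimal equivalent sketch, assuming Selenium 4+:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")

# find_elements returns a (possibly empty) list of matching elements
for elem in driver.find_elements(By.XPATH, "//a[@href]"):
    print(elem.get_attribute("href"))

driver.quit()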

Answered by Python_Novice

You can parse the HTML DOM using the htmldom library in Python. You can find it here and install it using pip:

https://pypi.python.org/pypi/htmldom/2.0

from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")  
dom = dom.createDom()

The above code creates an HtmlDom object. HtmlDom takes one parameter, the URL of the page. Once the DOM object is created, you need to call the "createDom" method of HtmlDom. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.

You can query elements using the "find" method of the HtmlDom object:

p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))

The above code will print all the links/URLs present on the web page.

Answered by Shawn

You can try something like:

links = driver.find_elements_by_partial_link_text('')
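
Note that find_elements_by_partial_link_text matches on the visible link text, so an empty string should match anchors that have text, but it can miss anchors without any text, and the matches are elements, not URLs. To get the actual links you would still read each element's href attribute; a sketch continuing from the line above:

for link in links:
    print(link.get_attribute('href'))  # the element itself is not a URL; read its href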

Answered by Anupriya Nishad

import requests
from selenium import webdriver
import bs4

driver = webdriver.Chrome(r'C:\chromedrivers\chromedriver')  # enter the path; note the driver is not actually used below
data = requests.request('get', 'https://google.co.in/')  # any website
s = bs4.BeautifulSoup(data.text, 'html.parser')
for link in s.findAll('a'):
    print(link.get('href'))  # print the href value instead of the whole <a> tag
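
If you only want anchors that actually carry an href (and avoid printing None for bare <a> tags), BeautifulSoup can filter on the attribute directly. A small variant of the loop above, assuming BeautifulSoup 4:

# href=True keeps only <a> tags that have an href attribute
for link in s.find_all('a', href=True):
    print(link['href'])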

Answered by Gabriel Chung

I have checked and tested that there is a function named find_elements_by_tag_name() that you can use. This example works fine for me:

elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)