Python: Decomposing HTML to link text and target

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/285938/


Decomposing HTML to link text and target

python, html, regex, beautifulsoup

Asked by sundeep

Given an HTML link like


<a href="urltxt" class="someclass" close="true">texttxt</a>

how can I isolate the url and the text?


Updates


I'm using Beautiful Soup, and am unable to figure out how to do that.


I did


soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))

links = soup.findAll('a')

for link in links:
    print "link content:", link.content," and attr:",link.attrs

I get


link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]
...

Why am I missing the content?


edit: elaborated on 'stuck' as advised :)


Answered by Harley Holcombe

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.


EDIT:


I think you want:


soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL inline like that; if it goes wrong, it could get ugly.

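For what it's worth, here is a minimal sketch of that separation (fetch first, handle failure, then parse), assuming the Python 2 / urllib / BeautifulSoup 3 setup from the question; the URL is only a placeholder:

import urllib
import BeautifulSoup

url = "http://www.example.com/index.html"  # placeholder

try:
    source = urllib.urlopen(url).read()  # fetch first, so a network failure is easy to handle
except IOError, e:
    print "could not fetch %s: %s" % (url, e)
else:
    soup = BeautifulSoup.BeautifulSoup(source)
    print len(soup.findAll('a')), "links found"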

EDIT 2:


This should show you all the links in a page:


import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # <a> tag with no href attribute -- not a valid link
        pass
    else:
        print link

Answered by Jerub

Here's a code example showing how to get the attributes and contents of the links:


soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
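
If the goal is to isolate just the URL and the text (rather than dumping attrs and contents wholesale), here is a sketch along the same lines, assuming BeautifulSoup 3 as above; link.string only works when the tag contains a single text node, so joining all the text nodes is the more general route:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
for link in soup.findAll('a'):
    href = link.get('href')                  # None if the tag has no href attribute
    text = ''.join(link.findAll(text=True))  # all text nodes inside the <a>, tags dropped
    print "url:", href, "text:", text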

Answered by Tom

Looks like you have two issues there:


  1. link.contents, not link.content
  2. attrs is a dictionary, not a string. It holds key-value pairs for each attribute in an HTML element. link.attrs['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute (see the sketch after this list).
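
A small sketch of that check, assuming the BeautifulSoup 3 setup from the question; note that in the output shown there, attrs comes back as a list of (name, value) pairs, so indexing the tag itself is the simpler way to read an attribute:

for link in soup.findAll('a'):
    if link.has_key('href'):    # skip <a> tags without an href
        print "url:", link['href'], "text:", link.contents
    else:
        print "no href on:", link.contents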

Answered by nickf

Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.


/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/

Here's what it matches:


'<a href="url" close="true">text</a>'
// Parts: "url", "text"

'<a href="url" close="true">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"

If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.

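Here is a sketch of that approach with Python's re module, including the second pass that strips anything between angle brackets; the pattern and flags are only illustrative, and (as noted above) a real parser is the safer choice for messy HTML:

import re

link_re = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>', re.IGNORECASE | re.DOTALL)
tag_re = re.compile(r'<[^>]+>')  # strips anything between angle brackets

html = '<a href="url" close="true">text<span>something</span></a>'
for href, inner in link_re.findall(html):
    print "url:", href, "text:", tag_re.sub('', inner)
# prints: url: url text: textsomething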