Python: Decomposing HTML to link text and target

Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/285938/


Decomposing HTML to link text and target

python, html, regex, beautifulsoup

Asked by sundeep

Given an HTML link like


<a href="urltxt" class="someclass" close="true">texttxt</a>

how can I isolate the url and the text?


Updates


I'm using Beautiful Soup, and am unable to figure out how to do that.


I did


soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))

links = soup.findAll('a')

for link in links:
    print "link content:", link.content," and attr:",link.attrs

I get


link content: None  and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root    /support.asp')]
...

Why am I missing the content?


edit: elaborated on 'stuck' as advised :)


Answered by Harley Holcombe

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.


EDIT:


I think you want:


soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())

By the way, it's a bad idea to open the URL inline like that; if it goes wrong, it could get ugly.

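For what it's worth, here is a minimal sketch of that separation (fetch first, handle failure, then parse), assuming the Python 2 / urllib / BeautifulSoup 3 setup from the question; the URL is only a placeholder:

import urllib
import BeautifulSoup

url = "http://www.example.com/index.html"  # placeholder

try:
    source = urllib.urlopen(url).read()  # fetch first, so a network failure is easy to handle
except IOError, e:
    print "could not fetch %s: %s" % (url, e)
else:
    soup = BeautifulSoup.BeautifulSoup(source)
    print len(soup.findAll('a')), "links found"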

EDIT 2:


This should show you all the links in a page:


import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)

for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # <a> tag with no href attribute -- not a valid link
        pass
    else:
        print link

Answered by Jerub

Here's a code example showing how to get the attributes and contents of the links:


soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
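
If the goal is to isolate just the URL and the text (rather than dumping attrs and contents wholesale), here is a sketch along the same lines, assuming BeautifulSoup 3 as above; link.string only works when the tag contains a single text node, so joining all the text nodes is the more general route:

soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
for link in soup.findAll('a'):
    href = link.get('href')                  # None if the tag has no href attribute
    text = ''.join(link.findAll(text=True))  # all text nodes inside the <a>, tags dropped
    print "url:", href, "text:", text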

Answered by Tom

Looks like you have two issues there:


  1. link.contents, not link.content
  2. attrs is a dictionary, not a string. It holds key-value pairs for each attribute in an HTML element. link.attrs['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute (see the sketch after this list).
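
A small sketch of that check, assuming the BeautifulSoup 3 setup from the question; note that in the output shown there, attrs comes back as a list of (name, value) pairs, so indexing the tag itself is the simpler way to read an attribute:

for link in soup.findAll('a'):
    if link.has_key('href'):    # skip <a> tags without an href
        print "url:", link['href'], "text:", link.contents
    else:
        print "no href on:", link.contents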

Answered by nickf

Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.


/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/

Here's what it matches:


'<a href="url" close="true">text</a>'
// Parts: "url", "text"

'<a href="url" close="true">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"

If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.

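Here is a sketch of that approach with Python's re module, including the second pass that strips anything between angle brackets; the pattern and flags are only illustrative, and (as noted above) a real parser is the safer choice for messy HTML:

import re

link_re = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>', re.IGNORECASE | re.DOTALL)
tag_re = re.compile(r'<[^>]+>')  # strips anything between angle brackets

html = '<a href="url" close="true">text<span>something</span></a>'
for href, inner in link_re.findall(html):
    print "url:", href, "text:", tag_re.sub('', inner)
# prints: url: url text: textsomething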