Python: Getting an attribute value with BeautifulSoup

Question

Asked by aditya.gupta

I'm writing a Python script that parses a web page and extracts the locations of its scripts. There are two scenarios:

<script type="text/javascript" src="http://example.com/something.js"></script>

and

<script>some JS</script>

I'm able to get the JS in the second scenario, that is, when the JS is written inside the tags.

But is there any way to get the value of src in the first scenario (i.e., extract the src attribute value of every script tag, such as http://example.com/something.js)?

Here's my code:

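#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    # Prints each whole <script> tag, not just its src value
    print(n)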

Output: some JS

Desired output: http://example.com/something.js

Answer 1

Accepted answer by Venkateshwaran Selvaraj

This gets the src values only where they are present; any <script> tag without a src attribute is skipped.

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# {"src": True} matches only <script> tags that carry a src attribute
sources = soup.find_all('script', {"src": True})
for source in sources:
    print(source['src'])

I get the following two src values as a result:

http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js

I guess this is what you want. Hope this is useful.
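The same attribute filter can be tried without a network request. The following is a minimal sketch, assuming Python 3 and the built-in "html.parser" backend; `SAMPLE_HTML` and `external_script_urls` are hypothetical names introduced here for illustration, and the markup is the pair of tags from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the two scenarios from the question.
SAMPLE_HTML = """
<script type="text/javascript" src="http://example.com/something.js"></script>
<script>some JS</script>
"""


def external_script_urls(html):
    """Return the src value of every <script> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only tags on which the attribute is present.
    return [tag["src"] for tag in soup.find_all("script", src=True)]


print(external_script_urls(SAMPLE_HTML))
# → ['http://example.com/something.js']
```

Passing `src=True` as a keyword argument is equivalent to the `{"src": True}` dictionary form used above; the inline `<script>` is skipped because it has no src attribute.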

Answer 2

Answer by rajpy

Get 'src' from the script node.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    # n.get('src') returns None for tags that have no src attribute
    print("src:", n.get('src'))
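Because this loop visits every script tag, inline scripts are reported as well, with a value of None. The sketch below, assuming Python 3 and the "html.parser" backend, contrasts `.get()` with subscript access on a tag that has no src attribute; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<script>some JS</script>", "html.parser")
inline_tag = soup.find("script")

# .get() returns None (or a supplied default) when the attribute is missing.
missing = inline_tag.get("src")
with_default = inline_tag.get("src", "")

# Subscript access raises KeyError for a missing attribute.
try:
    inline_tag["src"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(missing, repr(with_default), raised_key_error)
# → None '' True
```

Use `.get()` when a missing attribute is an expected case, and subscript access when its absence should be treated as an error.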

Answer 3

Answer by Ashok Fernandez

This should work: find all the script tags, then check whether each one has a 'src' attribute. If it does, the URL of the JavaScript is in the src attribute; otherwise, we assume the JavaScript is inline within the tag.

#!/usr/bin/env python3

from bs4 import BeautifulSoup

# Test HTML which has both cases
html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script>  <script>some JS</script>'

soup = BeautifulSoup(html, "html.parser")

# Find all script tags
for n in soup.find_all('script'):

    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']

    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = n.text

    print(javascript)

The output of this is:

http://example.com/something.js
some JS
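The loop above prints URLs and inline code in one stream, so the two cannot be told apart afterwards. One possible variation, sketched here under the assumption of Python 3 and the "html.parser" backend, collects them into separate lists; `split_scripts` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def split_scripts(html):
    """Return (external_urls, inline_sources) for the <script> tags in html."""
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        if tag.has_attr("src"):
            # External script: keep the URL from the src attribute.
            external.append(tag["src"])
        else:
            # Inline script: .string is None for an empty tag, so fall back to "".
            inline.append(str(tag.string or ""))
    return external, inline


html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script>  <script>some JS</script>'

urls, sources = split_scripts(html)
print(urls)     # → ['http://example.com/something.js']
print(sources)  # → ['some JS']
```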
