Python: getting an attribute's value with BeautifulSoup
Note: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license and attribute it to the original authors (not me): Stack Overflow
Original source: http://stackoverflow.com/questions/18733023/
Getting attribute's value using BeautifulSoup
Asked by aditya.gupta
I'm writing a Python script that extracts script locations after parsing a web page. Let's say there are two scenarios:
<script type="text/javascript" src="http://example.com/something.js"></script>
and
<script>some JS</script>
I'm able to get the JS in the second scenario, where the JS is written inside the tags.
But is there any way to get the value of src in the first scenario, i.e. to extract the src attribute value of every script tag, such as http://example.com/something.js?
Here's my code:
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
# Prints each whole <script> tag, not just its src value
for n in soup.find_all('script'):
    print(n)
Current output: some JS
Desired output: http://example.com/something.js
Accepted answer by Venkateshwaran Selvaraj
This gets the src value of every <script> tag that has one. Any <script> tag without a src attribute is skipped.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# {"src": True} matches only <script> tags that have a src attribute
sources = soup.find_all('script', {"src": True})
for source in sources:
    print(source['src'])
I get the following two src values as a result:
http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js
I guess this is what you want. Hope this is useful.
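The same filter can be tried without a network request. The following is a minimal sketch, assuming Python 3 with bs4 installed and the standard-library "html.parser" backend; the sample markup, including the /static/local.js path, is made up for illustration and is not taken from rediff.com.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
html = """
<script type="text/javascript" src="http://example.com/something.js"></script>
<script>some JS</script>
<script src="/static/local.js"></script>
"""

soup = BeautifulSoup(html, "html.parser")

# src=True keeps only <script> tags that carry a src attribute
sources = [tag["src"] for tag in soup.find_all("script", src=True)]
print(sources)
# → ['http://example.com/something.js', '/static/local.js']
```

The keyword form src=True should behave the same as the {"src": True} dictionary used in the answer; the inline <script> is simply not matched.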
Answered by rajpy
Get 'src' from each script node with get(), which returns None when the attribute is missing.
import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    # get() returns None for inline scripts that have no src attribute
    print("src:", n.get('src'))
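To make the difference between get() and square-bracket access concrete, here is a small sketch, assuming Python 3 and bs4 with the "html.parser" backend. The two-tag markup mirrors the question's scenarios, and the "(inline)" default string is an arbitrary placeholder chosen for this example.

```python
from bs4 import BeautifulSoup

html = (
    '<script src="http://example.com/something.js"></script>'
    "<script>some JS</script>"
)
soup = BeautifulSoup(html, "html.parser")

# The markup has exactly two <script> tags: one external, one inline
external, inline = soup.find_all("script")

print(external.get("src"))            # the URL from the src attribute
print(inline.get("src"))              # None, because there is no src attribute
print(inline.get("src", "(inline)"))  # a caller-supplied default instead of None

# Square-bracket access raises KeyError when the attribute is missing
missing_raises = False
try:
    inline["src"]
except KeyError:
    missing_raises = True
print(missing_raises)
```

So get() suits loops over mixed tags, while tag['src'] is safer after a filter that guarantees the attribute exists.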
Answered by Ashok Fernandez
This should work: find all the script tags, then check whether each one has a 'src' attribute. If it does, the URL of the JavaScript is in the src attribute; otherwise we assume the JavaScript is written inside the tag.
#!/usr/bin/env python3
from bs4 import BeautifulSoup

# Test HTML which has both cases
html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script> <script>some JS</script>'
soup = BeautifulSoup(html, "html.parser")

# Find all script tags
for n in soup.find_all('script'):
    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']
    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = n.text
    print(javascript)
The output of this is:
http://example.com/something.js
some JS
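If the two cases need to be kept apart rather than printed in one stream, the same check can be wrapped in a function. This is only a sketch under the same assumptions (Python 3, bs4, "html.parser"); split_scripts is a hypothetical helper name introduced here, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def split_scripts(html):
    """Return (external_urls, inline_sources) for the <script> tags in html.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    external_urls, inline_sources = [], []
    for tag in soup.find_all("script"):
        src = tag.get("src")
        if src is not None:
            # External script: the URL lives in the src attribute
            external_urls.append(src)
        elif tag.string:
            # Inline script: the code is the tag's single string child
            inline_sources.append(tag.string.strip())
    return external_urls, inline_sources


html = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    "<script>some JS</script>"
)
urls, inline = split_scripts(html)
print(urls)    # → ['http://example.com/something.js']
print(inline)  # → ['some JS']
```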