Python BeautifulSoup4 get_text still has JavaScript
Declaration: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license, cite the original source and author information, and attribute it to the original authors (not me): StackOverFlow
Original link: http://stackoverflow.com/questions/22799990/
BeautifulSoup4 get_text still has JavaScript
Asked by KVISH
I'm trying to remove all the html/javascript using bs4, however, it doesn't get rid of javascript. I still see it there with the text. How can I get around this?
I tried using nltk, which works fine; however, clean_html and clean_url will be removed moving forward. Is there a way to use soup's get_text and get the same result?
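For reference, the deprecated nltk helper mentioned above was called roughly like this under NLTK 2.x (a sketch of the old API; NLTK 3 removed clean_html and points users to BeautifulSoup instead):

import nltk

raw_html = "<p>Hello <b>world</b><script>var x = 1;</script></p>"
# NLTK 2.x only: clean_html stripped markup (and handled script blocks well,
# per the question). On NLTK 3 this call raises an error instead.
text = nltk.clean_html(raw_html)
print(text)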
I tried looking at these other pages:
BeautifulSoup get_text does not strip all tags and JavaScript
Currently I'm using nltk's deprecated functions.
EDIT
Here's an example:
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
print soup.get_text()
I still see the following for CNN:
$j(function() {
    "use strict";
    if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
        var pushLib = window.safaripushLib,
            current = pushLib.currentPermissions();
        if (current === "default") {
            pushLib.checkPermissions("helloClient", function() {});
        }
    }
});

/*globals MainLocalObj*/
$j(window).load(function () {
    'use strict';
    MainLocalObj.init();
});
How can I remove the js?
The only other options I found are:
https://github.com/aaronsw/html2text
The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is one thing nltk was always very good at.
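For completeness, the html2text fallback mentioned above is typically used through its top-level helper (a sketch, not code from the question; the output is Markdown-flavoured plain text):

import html2text

raw_html = "<p>Hello <a href='http://example.com'>world</a></p>"
# Converts HTML to Markdown-like plain text; per the question, this can be
# noticeably slower than bs4 on large pages.
print(html2text.html2text(raw_html))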
Accepted answer by Hugh Bothwell
Based partly on Can I remove script tags with BeautifulSoup?
import urllib
from bs4 import BeautifulSoup
url = "http://www.cnn.com"
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on runs of two spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
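The snippet above targets Python 2 (urllib.urlopen) and leaves the parser unspecified, which newer bs4 versions warn about. A roughly equivalent Python 3 sketch, assuming urllib.request and the built-in html.parser:

from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get the visible text and normalise whitespace line by line
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = "\n".join(chunk for chunk in chunks if chunk)
print(text)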
Answered by bumpkin
To prevent encoding errors at the end...
import urllib
from bs4 import BeautifulSoup
url = url
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on runs of two spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
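Note that text.encode('utf-8') at the end is a Python 2 workaround for console UnicodeEncodeErrors; on Python 3, print(text) is normally enough. The only other difference from the accepted answer is extract() versus decompose(): extract() detaches the tag and returns it, while decompose() destroys it. A minimal illustration with hypothetical markup (not from the original answers):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello<script>var x = 1;</script></p>", "html.parser")
removed = soup.script.extract()  # detached from the tree, but still usable
print(removed)                   # <script>var x = 1;</script>
print(soup.get_text())           # hello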