Python BeautifulSoup4 get_text still has javascript

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/22799990/

Date: 2020-08-19 01:43:29  Source: igfitidea

BeautifulSoup4 get_text still has javascript

python, beautifulsoup, nltk

Asked by KVISH

I'm trying to remove all the HTML/JavaScript using bs4; however, it doesn't get rid of the JavaScript, which still shows up in the extracted text. How can I get around this?

I tried using nltk, which works fine; however, clean_html and clean_url will be removed going forward. Is there a way to use soup's get_text and get the same result?

I tried looking at these other pages:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the js?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is one thing nltk was always very good about.

Accepted answer by Hugh Bothwell

Based partly on Can I remove script tags with BeautifulSoup?

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
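The cleanup steps above fit naturally into a single reusable helper. A sketch, applied to an inline snippet instead of a live page (the function name html_to_text and the sample HTML are my own):

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip <script>/<style> elements, then tidy the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # kill all script and style elements
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip each line, split headline runs on double spaces, drop blanks
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)

html = "<p>Breaking  news</p><script>var x = 1;</script><style>p {}</style>"
print(html_to_text(html))
```

Only the paragraph text survives; the script and style bodies are gone, and the double-spaced headline is split onto separate lines.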

Answered by bumpkin

To prevent encoding errors at the end...

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # or any other URL
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
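The only structural difference from the accepted answer is extract() in place of decompose(). Both remove the tag from the tree; extract() additionally returns the detached tag, while decompose() destroys it. A small sketch of the difference (the sample HTML is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p><script>bad()</script>", "html.parser")

# extract() detaches the tag from the tree and hands it back,
# so the script no longer appears in the soup's text...
removed = soup.find("script").extract()
print(soup.get_text())

# ...but the extracted tag is still a live object you can inspect.
print(removed.get_text())
```

If you don't need the removed tags afterwards, decompose() is the tidier choice; if you want to log or examine what was stripped, use extract().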