Python BeautifulSoup4 get_text still has javascript

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute the original authors (not me). Original question: http://stackoverflow.com/questions/22799990/

Date: 2020-08-19 01:43:29  Source: igfitidea

BeautifulSoup4 get_text still has javascript

python, beautifulsoup, nltk

Asked by KVISH

I'm trying to remove all the HTML/JavaScript using bs4; however, it doesn't get rid of the JavaScript, which still shows up in the extracted text. How can I get around this?

I tried using nltk, which works fine; however, clean_html and clean_url will be removed going forward. Is there a way to use soup's get_text and get the same result?

I tried looking at these other pages:

BeautifulSoup get_text does not strip all tags and JavaScript

Currently I'm using nltk's deprecated functions.

EDIT

Here's an example:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

I still see the following for CNN:

$j(function() {
"use strict";
if ( window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv() ) {
var pushLib = window.safaripushLib,
current = pushLib.currentPermissions();
if (current === "default") {
pushLib.checkPermissions("helloClient", function() {});
}
}
});

/*globals MainLocalObj*/
$j(window).load(function () {
'use strict';
MainLocalObj.init();
});

How can I remove the js?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it's really, really slow at times and creates noticeable lag, which is one thing nltk was always very good about.

Accepted answer by Hugh Bothwell

Based partly on Can I remove script tags with BeautifulSoup?

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
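The cleanup steps above fit naturally into a single reusable helper. A sketch, applied to an inline snippet instead of a live page (the function name html_to_text and the sample HTML are my own):

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip <script>/<style> elements, then tidy the visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # kill all script and style elements
    for tag in soup(["script", "style"]):
        tag.decompose()
    # strip each line, split headline runs on double spaces, drop blanks
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)

html = "<p>Breaking  news</p><script>var x = 1;</script><style>p {}</style>"
print(html_to_text(html))
```

Only the paragraph text survives; the script and style bodies are gone, and the double-spaced headline is split onto separate lines.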

Answered by bumpkin

To prevent encoding errors at the end...

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # or any other URL
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text.encode('utf-8'))
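The only structural difference from the accepted answer is extract() in place of decompose(). Both remove the tag from the tree; extract() additionally returns the detached tag, while decompose() destroys it. A small sketch of the difference (the sample HTML is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p><script>bad()</script>", "html.parser")

# extract() detaches the tag from the tree and hands it back,
# so the script no longer appears in the soup's text...
removed = soup.find("script").extract()
print(soup.get_text())

# ...but the extracted tag is still a live object you can inspect.
print(removed.get_text())
```

If you don't need the removed tags afterwards, decompose() is the tidier choice; if you want to log or examine what was stripped, use extract().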