Is it possible for Scrapy to get plain text from raw HTML data?
Note: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original question: http://stackoverflow.com/questions/17721782/
Asked by inix
For example:
scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content
Then, I get the following raw HTML code:
<div id="content">
<h2>Welcome to Scrapy</h2>
<h3>What is Scrapy?</h3>
<p>Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from their
pages. It can be used for a wide range of purposes, from data mining to
monitoring and automated testing.</p>
<h3>Features</h3>
<dl>
<dt>Simple</dt>
<dt>
</dt>
<dd>Scrapy was designed with simplicity in mind, by providing the features
you need without getting in your way
</dd>
<dt>Productive</dt>
<dd>Just write the rules to extract the data from web pages and let Scrapy
crawl the entire web site for you
</dd>
<dt>Fast</dt>
<dd>Scrapy is used in production crawlers to completely scrape more than
500 retailer sites daily, all in one server
</dd>
<dt>Extensible</dt>
<dd>Scrapy was designed with extensibility in mind and so it provides
several mechanisms to plug new code without having to touch the framework
core
</dd>
<dt>Portable, open-source, 100% Python</dt>
<dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>
<dt>Batteries included</dt>
<dd>Scrapy comes with lots of functionality built in. Check <a
href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
section</a> of the documentation for a list of them.
</dd>
<dt>Well-documented & well-tested</dt>
<dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
with <a href="http://static.scrapy.org/coverage-report/">very good code
coverage</a></dd>
<dt><a href="/community">Healthy community</a></dt>
<dd>
1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
200 messages per month on mailing list (<a
href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
</dd>
<dt><a href="/support">Commercial support</a></dt>
<dd>A few companies provide Scrapy consulting and support</dd>
<p>Still not sure if Scrapy is what you're looking for?. Check out <a
href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
glance</a>.
</p>
<h3>Companies using Scrapy</h3>
<p>Scrapy is being used in large production environments, to crawl
thousands of sites daily. Here is a list of <a href="/companies/">Companies
using Scrapy</a>.</p>
<h3>Where to start?</h3>
<p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
then <a href="/download/">download Scrapy</a> and follow the <a
href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.
</p></dl>
</div>
But I want to get the plain text directly from Scrapy.
I do not want to use any XPath selectors to extract the p, h2, h3... tags, since I am crawling a website whose main content is embedded, recursively, into table and tbody elements. It can be a tedious task to find the XPath.
Can this be implemented by a built-in function in Scrapy? Or do I need external tools to convert it? I have read through all of Scrapy's docs, but have gained nothing.
This is a sample site which can convert raw HTML into plain text: http://beaker.mailchimp.com/html-to-text
Accepted answer by alecxe
Scrapy doesn't have such functionality built in. html2text is what you are looking for.
Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text

class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(sample))  # Python 3 print syntax
prints:
**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
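For a quick experiment outside a spider, html2text can also be applied directly to an HTML string. A minimal sketch (the snippet below is made up for illustration; ignore_links and body_width are standard HTML2Text options):

import html2text

html = '<p>Scrapy is a <a href="/">fast</a> web crawling framework.</p>'

converter = html2text.HTML2Text()
converter.ignore_links = True  # drop link markup and URLs from the output
converter.body_width = 0       # disable hard line wrapping

print(converter.handle(html))
# prints: Scrapy is a fast web crawling framework.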
Answer by paul trmbrth
Another solution is to use lxml.html's tostring() with the parameter method="text". lxml is used internally in Scrapy. (The parameter encoding=unicode is usually what you want.)
See http://lxml.de/api/lxml.html-module.html for details.
from scrapy.spider import BaseSpider
import lxml.etree
import lxml.html

class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # optionally remove tags that are not usually rendered in browsers
        # javascript, HTML/HEAD, comments; add the tag names you don't want at the end
        lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head")

        # complete text
        print lxml.html.tostring(root, method="text", encoding=unicode)

        # or same as in alecxe's example spider,
        # pinpoint a part of the document using XPath
        #for p in root.xpath("//div[@id='mw-content-text']/p[1]"):
        #    print lxml.html.tostring(p, method="text")
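For reference, the same approach works on a standalone HTML string outside a spider. A minimal sketch (Python 3 spelling, where the encoding argument is passed as the string "unicode"; the HTML snippet is made up for illustration):

import lxml.etree
import lxml.html

html = ('<html><head><script>var x = 1;</script></head>'
        '<body><p>Hello <b>world</b>!</p><!-- a comment --></body></html>')

root = lxml.html.fromstring(html)
# drop content that a browser would not render as text
lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head")

print(lxml.html.tostring(root, method="text", encoding="unicode"))
# prints: Hello world!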
Answer by Reyraa
At this moment, I don't think you need to install any third-party library; Scrapy provides this functionality using selectors. Assume this complex selector:
sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')
we can get the entire text using:
text_content = sel.xpath("//a[1]//text()").extract()
# which results [u'Click here to go to the ', u'Next Page']
then you can join them together easily:
' '.join(text_content)
# Click here to go to the Next Page
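A slightly fuller sketch along the same lines (the HTML snippet is made up for illustration): //text() matches every descendant text node, so stripping and joining the fragments yields one plain-text string:

from scrapy.selector import Selector

sel = Selector(text='<div><p>Hello</p> <p>plain <b>text</b> world</p></div>')

# collect every descendant text node under the div
fragments = sel.xpath('//div//text()').extract()
print(' '.join(f.strip() for f in fragments if f.strip()))
# prints: Hello plain text world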