Python Scrapy: is it possible to get plain text from raw HTML data?

Question

Asked by inix

For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content

Then, I get the following raw HTML code:

<div id="content">


  <h2>Welcome to Scrapy</h2>

  <h3>What is Scrapy?</h3>

  <p>Scrapy is a fast high-level screen scraping and web crawling
    framework, used to crawl websites and extract structured data from their
    pages. It can be used for a wide range of purposes, from data mining to
    monitoring and automated testing.</p>

  <h3>Features</h3>

  <dl>

    <dt>Simple</dt>
    <dt>
    </dt>
    <dd>Scrapy was designed with simplicity in mind, by providing the features
      you need without getting in your way
    </dd>

    <dt>Productive</dt>
    <dd>Just write the rules to extract the data from web pages and let Scrapy
      crawl the entire web site for you
    </dd>

    <dt>Fast</dt>
    <dd>Scrapy is used in production crawlers to completely scrape more than
      500 retailer sites daily, all in one server
    </dd>

    <dt>Extensible</dt>
    <dd>Scrapy was designed with extensibility in mind and so it provides
      several mechanisms to plug new code without having to touch the framework
      core

    </dd>
    <dt>Portable, open-source, 100% Python</dt>
    <dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>

    <dt>Batteries included</dt>
    <dd>Scrapy comes with lots of functionality built in. Check <a
        href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
      section</a> of the documentation for a list of them.
    </dd>

    <dt>Well-documented &amp; well-tested</dt>
    <dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
      with <a href="http://static.scrapy.org/coverage-report/">very good code
        coverage</a></dd>

    <dt><a href="/community">Healthy community</a></dt>
    <dd>
      1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
      700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
      850 questions on StackOverflow (<a href="http://stackoverflow.com/tags/scrapy/info">link</a>)<br>
      200 messages per month on mailing list (<a
        href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
      40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
    </dd>

    <dt><a href="/support">Commercial support</a></dt>
    <dd>A few companies provide Scrapy consulting and support</dd>

    <p>Still not sure if Scrapy is what you're looking for?. Check out <a
        href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
      glance</a>.

    </p>
    <h3>Companies using Scrapy</h3>

    <p>Scrapy is being used in large production environments, to crawl
      thousands of sites daily. Here is a list of <a href="/companies/">Companies
        using Scrapy</a>.</p>

    <h3>Where to start?</h3>

    <p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
      then <a href="/download/">download Scrapy</a> and follow the <a
          href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.


    </p></dl>
</div>

But I want to get plain text directly from Scrapy.

I do not want to use XPath selectors to extract the p, h2, h3... tags, because I am crawling a website whose main content is nested recursively inside table and tbody elements. Finding the right XPath for each piece can be tedious.

Can this be done with a built-in function in Scrapy, or do I need an external tool to convert it? I have read through all of Scrapy's docs but found nothing.

This is a sample site which can convert raw HTML into plain text: http://beaker.mailchimp.com/html-to-text

Answer 1

Accepted answer by alecxe

Scrapy doesn't have such functionality built in. html2text is what you are looking for.

Here's a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text


class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        converter = html2text.HTML2Text()
        converter.ignore_links = True  # keep link text, drop link targets
        print(converter.handle(sample))  # Python 3 print syntax

prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]

Answer 2

Answer by paul trmbrth

Another solution uses lxml.html's tostring() with the parameter method="text". lxml is used internally by Scrapy. (The parameter encoding=unicode, or encoding="unicode" on Python 3, is usually what you want.)

See http://lxml.de/api/lxml.html-module.html for details.

from scrapy.spider import BaseSpider
import lxml.etree
import lxml.html

class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # optionally remove content that browsers do not usually render:
        # JavaScript, the HEAD element, comments; append any other tag names you don't want
        lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head")

        # complete text; encoding="unicode" returns text rather than bytes
        print(lxml.html.tostring(root, method="text", encoding="unicode"))

        # or, same as in alecxe's example spider,
        # pinpoint a part of the document using XPath
        #for p in root.xpath("//div[@id='mw-content-text']/p[1]"):
        #   print(lxml.html.tostring(p, method="text", encoding="unicode"))
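
If lxml is not at hand, the same idea (drop elements that browsers do not render, keep the remaining text) can be sketched with the standard library's html.parser. This is only a rough illustration, not a replacement for lxml or html2text: the names TextExtractor, html_to_text, SKIPPED_TAGS and BLOCK_TAGS are hypothetical, and the tag sets are assumptions that would need adjusting for a real site.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping content that browsers do not render."""

    # Assumption: extend these sets to suit the pages being crawled.
    SKIPPED_TAGS = {"script", "style", "head"}
    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li", "dt", "dd", "tr", "td"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")  # keep neighbouring blocks apart

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace (including newlines) into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


sample = (
    "<html><head><title>Ignored</title></head><body>"
    "<!-- a comment --><script>var x = 1;</script>"
    "<h2>Welcome to Scrapy</h2>"
    "<p>Well-documented &amp; well-tested</p>"
    "</body></html>"
)
print(html_to_text(sample))
# Welcome to Scrapy Well-documented & well-tested
```

Inside a spider, html_to_text(response.text) would be the analogous call. Unlike html2text, this sketch discards all structure (headings, lists, links) and returns one flat string.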

Answer 3

Answer by Reyraa

At this point, I don't think you need to install any third-party library. Scrapy provides this functionality using selectors.
Assume this complex selector:

from scrapy.selector import Selector

sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

We can get the entire text using:

text_content = sel.xpath("//a[1]//text()").extract()
# which results in [u'Click here to go to the ', u'Next Page']

Then you can join them together easily:

# the first fragment already ends with a space, so join without a separator
''.join(text_content)
# 'Click here to go to the Next Page'
