Difference between BeautifulSoup and Scrapy crawler?
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/19687421/
Asked by Nishant Bhakta
I want to build a website that compares product prices on Amazon and eBay. Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with the Scrapy crawler.
Accepted answer by Medeiros
Scrapy is a web-spider, or web-scraper, framework. You give Scrapy a root URL to start crawling, and you can then specify constraints such as how many URLs you want to crawl and fetch. It is a complete framework for web scraping and crawling.
BeautifulSoup, by contrast, is a parsing library. Combined with an HTTP client such as requests or urllib to download the page, it lets you parse specific parts of the content without any hassle. It only handles the content of the URL that you give it and then stops. It does not crawl unless you manually put it inside a loop with a stopping criterion.
In simple words, you can build something similar to Scrapy with Beautiful Soup. Beautiful Soup is a library, while Scrapy is a complete framework.
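To make the "put it inside a loop" point concrete, here is a minimal sketch of a hand-written crawl loop around BeautifulSoup. Everything site-specific is an assumption for illustration: `FAKE_SITE`, `fetch`, and `crawl` are hypothetical names, and the in-memory pages stand in for real HTTP responses so the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<h1>Page A</h1> <a href="/b">B</a>',
    "http://example.test/b": '<h1>Page B</h1> <a href="/">home</a>',
}


def fetch(url):
    """Stand-in for an HTTP client call such as requests.get(url).text."""
    return FAKE_SITE.get(url, "")


def crawl(root_url, max_pages=10):
    """Breadth-first crawl: BeautifulSoup only parses, the loop does the crawling."""
    to_visit = deque([root_url])  # URLs still to be crawled
    seen = {root_url}             # URLs already queued, to avoid revisiting
    pages = {}                    # url -> text of the first <h1>, or None
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()
        soup = BeautifulSoup(fetch(url), "html.parser")
        heading = soup.find("h1")
        pages[url] = heading.get_text(strip=True) if heading else None
        # Queue every link on the page that has not been seen yet.
        for link in soup.find_all("a", href=True):
            next_url = urljoin(url, link["href"])
            if next_url not in seen:
                seen.add(next_url)
                to_visit.append(next_url)
    return pages
```

The queue, the `seen` set, and the `max_pages` limit are the parts a framework like Scrapy would otherwise manage for you.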
Answer by rdenadai
I think both are good... I'm doing a project right now that uses both. First I scrape all the pages using Scrapy and save them to a MongoDB collection using its pipelines, also downloading the images that exist on the page. After that I use BeautifulSoup4 for post-processing, where I must change attribute values and get some special tags.
If you don't know in advance which product pages you want, Scrapy is a good tool, since you can use its crawlers to run through the whole Amazon/eBay website looking for the products without writing an explicit for loop.
Take a look at the Scrapy documentation; it's very simple to use.
Answer by baldnbad
The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.
The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.
Answer by Arun Augustine
Both are used to parse data.
Scrapy:
- Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
- It has some limitations when data is rendered by JavaScript or loaded dynamically; we can overcome them by using packages like Splash, Selenium, etc.
BeautifulSoup:
- Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- We can also use it on JavaScript-driven or dynamically loading pages, once the rendered HTML has been obtained (for example through Selenium), since Beautiful Soup itself does not execute JavaScript.
Scrapy with BeautifulSoup is one of the best combinations we can work with for scraping static and dynamic content.
Answer by ethirajit
Using Scrapy you can save tons of code and start with structured programming. If you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in their place. A big project can take advantage of both.
Answer by krish___na
The differences are many, and the selection of any tool or technology depends on individual needs.
A few major differences are:
- BeautifulSoup is comparatively easier to learn than Scrapy.
- The extensions, support, and community are larger for Scrapy than for BeautifulSoup.
- Scrapy should be considered a spider, while BeautifulSoup is a parser.
Answer by Amit
Scrapy: It is a web scraping framework that comes with tons of goodies that make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.
- Feed exports: These allow us to save data in various formats like CSV, JSON, JSON Lines, and XML.
- Asynchronous scraping: Scrapy is built on the Twisted framework, which lets us visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don't have to wait for one request to finish before sending another).
- Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the web page (a heading, a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
- Setting proxy, user agent, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
- Item pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to a MySQL server.
- Cookies: Scrapy automatically handles cookies for us.
- etc.
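Several of the features listed above are switched on through a project's settings module. The fragment below is a sketch only: the setting names are ones Scrapy documents, but every value, file name, and the `myproject.pipelines.PriceCleanupPipeline` path is a hypothetical placeholder chosen for illustration. Proxy rotation is left out because it is usually handled per request or by a middleware rather than by a single setting.

```python
# settings.py (sketch): all values below are illustrative assumptions.

# Feed exports: write scraped items to files in several formats.
FEEDS = {
    "prices.json": {"format": "json"},
    "prices.csv": {"format": "csv"},
    "prices.jl": {"format": "jsonlines"},
}

# Asynchronous scraping: how many requests may be in flight at once.
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5  # seconds to wait between requests to the same site

# User agent and extra headers sent with every request.
USER_AGENT = "price-compare-bot (+http://example.test/contact)"
DEFAULT_REQUEST_HEADERS = {"Accept-Language": "en"}

# Cookies are handled by a built-in middleware; True is already the default.
COOKIES_ENABLED = True

# Item pipelines: dotted path -> order (lower numbers run first).
ITEM_PIPELINES = {
    "myproject.pipelines.PriceCleanupPipeline": 300,  # hypothetical class
}

# Constraints on the crawl: stop after N pages or beyond a given link depth.
CLOSESPIDER_PAGECOUNT = 100
DEPTH_LIMIT = 3
```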
TL;DR: Scrapy is a framework that provides everything one might need to build large-scale crawls. It provides various features that hide the complexity of crawling the web, so one can simply start writing web crawlers without worrying about the setup burden.
Beautiful Soup: Beautiful Soup is a Python package for parsing HTML and XML documents, so you can use it to parse a web page that has already been downloaded. BS4 is very popular and has been around for a long time. Unlike Scrapy, you cannot build a crawler with Beautiful Soup alone; you need other libraries such as requests or urllib. This means you have to manage the lists of URLs already crawled and still to be crawled, handle cookies, manage proxies, handle errors, and write your own functions to export data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries such as multiprocessing.
To sum up:
Scrapy is a rich framework that you can use to start writing crawlers without any hassle.
Beautiful Soup is a library that you can use to parse a web page. It cannot be used alone to scrape the web.
You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (using cron jobs or Celery to schedule crawls), and update the prices in your database. This way your website will always pull from the database, and the crawler and the database will act as independent components.
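The separation described above, where the crawler writes prices and the website only reads them, can be sketched with a small database layer. This is a minimal sketch under stated assumptions: it uses the standard-library sqlite3 module instead of a production database, and the table layout and the function names `open_price_db`, `save_price`, and `cheapest_offer` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def open_price_db(path=":memory:"):
    """Open the database and make sure the (hypothetical) prices table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "url TEXT PRIMARY KEY, store TEXT, price REAL, updated_at TEXT)"
    )
    return conn


def save_price(conn, url, store, price):
    """Called by the crawler: insert or overwrite the latest price for a URL."""
    conn.execute(
        "INSERT OR REPLACE INTO prices (url, store, price, updated_at) "
        "VALUES (?, ?, ?, ?)",
        (url, store, price, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def cheapest_offer(conn):
    """Called by the website: a read-only query that never runs the crawler."""
    return conn.execute(
        "SELECT store, price FROM prices ORDER BY price LIMIT 1"
    ).fetchone()
```

A daily scheduled crawl would call `save_price` for each product URL, while the website calls only `cheapest_offer`, so either side can be changed or restarted without touching the other.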
Answer by Jaskaran Singh
BeautifulSoup is a library that lets you extract information from a web page.
Scrapy, on the other hand, is a framework that does the above and many more things you will probably need in your scraping project, like pipelines for saving data.
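As an illustration of what a pipeline for saving data can look like, here is a minimal sketch that writes each item as one JSON line. The method names follow the item pipeline interface that Scrapy documents; the class name `JsonLinesPipeline`, the `path` parameter, and the default file name are assumptions made for this example. The class is plain Python, so it can be tried out without running a crawl.

```python
import json


class JsonLinesPipeline:
    """Item pipeline sketch: writes every scraped item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path  # hypothetical output file
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; must return the item for later pipelines.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

In a Scrapy project the class would be enabled by listing its dotted path in the `ITEM_PIPELINES` setting.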
You can check this blog to get started with Scrapy: https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/