Selenium versus BeautifulSoup for web scraping in Python
Original question: http://stackoverflow.com/questions/17436014/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Selenium versus BeautifulSoup for web scraping
Asked by elie
I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize, but I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.
Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?
And in this particular case, I need to use Selenium to click on the JavaScript button, so is it better to use Selenium to parse as well, or should I use both Selenium and Beautiful Soup?
Accepted answer by Mark Amery
Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too); a minimal sketch of this lighter approach follows the list below. The advantages of not using Selenium needlessly:
- Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
- Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using requests.
- Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.
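To make the lightweight approach above concrete, here is a minimal sketch using Requests and BeautifulSoup against a static page. The URL and the CSS selector are placeholders invented for illustration, not anything from the original question:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target page - substitute the site you actually want to scrape.
URL = "https://example.com/articles"

resp = requests.get(URL, timeout=10)
resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page

soup = BeautifulSoup(resp.text, "html.parser")

# Collect the text of every matching element; "h2.title" is a placeholder selector.
titles = [el.get_text(strip=True) for el in soup.select("h2.title")]
print(titles)
```

If something goes wrong here, a simple try/except retry loop around requests.get is all the error recovery you need - the stability point made in the list above.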
Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using cookielib/cookiejar (merged into http.cookiejar in Python 3).
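One lightweight way to get that cookie handling is a requests session, which keeps a cookie jar internally and replays stored cookies on subsequent requests. The login endpoint and credentials below are made-up placeholders:

```python
import requests

session = requests.Session()  # maintains a cookiejar across requests

# Hypothetical login form - adjust the URL and field names to the real site.
session.post("https://example.com/login", data={"user": "me", "password": "secret"})

# Later requests on the same session automatically carry the cookies set above.
page = session.get("https://example.com/members-only")
print(page.status_code)
```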
Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:
- JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using regex (see the sketch after this list).
- The JavaScript is hitting a web API to load content. In this case, consider if you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page.
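As a sketch of the first scenario, suppose the page embeds its data as a JavaScript literal inside a script tag. The variable name pageData and the shape of the data are invented for illustration, and this only works cleanly when the literal happens to be valid JSON:

```python
import json
import re

import requests

html = requests.get("https://example.com/page").text  # placeholder URL

# Hypothetical pattern: the page contains `var pageData = {...};` in a <script>.
# re.S lets `.` match newlines inside the object literal.
match = re.search(r"var\s+pageData\s*=\s*(\{.*?\});", html, re.S)
if match:
    data = json.loads(match.group(1))  # valid only if the literal is JSON-compatible
    print(data.get("items"))
```

For the second scenario, the same requests.get pointed at the API URL you find in your browser's network tab will usually return clean JSON directly, with no HTML parsing at all.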
If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
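Here is a minimal headless sketch that also answers the original question's "use both?" directly: drive the browser with Selenium, then hand the rendered DOM to BeautifulSoup for parsing. It uses the Selenium 4 locator API (find_element(By.XPATH, ...) replaced the older find_element_by_xpath), and the URL, button locator, and result selector are all placeholders:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")  # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/page")  # placeholder URL

    # Click the hypothetical button that injects content via JavaScript.
    driver.find_element(By.XPATH, "//button[@id='load-more']").click()

    # Wait until the JavaScript-added content actually appears in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.result"))
    )

    # Hand the rendered page to BeautifulSoup for the actual parsing.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print([el.get_text(strip=True) for el in soup.select("div.result")])
finally:
    driver.quit()  # always release the browser process
```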
Answered by Rio
I used Selenium for web scraping, but it was not a happy solution. In my last project I used https://github.com/chromedp/chromedp. It is a simpler solution than Selenium.
Answered by LiamooT
I would recommend using Selenium for things such as interacting with web pages, whether in a full-blown browser or a browser in headless mode, such as headless Chrome. I would also say that Beautiful Soup is better for observing and writing statements that depend on whether an element is found, or what is found, and then using Selenium to execute interactive tasks with the page if the user desires.