
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38619478/

Date: 2020-08-19 21:12:50  Source: igfitidea

Google Search Web Scraping with Python

Tags: python, python-2.7, google-search, google-search-api

Asked by pbell

I've been learning a lot of python lately to work on some projects at work.

Currently I need to do some web scraping with google search results. I found several sites that demonstrated how to use ajax google api to search, however after attempting to use it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.

Accepted answer by StuxCrystal

You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query>; this will return the top 10 search results.

Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree via a CSS selector (.r a) or an XPath selector (//h3[@class="r"]/a).

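As a minimal, self-contained illustration of the XPath variant (the HTML snippet here is a made-up stand-in shaped like the older Google result markup; the real page structure changes frequently, so treat any such selector as fragile):

```python
from lxml.html import fromstring

# Static snippet mimicking the (older) Google result markup.
snippet = '<div><h3 class="r"><a href="/url?q=https://example.com">Example</a></h3></div>'
tree = fromstring(snippet)

# Select the href attribute of every result anchor.
hrefs = tree.xpath('//h3[@class="r"]/a/@href')
print(hrefs)  # → ['/url?q=https://example.com']
```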

In some cases the resulting URL will redirect through Google. Usually it contains a query parameter q which holds the actual target URL.

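Unwrapping such a redirect URL takes only the standard library (the href value below is a made-up illustration):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href as it might appear in the result markup.
href = "/url?q=https://example.com/page&sa=U"

# parse_qs returns a dict of lists, so take the first value of "q".
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # → https://example.com/page
```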

Example code using lxml and requests:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

# Fetch the results page and parse it into an element tree.
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    # Google often wraps result links as /url?q=<target>&...;
    # unwrap those to recover the actual target URL.
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)["q"][0]
    print(url)

A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. It will respond with a 503 if it thinks you are a bot.

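One way to stay on the safe side is to back off when a 503 arrives. A minimal sketch (the function name and retry policy are my own, not from the answer):

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=2.0):
    """Call fetch() until it returns a non-503 response, sleeping
    exponentially longer between attempts. fetch must return an
    object with a status_code attribute (e.g. a requests.Response)."""
    response = None
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code != 503:
            break
        # 503 is Google's rate-limit signal: wait before retrying.
        time.sleep(base_delay * (2 ** attempt))
    return response
```

Usage would look like `fetch_with_backoff(lambda: get("https://www.google.com/search?q=StackOverflow"))`; passing the request as a callable keeps the retry logic independent of any particular HTTP library.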

Answer by LeitnerChristoph

Here is another service that can be used for scraping SERPs (https://zenserp.com). It does not require a client library and is cheaper.

Here is a Python code sample:

import requests

headers = {
    'apikey': '',  # your zenserp API key goes here
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)

Answer by Hartator

You can also use a third-party service like Serp API, which provides Google search engine results. It solves the problem of being blocked, and you don't have to rent proxies or parse the results yourself.

It's easy to integrate with Python:

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()

GitHub: https://github.com/serpapi/google-search-results-python

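Once you have the dictionary, pulling out the organic links is straightforward. A hedged sketch: the `organic_results`, `title`, and `link` field names reflect SerpApi's documented response shape, but may differ between API versions, so check the library's docs before relying on them:

```python
def extract_links(results):
    """Pull (title, link) pairs from a SerpApi-style result dict.
    Missing fields come back as None instead of raising."""
    return [(r.get("title"), r.get("link"))
            for r in results.get("organic_results", [])]

# Made-up sample dictionary shaped like a SerpApi response.
sample = {"organic_results": [
    {"title": "Coffee - Wikipedia",
     "link": "https://en.wikipedia.org/wiki/Coffee"},
]}
print(extract_links(sample))
```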