
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/38619478/

Date: 2020-08-19 21:12:50  Source: igfitidea

Google Search Web Scraping with Python

Tags: python, python-2.7, google-search, google-search-api

Asked by pbell

I've been learning a lot of python lately to work on some projects at work.

Currently I need to do some web scraping with google search results. I found several sites that demonstrated how to use ajax google api to search, however after attempting to use it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.

Accepted answer by StuxCrystal

You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query>; this will return the top 10 search results.

Then you can use lxml, for example, to parse the page. Depending on what you use, you can query the resulting node tree via a CSS selector (.r a) or an XPath selector (//h3[@class="r"]/a).

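As a minimal, self-contained illustration of the XPath variant (the HTML snippet here is a made-up stand-in shaped like the older Google result markup; the real page structure changes frequently, so treat any such selector as fragile):

```python
from lxml.html import fromstring

# Static snippet mimicking the (older) Google result markup.
snippet = '<div><h3 class="r"><a href="/url?q=https://example.com">Example</a></h3></div>'
tree = fromstring(snippet)

# Select the href attribute of every result anchor.
hrefs = tree.xpath('//h3[@class="r"]/a/@href')
print(hrefs)  # → ['/url?q=https://example.com']
```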

In some cases the resulting URL will redirect through Google. Usually it contains a query parameter q which holds the actual target URL.

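Unwrapping such a redirect URL takes only the standard library (the href value below is a made-up illustration):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href as it might appear in the result markup.
href = "/url?q=https://example.com/page&sa=U"

# parse_qs returns a dict of lists, so take the first value of "q".
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # → https://example.com/page
```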

Example code using lxml and requests:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

# Fetch the results page and parse it into an element tree.
raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    # Google often wraps result links as /url?q=<target>&...;
    # unwrap those to recover the actual target URL.
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)["q"][0]
    print(url)

A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. It will respond with a 503 if it thinks you are a bot.

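One way to stay on the safe side is to back off when a 503 arrives. A minimal sketch (the function name and retry policy are my own, not from the answer):

```python
import time

def fetch_with_backoff(fetch, max_retries=3, base_delay=2.0):
    """Call fetch() until it returns a non-503 response, sleeping
    exponentially longer between attempts. fetch must return an
    object with a status_code attribute (e.g. a requests.Response)."""
    response = None
    for attempt in range(max_retries):
        response = fetch()
        if response.status_code != 503:
            break
        # 503 is Google's rate-limit signal: wait before retrying.
        time.sleep(base_delay * (2 ** attempt))
    return response
```

Usage would look like `fetch_with_backoff(lambda: get("https://www.google.com/search?q=StackOverflow"))`; passing the request as a callable keeps the retry logic independent of any particular HTTP library.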

Answer by LeitnerChristoph

Here is another service that can be used for scraping SERPs (https://zenserp.com). It does not require a client library and is cheaper.

Here is a Python code sample:

import requests

headers = {
    'apikey': '',  # your zenserp API key goes here
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)

Answer by Hartator

You can also use a third-party service like Serp API, which provides Google search engine results. It solves the problem of being blocked, and you don't have to rent proxies or parse the results yourself.

It's easy to integrate with Python:

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()

GitHub: https://github.com/serpapi/google-search-results-python

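Once you have the dictionary, pulling out the organic links is straightforward. A hedged sketch: the `organic_results`, `title`, and `link` field names reflect SerpApi's documented response shape, but may differ between API versions, so check the library's docs before relying on them:

```python
def extract_links(results):
    """Pull (title, link) pairs from a SerpApi-style result dict.
    Missing fields come back as None instead of raising."""
    return [(r.get("title"), r.get("link"))
            for r in results.get("organic_results", [])]

# Made-up sample dictionary shaped like a SerpApi response.
sample = {"organic_results": [
    {"title": "Coffee - Wikipedia",
     "link": "https://en.wikipedia.org/wiki/Coffee"},
]}
print(extract_links(sample))
```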