Google Search Example in a Java Program
Date: 2020-02-23 14:41:22 Source: igfitidea
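Google once offered a Web Search API, but it was deprecated long ago, and there is now no standard way to run a Google search from a program.
At its core, a Google search is an HTTP GET request with the query parameters carried in the URL, and as we have seen earlier, there are several ways to send one, such as Java HttpURLConnection or Apache HttpClient.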
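To make the "query parameters in the URL" point concrete, here is a minimal sketch that only builds the request URL and sends nothing over the network. The class name SearchUrlBuilder and the method build are hypothetical names chosen for illustration; the q and num parameters are the ones the program in this article uses. The search term is URL-encoded so that spaces and characters such as & do not break the query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original program.
final class SearchUrlBuilder {

    static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";

    // Builds the GET URL for a search term and a requested number of results.
    static String build(String searchTerm, int num) {
        try {
            // Form encoding: a space becomes '+', '&' becomes %26, '+' becomes %2B
            String encodedTerm = URLEncoder.encode(searchTerm, "UTF-8");
            return GOOGLE_SEARCH_URL + "?q=" + encodedTerm + "&num=" + num;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("java jsoup", 10));
        System.out.println(build("c++ & java", 5));
    }
}
```

Running main prints https://www.google.com/search?q=java+jsoup&num=10 and https://www.google.com/search?q=c%2B%2B+%26+java&num=5. The resulting string can be handed to any HTTP client.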
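The harder part is parsing the HTML response and extracting useful information from it.
For that reason I chose jsoup, an open source HTML parser that can also fetch the HTML from a given URL.
Below is a simple program that fetches Google search results from a Java program and then parses the response to pick out the individual results.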
package com.theitroad.jsoup;

import java.io.IOException;
import java.net.URLEncoder;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleSearchJava {

    public static final String GOOGLE_SEARCH_URL = "https://www.google.com/search";

    public static void main(String[] args) throws IOException {
        // Take the search term and the number of results from the console
        Scanner scanner = new Scanner(System.in);
        System.out.println("Please enter the search term.");
        String searchTerm = scanner.nextLine();
        System.out.println("Please enter the number of results. Example: 5 10 20");
        int num = scanner.nextInt();
        scanner.close();

        // URL-encode the term so spaces and special characters survive in the query string
        String searchURL = GOOGLE_SEARCH_URL + "?q=" + URLEncoder.encode(searchTerm, "UTF-8") + "&num=" + num;

        // Without a proper User-Agent, we will get a 403 error
        Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();

        // The line below prints the HTML data; save it to a file and open it in a browser to compare
        // System.out.println(doc.html());

        // If Google changes the results HTML, e.g. <h3 class="r"> to <h3 class="r1">,
        // the selector below must be changed accordingly
        Elements results = doc.select("h3.r > a");

        for (Element result : results) {
            String linkHref = result.attr("href");
            String linkText = result.text();
            // Each href starts with "/url?q=", so starting at index 6 keeps the "="
            // in front of the target URL (start at 7 to drop it)
            System.out.println("Text::" + linkText + ", URL::" + linkHref.substring(6, linkHref.indexOf("&")));
        }
    }
}
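Below is sample output from the program above. I saved the HTML data to a file and opened it in a browser to confirm that the output is what we want.
Compare the output with the image below.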
Please enter the search term.
theitroad
Please enter the number of results. Example: 5 10 20
20
Text::theitroad, URL::=https://www.theitroad.local/
Text::Java Interview Questions, URL::=https://www.theitroad.local/java-interview-questions
Text::Java design patterns, URL::=https://www.theitroad.local/tag/java-design-patterns
Text::Tutorials, URL::=https://www.theitroad.local/tutorials
Text::Java servlet, URL::=https://www.theitroad.local/tag/java-servlet
Text::Spring Framework Tutorial ..., URL::=https://www.theitroad.local/2888/spring-tutorial-spring-core-tutorial
Text::Java Design Patterns PDF ..., URL::=https://www.theitroad.local/6308/java-design-patterns-pdf-ebook-free-download-130-pages
Text::hyman Kumar (@theitroad) | Twitter, URL::=https://twitter.com/theitroad
Text::theitroad | Facebook, URL::=https://www.facebook.com/theitroad
Text::theitroad - Chrome Web Store - Google, URL::=https://chrome.google.com/webstore/detail/theitroad/ckdhakodkbphniaehlpackbmhbgfmekf
Text::Debian -- Details of package libsystemd-journal-dev in wheezy, URL::=https://packages.debian.org/wheezy/libsystemd-journal-dev
Text::Debian -- Details of package libsystemd-journal-dev in wheezy ..., URL::=https://packages.debian.org/wheezy-backports/libsystemd-journal-dev
Text::Debian -- Details of package libsystemd-journal-dev in sid, URL::=https://packages.debian.org/sid/libsystemd-journal-dev
Text::Debian -- Details of package libsystemd-journal-dev in jessie, URL::=https://packages.debian.org/jessie/libsystemd-journal-dev
Text::Ubuntu – Details of package libsystemd-journal-dev in trusty, URL::=https://packages.ubuntu.com/trusty/libsystemd-journal-dev
Text::libsystemd-journal-dev : Utopic (14.10) : Ubuntu - Launchpad, URL::=https://launchpad.net/ubuntu/utopic/%2Bpackage/libsystemd-journal-dev
Text::Debian -- Details of package libghc-libsystemd-journal-dev in jessie, URL::=https://packages.debian.org/jessie/libghc-libsystemd-journal-dev
Text::Advertise on theitroad | BuySellAds, URL::=https://buysellads.com/buy/detail/231824
Text::theitroad | LinkedIn, URL::=https://www.linkedin.com/groups/theitroad-6748558
Text::How to install libsystemd-journal-dev package in Ubuntu Trusty, URL::=https://www.howtoinstall.co/en/ubuntu/trusty/main/libsystemd-journal-dev/
Text::[global] auth supported = cephx ms bind ipv6 = true [mon] mon data ..., URL::=https://zooi.widodh.nl/ceph/ceph.conf
Text::UbuntuUpdates - Package "libsystemd-journal-dev" (trusty 14.04), URL::=https://www.ubuntuupdates.org/libsystemd-journal-dev
Text::[Journal]Dev'err - Cursus Honorum - Enjin, URL::=https://cursushonorum.enjin.com/holonet/m/23958869/viewthread/13220130-theitroaderr/post/last
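Every URL in the output above starts with a stray "=". The program cuts each href with substring(6, ...), which suggests that Google wraps each link as /url?q=<target>&<other parameters>, a prefix that is seven characters long. The following is a minimal sketch of a more defensive extraction under that assumption. The class name GoogleHrefParser, the method extractTarget, and the sample hrefs in main are hypothetical and only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; assumes hrefs of the form /url?q=<target>&...
final class GoogleHrefParser {

    private static final String PREFIX = "/url?q=";

    // Returns the target URL from a redirect-style href, or the href itself
    // when it does not start with the assumed prefix.
    static String extractTarget(String href) {
        if (!href.startsWith(PREFIX)) {
            return href;
        }
        // The target ends at the first '&' after the prefix, or at the end of the string
        int end = href.indexOf('&', PREFIX.length());
        String raw = (end >= 0)
                ? href.substring(PREFIX.length(), end)
                : href.substring(PREFIX.length());
        try {
            // The query value is percent-encoded, so decode it (%20 becomes a space)
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Sample hrefs are made up for this sketch
        System.out.println(extractTarget("/url?q=https://www.theitroad.local/&sa=U"));
        System.out.println(extractTarget("/url?q=https://example.com/a%20b"));
        System.out.println(extractTarget("https://example.com/"));
    }
}
```

Unlike substring(6, linkHref.indexOf("&")), this version does not throw when an href contains no "&", and it leaves links that are not in the redirect form untouched.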