Options for web scraping - C++ version only

Notice: this page is a Chinese-English parallel translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/834768/

Date: 2020-08-27 17:30:03  Source: igfitidea

Tags: c++, screen-scraping

Asked by Piotr Dobrogost

I'm looking for a good C++ library for web scraping.
It has to be C/C++ and nothing else, so please do not direct me to "Options for HTML scraping" or other SO questions/answers where C++ is not even mentioned.

Answer by Kyle Simek

Answer by Halcyon

Use the myhtml C/C++ parser; it is dead simple and very fast. It has no dependencies except C99, and it has CSS selectors built in (the original answer links to an example).

Answer by DanielB

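// Download WinHttpClient.h first (a third-party wrapper around WinHTTP; Windows only).
// Boost is also required for boost::trim.
// --------------------------------
#include <algorithm>                    // std::transform
#include <cctype>                       // std::toupper
#include <string>
#include <boost/algorithm/string.hpp>   // boost::trim
#include <winhttp/WinHttpClient.h>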

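// Copies the text between the first tg1 found at or after `next` and the
// tg2 that follows it into `value` (trimmed). Returns true only if both
// markers were found.
bool substrexvealue(const std::wstring& html,const std::string& tg1,const std::string& tg2,std::string& value, long& next) {
    std::wstring wtg1(tg1.begin(),tg1.end());
    std::wstring wtg2(tg2.begin(),tg2.end());

    std::wstring::size_type p1=html.find(wtg1,next);
    if(p1==std::wstring::npos) return false;
    p1+=wtg1.size();                                 // skip past the opening marker
    std::wstring::size_type p2=html.find(wtg2,p1);   // look for the closing marker after it
    if(p2==std::wstring::npos) return false;
    std::wstring wtmp=html.substr(p1,p2-p1);         // everything between the two markers
    value=std::string(wtmp.begin(),wtmp.end());      // narrowing copy: only safe for ASCII text
    boost::trim(value);
    next=static_cast<long>(p1+1);                    // resume the next search after this match
    return true;
}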
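// Finds `tag` at or after `next` and copies the text between the next '>'
// and the following '<' into `value` (trimmed). Returns false if the tag or
// either delimiter is missing.
bool extractvalue(const std::wstring& html,const std::string& tag,std::string& value, long& next) {
    std::wstring wtag(tag.begin(),tag.end());

    std::wstring::size_type p1=html.find(wtag,next);
    if(p1==std::wstring::npos) return false;
    std::wstring::size_type p2=html.find(L">",p1+wtag.size()-1);
    if(p2==std::wstring::npos) return false;
    std::wstring::size_type p3=html.find(L"<",p2+1);
    if(p3==std::wstring::npos) return false;
    std::wstring wtmp=html.substr(p2+1,p3-p2-1);     // text between '>' and '<'
    value=std::string(wtmp.begin(),wtmp.end());      // narrowing copy: only safe for ASCII text
    boost::trim(value);
    next=static_cast<long>(p1+1);                    // resume the next search after this match
    return true;
}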
// Downloads `url` and stores the response header and body.
// Returns true if the request was sent successfully.
bool GetHTML(const std::string& url,std::wstring& header,std::wstring& html) {
    std::wstring wurl = std::wstring(url.begin(),url.end());
    bool ret=false;
    try {
        WinHttpClient client(wurl.c_str());
        std::string url_protocol=url.substr(0,5);
        std::transform(url_protocol.begin(), url_protocol.end(), url_protocol.begin(), (int (*)(int))std::toupper);
        // Note: this turns off certificate validation for HTTPS, which is insecure.
        if(url_protocol=="HTTPS")    client.SetRequireValidSslCertificates(false);
        client.SetUserAgent(L"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        if(client.SendHttpRequest()) {
            header = client.GetResponseHeader();
            html = client.GetResponseContent();
            ret=true;
        }
    }catch(...) {
        header=L"Error";
        html=L"";
    }
    return ret;
}
int main() {
    std::string url = "http://www.google.fr";
    std::wstring header,html;
    // Exit code 0 on success, 1 if the download failed.
    return GetHTML(url,header,html) ? 0 : 1;
}
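
The marker-based extraction used in this answer can also be tried without WinHTTP or Boost. The following is a minimal, portable sketch of the same idea using only the standard library; `extract_between` and `extract_all` are hypothetical names chosen for illustration and are not part of the original answer or of any library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): finds the first `open`
// marker at or after `pos`, then the next `close` marker after it, and copies
// the text in between into `value`. On success `pos` is advanced past the
// closing marker so that repeated calls walk through the document.
bool extract_between(const std::string& html, const std::string& open,
                     const std::string& close, std::size_t& pos,
                     std::string& value) {
    const std::size_t start = html.find(open, pos);
    if (start == std::string::npos) return false;
    const std::size_t begin = start + open.size();
    const std::size_t end = html.find(close, begin);
    if (end == std::string::npos) return false;
    value = html.substr(begin, end - begin);
    pos = end + close.size();
    return true;
}

// Hypothetical helper: collects every value enclosed by the two markers.
std::vector<std::string> extract_all(const std::string& html,
                                     const std::string& open,
                                     const std::string& close) {
    std::vector<std::string> values;
    std::size_t pos = 0;
    std::string value;
    while (extract_between(html, open, close, pos, value)) {
        values.push_back(value);
    }
    return values;
}
```

For example, calling `extract_all(page, "<li>", "</li>")` on a downloaded page collects the text of each list item. Plain string searching like this is fragile on real-world HTML (nested tags, attributes, comments), so it is best kept for small, predictable pages.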

Answer by StereoMatching

I recommend Qt 5.6.2. This powerful library offers:

  1. A high-level, intuitive, asynchronous network API: QNetworkAccessManager, QNetworkReply, QNetworkProxy, etc.
  2. A powerful regex class, QRegularExpression
  3. A decent web engine, QtWebEngine
  4. A robust, mature GUI toolkit, QWidgets
  5. Most of the Qt5 API is well designed, and signals and slots make asynchronous code much easier to write
  6. Great Unicode support
  7. A feature-rich file system library. Creating, removing, or renaming files, or finding the standard path to save files, is a piece of cake in Qt5
  8. The asynchronous API of QNetworkAccessManager makes it easy to start many download requests at once
  9. It runs on the major desktop platforms (Windows, macOS, and Linux): write once, compile anywhere, with a single code base
  10. Easy to deploy on Windows and Mac (on Linux, linuxdeployqt may save a lot of trouble)
  11. Easy to install on Windows, Mac, and Linux
  12. And so on

I have already written an image scraper app with Qt5; it can scrape almost every image returned by a Google, Bing, or Yahoo search.
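
To illustrate the regex step of an image scraper without requiring a Qt installation, here is a small sketch that uses the standard library's `std::regex` as a stand-in for QRegularExpression. It is not taken from the author's project; the function name `find_image_urls` and the pattern are assumptions made for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer's project): returns the value of
// the src attribute of every <img> tag found in `html`.
std::vector<std::string> find_image_urls(const std::string& html) {
    // Case-insensitive: "<img", any attributes, then src="..." or src='...'.
    static const std::regex img_src(
        R"rx(<img\b[^>]*\bsrc\s*=\s*["']([^"']+)["'])rx",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), img_src), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());  // capture group 1 holds the URL
    }
    return urls;
}
```

A regular expression is not a full HTML parser: this pattern ignores unquoted attribute values and would also pick up attributes such as `data-src`, so a real scraper may prefer a proper parser or a web engine for pages that build their content with JavaScript.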

For more details, please visit my GitHub project. I wrote a high-level overview of how to scrape data with Qt5 on my blog (it is too long to post on Stack Overflow).
