Programmatically reading a web page in C++

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/389069/



Tags: c++, c, http

Asked by Howard May

I want to write a program in C/C++ that will dynamically read a web page and extract information from it. As an example, imagine you wanted to write an application to follow and log an eBay auction. Is there an easy way to grab the web page? A library which provides this functionality? And is there an easy way to parse the page to get the specific data?


Answered by Gant

Have a look at the cURL library:


 #include <stdio.h>
 #include <curl/curl.h>

 int main(void)
 {
   CURL *curl;
   CURLcode res;

   curl_global_init(CURL_GLOBAL_DEFAULT);
   curl = curl_easy_init();
   if(curl) {
     curl_easy_setopt(curl, CURLOPT_URL, "https://curl.haxx.se");
     /* perform the request; by default the response body goes to stdout */
     res = curl_easy_perform(curl);
     if(res != CURLE_OK)
       fprintf(stderr, "curl_easy_perform() failed: %s\n",
               curl_easy_strerror(res));
     /* always cleanup */
     curl_easy_cleanup(curl);
   }
   curl_global_cleanup();
   return 0;
 }

Compile and link against libcurl, e.g. gcc example.c -lcurl.

BTW, if C++ is not strictly required, I encourage you to try C# or Java. Both make this much easier and have built-in HTTP support.


Answered by Software_Designer

Windows code:


#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <cstring>
#pragma comment(lib,"ws2_32.lib")
using namespace std;

int main (){
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        return 1;
    }
    SOCKET Socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    struct hostent *host = gethostbyname("www.google.com");
    if (host == NULL) {
        cout << "Could not resolve host.\n";
        WSACleanup();
        return 1;
    }
    SOCKADDR_IN SockAddr;
    SockAddr.sin_port = htons(80);
    SockAddr.sin_family = AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);
    cout << "Connecting...\n";
    if (connect(Socket, (SOCKADDR*)(&SockAddr), sizeof(SockAddr)) != 0) {
        cout << "Could not connect.\n";
        closesocket(Socket);
        WSACleanup();
        return 1;
    }
    cout << "Connected.\n";
    const char *request =
        "GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n";
    send(Socket, request, (int)strlen(request), 0);
    char buffer[10000];
    int nDataLength;
    while ((nDataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0) {
        /* print only the bytes actually received this iteration */
        cout.write(buffer, nDataLength);
    }
    closesocket(Socket);
    WSACleanup();
    return 0;
}

Answered by Rob

There is a free TCP/IP library available for Windows that supports HTTP and HTTPS - using it is very straightforward.


Ultimate TCP/IP


CUT_HTTPClient http;
http.GET("http://folder/file.htm", "c:/tmp/process_me.htm");    

You can also GET files and store them in a memory buffer (via CUT_DataSource-derived classes). All the usual HTTP support is there - PUT, HEAD, etc. Support for proxy servers is a breeze, as are secure sockets.


Answered by Johann Gerell

You're not mentioning any platform, so I'll give you an answer for Win32.


One simple way to download anything from the Internet is URLDownloadToFile with the IBindStatusCallback parameter set to NULL. To make the function more useful, the callback interface needs to be implemented.

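A minimal sketch of that call might look like the following. This assumes a Windows build environment linking against urlmon.lib; the URL and output filename are placeholders:

```cpp
// Sketch: download a URL to a local file with URLDownloadToFile.
// Windows-only; link against urlmon.lib. URL and filename are placeholders.
#include <windows.h>
#include <urlmon.h>
#include <iostream>
#pragma comment(lib, "urlmon.lib")

int main() {
    HRESULT hr = URLDownloadToFileA(
        NULL,                   // no controlling IUnknown
        "http://example.com/",  // source URL (placeholder)
        "page.htm",             // destination file (placeholder)
        0,                      // reserved, must be 0
        NULL);                  // no IBindStatusCallback yet
    if (SUCCEEDED(hr))
        std::cout << "Downloaded.\n";
    else
        std::cout << "Download failed.\n";
    return SUCCEEDED(hr) ? 0 : 1;
}
```

Implementing IBindStatusCallback (in particular its OnProgress method) is what lets you report download progress instead of passing NULL.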

Answered by Diomidis Spinellis

You can do it with socket programming, but it's tricky to implement the parts of the protocol needed to reliably fetch a page. Better to use a library, like neon. This is likely to be installed in most Linux distributions. Under FreeBSD use the fetch library.


For parsing the data, because many pages don't use valid XML, you need to implement heuristics, not a real yacc-based parser. You can implement these using regular expressions or a state transition machine. As what you're trying to do involves a lot of trial-and-error you're better off using a scripting language, like Perl. Due to the high network latency you will not see any difference in performance.


Answered by Marius

Try using a library, like Qt, which can read data from across a network and get data out of an XML document. This is an example of how to read an XML feed. You could use the eBay feed, for example.

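A sketch of that approach with Qt's networking and XML classes might look like this (the feed URL is a placeholder, error handling is omitted, and it requires Qt core and network modules):

```cpp
// Sketch: fetch an XML feed with QNetworkAccessManager and walk it with
// QXmlStreamReader, printing every <title> element's text.
#include <QCoreApplication>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QXmlStreamReader>
#include <QDebug>

int main(int argc, char *argv[]) {
    QCoreApplication app(argc, argv);
    QNetworkAccessManager manager;
    QNetworkReply *reply = manager.get(
        QNetworkRequest(QUrl("http://example.com/feed.xml")));  // placeholder
    QObject::connect(reply, &QNetworkReply::finished, [&]() {
        QXmlStreamReader xml(reply);  // QNetworkReply is a QIODevice
        while (!xml.atEnd()) {
            if (xml.readNext() == QXmlStreamReader::StartElement &&
                xml.name() == QLatin1String("title"))
                qDebug() << xml.readElementText();
        }
        reply->deleteLater();
        app.quit();
    });
    return app.exec();
}
```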