C# HttpWebResponse + StreamReader Very Slow

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/901323/

Date: 2020-08-05 22:20:57 · Source: igfitidea

HTTPWebResponse + StreamReader Very Slow

c# · performance · web-crawler · httpwebresponse · streamreader

Asked by Roey

I'm trying to implement a limited web crawler in C# (for a few hundred sites only) using HttpWebResponse.GetResponse() and StreamReader.ReadToEnd(); I also tried using StreamReader.Read() in a loop to build my HTML string.
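A minimal sketch of the fetch pattern described above (the URL is a placeholder, not from the original post):

```csharp
using System;
using System.IO;
using System.Net;

class FetchSketch
{
    static string FetchPage(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            // ReadToEnd() blocks until the entire response body has arrived,
            // so slow servers or proxy detection show up as time spent here.
            return reader.ReadToEnd();
        }
    }
}
```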

I'm only downloading pages of about 5-10 KB.

It's all very slow! For example, the average GetResponse() time is about half a second, while the average StreamReader.ReadToEnd() time is about 5 seconds!

All the sites should be very fast, as they are very close to my location and have fast servers (in Internet Explorer the download takes practically no time), and I am not using any proxy.

My crawler has about 20 threads reading simultaneously from the same site. Could this be causing a problem?

How do I reduce StreamReader.ReadToEnd() times DRASTICALLY?

Answered by Matt Brindley

WebClient's DownloadString is a simple wrapper around HttpWebRequest. Could you try using that temporarily and see if the speed improves? If things get much faster, could you share your code so we can have a look at what may be wrong with it?
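A minimal sketch of the WebClient approach being suggested (the URL is a placeholder):

```csharp
using System.Net;

class WebClientSketch
{
    static string FetchPage(string url)
    {
        using (WebClient client = new WebClient())
        {
            // DownloadString wraps creating the HttpWebRequest and
            // reading the whole response in one call.
            return client.DownloadString(url);
        }
    }
}
```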

EDIT:


It seems HttpWebRequest observes IE's 'max concurrent connections' setting. Are these URLs on the same domain? You could try increasing the connection limit to see if that helps. I found this article about the problem:

By default, you can't perform more than 2-3 async HttpWebRequests (depends on the OS). In order to override it (the easiest way, IMHO), don't forget to add this under the configuration section in the application's config file:

<system.net>
  <connectionManagement>
     <add address="*" maxconnection="65000" />
  </connectionManagement>
</system.net>

Answered by kgriffs

HttpWebRequest may be taking a while to detect your proxy settings. Try adding this to your application config:


<system.net>
  <defaultProxy enabled="false">
    <proxy/>
    <bypasslist/>
    <module/>
  </defaultProxy>
</system.net>

You might also see a slight performance gain from buffering your reads to reduce the number of calls made to the underlying operating system socket:


using (BufferedStream buffer = new BufferedStream(stream))
{
  using (StreamReader reader = new StreamReader(buffer))
  {
    pageContent = reader.ReadToEnd();
  }
}

Answered by No Refunds No Returns

Have you tried ServicePointManager.maxConnections? I usually set it to 200 for things similar to this.

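In code (rather than the config file), the corresponding setting is the ServicePointManager.DefaultConnectionLimit property; a sketch, assuming the 200 value mentioned above:

```csharp
using System.Net;

class ConnectionLimitSketch
{
    static void Configure()
    {
        // Equivalent to maxconnection in <connectionManagement>;
        // set this before the first request is issued.
        ServicePointManager.DefaultConnectionLimit = 200;
    }
}
```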

Answered by vt2

I had the same problem, but worse: response = (HttpWebResponse)webRequest.GetResponse(); in my code was delayed about 10 seconds before any more code ran, and after that the download saturated my connection.

kurt's answer, defaultProxy enabled="false", solved the problem. Now the response is almost instant and I can download any HTTP file at my connection's maximum speed. :) Sorry for my bad English.

Answered by thunder

I found the application-config method did not work for me, but the problem was still due to the proxy settings. My simple request used to take up to 30 seconds; now it takes 1.

public string GetWebData()
{
    string destAddr = "http://mydestination.com";
    var myWebClient = new System.Net.WebClient();

    // An empty WebProxy bypasses the slow automatic proxy detection.
    // Note: IsBypassed() only queries the bypass list and discards its
    // result; assigning the empty proxy is what actually avoids the delay.
    myWebClient.Proxy = new WebProxy();
    return myWebClient.DownloadString(destAddr);
}

Answered by bisand

I had the same problem, but when I set the HttpWebRequest's Proxy property to null, it solved the problem.

UriBuilder ub = new UriBuilder(url);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(ub.Uri);
request.Proxy = null; // skip automatic proxy detection
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

Answered by Yuriy

Thank you all for the answers; they helped me dig in the proper direction. I faced the same performance issue, but the proposed solution of changing the application config file (as I understand it, that solution is for web applications) doesn't fit my needs. My solution is shown below:

HttpWebRequest webRequest = (HttpWebRequest)System.Net.WebRequest.Create(fullUrl);
webRequest.Method = WebRequestMethods.Http.Post;

if (useDefaultProxy)
{
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy;
    webRequest.Credentials = CredentialCache.DefaultCredentials;
}
else
{
    // Disabling the default proxy skips automatic proxy detection entirely.
    System.Net.WebRequest.DefaultWebProxy = null;
    webRequest.Proxy = System.Net.WebRequest.DefaultWebProxy; // i.e. null
}

Answered by Pangamma

Why wouldn't multithreading solve this issue? Multithreading would minimize the network wait times, and since you'd be storing the contents of the buffer in system memory (RAM), there would be no I/O bottleneck from dealing with a filesystem. Thus, your 82 pages that take 82 seconds to download and parse should take about 15 seconds (assuming a 4-core processor). Correct me if I'm missing something.

____ DOWNLOAD THREAD ____
  Download Contents
  Form Stream
  Read Contents
_________________________
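The idea above can be sketched roughly as follows, with each worker keeping its page in memory (the URL list and dictionary are illustrative, not from the original post):

```csharp
using System.Collections.Concurrent;
using System.Net;
using System.Threading.Tasks;

class ParallelFetchSketch
{
    static ConcurrentDictionary<string, string> FetchAll(string[] urls)
    {
        var pages = new ConcurrentDictionary<string, string>();
        // One download per worker; contents stay in RAM, so there is
        // no filesystem I/O bottleneck, only network wait time.
        Parallel.ForEach(urls, url =>
        {
            using (var client = new WebClient())
            {
                pages[url] = client.DownloadString(url);
            }
        });
        return pages;
    }
}
```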