C# HttpClient crawling causes a memory leak
Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me) on StackOverflow.
Original URL: http://stackoverflow.com/questions/14075026/
HttpClient crawling results in memory leak
Asked by Aliostad
I am working on a WebCrawler implementation but am facing a strange memory leak in ASP.NET Web API's HttpClient.
So the cut-down version is here:
[UPDATE 2]
I found the problem and it is not HttpClient that is leaking. See my answer.
[UPDATE 1]
I have added Dispose calls, with no effect:
static void Main(string[] args)
{
    int waiting = 0;
    const int MaxWaiting = 100;
    var httpClient = new HttpClient();
    foreach (var link in File.ReadAllLines("links.txt"))
    {
        while (waiting >= MaxWaiting)
        {
            Thread.Sleep(1000);
            Console.WriteLine("Waiting ...");
        }

        httpClient.GetAsync(link)
            .ContinueWith(t =>
            {
                try
                {
                    var httpResponseMessage = t.Result;
                    if (httpResponseMessage.IsSuccessStatusCode)
                        httpResponseMessage.Content.LoadIntoBufferAsync()
                            .ContinueWith(t2 =>
                            {
                                if (t2.IsFaulted)
                                {
                                    httpResponseMessage.Dispose();
                                    Console.ForegroundColor = ConsoleColor.Magenta;
                                    Console.WriteLine(t2.Exception);
                                }
                                else
                                {
                                    httpResponseMessage.Content.ReadAsStringAsync()
                                        .ContinueWith(t3 =>
                                        {
                                            Interlocked.Decrement(ref waiting);
                                            try
                                            {
                                                Console.ForegroundColor = ConsoleColor.White;
                                                Console.WriteLine(httpResponseMessage.RequestMessage.RequestUri);
                                                string s = t3.Result;
                                            }
                                            catch (Exception ex3)
                                            {
                                                Console.ForegroundColor = ConsoleColor.Yellow;
                                                Console.WriteLine(ex3);
                                            }
                                            httpResponseMessage.Dispose();
                                        });
                                }
                            });
                }
                catch (Exception e)
                {
                    Interlocked.Decrement(ref waiting);
                    Console.ForegroundColor = ConsoleColor.Red;
                    Console.WriteLine(e);
                }
            });

        Interlocked.Increment(ref waiting);
    }

    Console.Read();
}
The file containing links is available here.
This results in memory rising constantly. Memory analysis shows many bytes held, possibly by the AsyncCallback. I have done many memory leak analyses before, but this one seems to be at the HttpClient level.
I am using C# 4.0, so there is no async/await here; only TPL 4.0 is used.
The code above works, but it is not optimised and sometimes throws a tantrum; still, it is enough to reproduce the effect. The point is that I cannot find anything that could cause the memory to be leaked.
Accepted answer by Aliostad
OK, I got to the bottom of this. Thanks to @Tugberk, @Darrel and @youssef for spending time on this.
Basically the initial problem was that I was spawning too many tasks. This started to take its toll, so I had to cut back and keep some state to make sure the number of concurrent tasks was limited. This is basically a big challenge for writing processes that have to use TPL to schedule tasks: we can control the threads in the thread pool, but we also need to control the tasks we are creating, so no amount of async/await will help with this.
I managed to reproduce the leak only a couple of times with this code; other times, after growing, the memory would just suddenly drop. I know that there was a revamp of the GC in 4.5, so perhaps the issue here is that the GC did not kick in enough, although I have been looking at perf counters for GC generation 0, 1 and 2 collections.
So the take-away here is that re-using HttpClient does NOT cause a memory leak.
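A minimal sketch of the two points above (not code from the original answers): share a single HttpClient for the whole process and cap the number of concurrent requests with a SemaphoreSlim instead of the busy-wait counter in the question. It assumes .NET 4.5+/C# 5 for async/await (unlike the question's .NET 4.0 code); the file name "links.txt" and the limit of 100 are placeholders.

using System;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledCrawler
{
    // One shared HttpClient for the whole process; reusing it is not the leak.
    static readonly HttpClient Client = new HttpClient();

    // Caps the number of requests in flight (100 is an arbitrary placeholder).
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(100);

    static void Main()
    {
        Task[] tasks = File.ReadAllLines("links.txt")
                           .Select(CrawlAsync)
                           .ToArray();
        Task.WaitAll(tasks);
    }

    static async Task CrawlAsync(string link)
    {
        await Gate.WaitAsync(); // suspends the task rather than blocking a thread
        try
        {
            using (HttpResponseMessage response = await Client.GetAsync(link))
            {
                if (response.IsSuccessStatusCode)
                {
                    string body = await response.Content.ReadAsStringAsync();
                    Console.WriteLine("{0} ({1} chars)", link, body.Length);
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("{0} failed: {1}", link, ex.Message);
        }
        finally
        {
            Gate.Release();
        }
    }
}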
Answered by tugberk
I'm no good at defining memory issues, but I gave it a try with the following code. It's in .NET 4.5 and uses the async/await feature of C#, too. It seems to keep memory usage around 10 - 15 MB for the entire process (not sure if you see this as better memory usage, though). But if you watch the # Gen 0 Collections, # Gen 1 Collections and # Gen 2 Collections perf counters, they are pretty high with the code below.
If you remove the GC.Collect calls below, it goes back and forth between 30 MB and 50 MB for the entire process. The interesting part is that when I run your code on my 4-core machine, I don't see abnormal memory usage by the process either. I have .NET 4.5 installed on my machine; if you don't, the problem might be related to the CLR internals of .NET 4.0, and I am sure TPL has improved a lot in .NET 4.5 in terms of resource usage.
class Program {

    static void Main(string[] args) {
        ServicePointManager.DefaultConnectionLimit = 500;
        CrawlAsync().ContinueWith(task => Console.WriteLine("***DONE!"));
        Console.ReadLine();
    }

    private static async Task CrawlAsync() {
        int numberOfCores = Environment.ProcessorCount;
        List<string> requestUris = File.ReadAllLines(@"C:\Users\Tugberk\Downloads\links.txt").ToList();
        ConcurrentDictionary<int, Tuple<Task, HttpRequestMessage>> tasks = new ConcurrentDictionary<int, Tuple<Task, HttpRequestMessage>>();
        List<HttpRequestMessage> requestsToDispose = new List<HttpRequestMessage>();
        var httpClient = new HttpClient();

        for (int i = 0; i < numberOfCores; i++) {
            string requestUri = requestUris.First();
            var requestMessage = new HttpRequestMessage(HttpMethod.Get, requestUri);
            Task task = MakeCall(httpClient, requestMessage);
            tasks.AddOrUpdate(task.Id, Tuple.Create(task, requestMessage), (index, t) => t);

            requestUris.RemoveAt(0);
        }

        while (tasks.Values.Count > 0) {
            Task task = await Task.WhenAny(tasks.Values.Select(x => x.Item1));

            Tuple<Task, HttpRequestMessage> removedTask;
            tasks.TryRemove(task.Id, out removedTask);
            removedTask.Item1.Dispose();
            removedTask.Item2.Dispose();

            if (requestUris.Count > 0) {
                var requestUri = requestUris.First();
                var requestMessage = new HttpRequestMessage(HttpMethod.Get, requestUri);
                Task newTask = MakeCall(httpClient, requestMessage);
                tasks.AddOrUpdate(newTask.Id, Tuple.Create(newTask, requestMessage), (index, t) => t);

                requestUris.RemoveAt(0);
            }

            GC.Collect(0);
            GC.Collect(1);
            GC.Collect(2);
        }

        httpClient.Dispose();
    }

    private static async Task MakeCall(HttpClient httpClient, HttpRequestMessage requestMessage) {
        Console.WriteLine("**Starting new request for {0}!", requestMessage.RequestUri);

        var response = await httpClient.SendAsync(requestMessage).ConfigureAwait(false);
        Console.WriteLine("**Request is completed for {0}! Status Code: {1}", requestMessage.RequestUri, response.StatusCode);

        using (response) {
            if (response.IsSuccessStatusCode) {
                using (response.Content) {
                    Console.WriteLine("**Getting the HTML for {0}!", requestMessage.RequestUri);
                    string html = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
                    Console.WriteLine("**Got the HTML for {0}! Length: {1}", requestMessage.RequestUri, html.Length);
                }
            }
            else if (response.Content != null) {
                response.Content.Dispose();
            }
        }
    }
}
Answered by John Peters
A recently reported "memory leak" in our QA environment taught us this:
Consider the TCP Stack
Don't assume the TCP stack can do what is asked of it in whatever time is "thought appropriate for the application". Sure, we can spin off Tasks at will, and we just love async, but....
Watch the TCP Stack
Run NETSTAT when you think you have a memory leak. If you see residual sessions or half-baked states, you may want to rethink your design along the lines of HttpClient reuse and limiting the amount of concurrent work being spun up. You may also need to consider using load balancing across multiple machines.
Half-baked sessions show up in NETSTAT with FIN-WAIT 1 or 2 and TIME-WAIT, or even RST-WAIT 1 and 2. Even "Established" sessions can be virtually dead, just waiting for time-outs to fire.
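As an illustration (my own, not from the original answer), on Windows a quick way to spot these lingering states is a command along the lines of:

netstat -ano | findstr "FIN_WAIT TIME_WAIT CLOSE_WAIT"

The -o flag includes the owning process ID on each line, which makes it easy to confirm which process is holding the sockets.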
The Stack and .NET are most likely not broken
Overloading the stack puts the machine to sleep. Recovery takes time, and 99% of the time the stack will recover. Remember also that .NET will not release resources before their time, and that no user has full control of the GC.
If you kill the app and it takes 5 minutes for NETSTAT to settle down, that's a pretty good sign the system is overwhelmed. It's also a good demonstration of how the stack is independent of the application.
Answered by Elad Nava
The default HttpClient leaks when you use it as a short-lived object and create new HttpClients per request.
Here is a reproduction of this behavior.
As a workaround, I was able to keep using HttpClient as a short-lived object by using the following NuGet package instead of the built-in System.Net.Http assembly: https://www.nuget.org/packages/HttpClient
Not sure what the origin of this package is; however, as soon as I referenced it, the memory leak disappeared. Make sure that you remove the reference to the built-in .NET System.Net.Http library and use the NuGet package instead.