Encoding trouble with HttpWebResponse in C#

Question

Asked by Patrick Desjardins

Here is a snippet of the code:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
WebRequest.DefaultWebProxy = null; // ensure we do not loop by going through the proxy again
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
string charSet = response.CharacterSet;
Encoding encoding;
if (String.IsNullOrEmpty(charSet))
    encoding = Encoding.Default;
else
    encoding = Encoding.GetEncoding(charSet);

StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding);
return resStream.ReadToEnd();

The problem shows up when I test with http://www.google.fr

None of the "é" characters display correctly. I have tried changing ASCII to UTF8 and the output is still wrong. I tested the HTML file in a browser, which displays the text correctly, so I am fairly sure the problem lies in the method I use to download the HTML file.

What should I change?

Update 1: Code and test file changed

Answer 1

Accepted answer by Jon Skeet

Firstly, an easier way of writing that code is to use a StreamReader and ReadToEnd:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
{
    using (Stream resStream = response.GetResponseStream())
    {
        StreamReader reader = new StreamReader(resStream, Encoding.???);
        return reader.ReadToEnd();
    }
}

Then it's "just" a matter of finding the right encoding. How did you create the file? If it was with Notepad then you probably want Encoding.Default, but that's obviously not portable, as it's the default encoding for your PC.

In a well-run web server, the response will indicate the encoding in its headers. Having said that, in some cases the response headers claim one thing and the HTML claims another.
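As a rough sketch of turning the charset name from a response header into an Encoding without crashing on bad input, something like the following could be used. The class name CharsetResolver, the method Resolve, and the fallback parameter are hypothetical names chosen for illustration. They are not part of the original code or of the .NET API.

```csharp
using System;
using System.Text;

public static class CharsetResolver
{
    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, or returns the fallback.
    public static Encoding Resolve(string charSet, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charSet))
            return fallback;

        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charSet.Trim().Trim('"', '\''));
        }
        catch (Exception ex) when (ex is ArgumentException || ex is NotSupportedException)
        {
            // Unknown, misspelled, or unsupported charset name.
            return fallback;
        }
    }
}
```

Which fallback to pass is a judgment call. Encoding.Default depends on the machine, so an explicit choice such as Encoding.UTF8 is easier to reason about.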

Answer 2

Answered by Alex Dubinsky

CharacterSet is "ISO-8859-1" by default if it is not specified in the server's Content-Type header (which is separate from the "charset" meta tag in the HTML). I compare HttpWebResponse.CharacterSet with the charset attribute in the HTML. If they differ, I use the charset specified in the HTML to read the page again, this time with the correct encoding.

See the code:

    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response
    using (StreamReader sr = 
           new StreamReader(objResponse.GetResponseStream(), encoding))
    {
        strWebPage = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();
    }

    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
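        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        // IndexOfAny returns -1 when no terminator follows; keep the header charset in that case
        string RealCharset = CharsetEnd < 0
            ? Charset
            : strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);

        // does the charset declared in the HTML differ from the server header?
        // (charset names are case-insensitive, so "utf-8" and "UTF-8" count as equal)
        if (!String.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))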
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // read the web page again, but with correct encoding this time
            //   create request
            System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL);
            //   get response
            System.Net.HttpWebResponse objResponse2;
            objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse();
            //   read response
            using (StreamReader sr = 
              new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding))
            {
                strWebPage = sr.ReadToEnd();
                // Close and clean up the StreamReader
                sr.Close();
            }
        }
    }
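The IndexOf-based search above only handles the unquoted form charset=utf-8. A regular expression can cover the quoted form (`<meta charset="utf-8">`) as well. The following is a minimal sketch, not a full HTML parser: the class name MetaCharset and the method Find are hypothetical, and the pattern matches the first "charset=" anywhere in the text, including inside scripts or comments.

```csharp
using System.Text.RegularExpressions;

public static class MetaCharset
{
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Returns the first charset name found in the markup, or null if there is none.
    public static string Find(string html)
    {
        if (string.IsNullOrEmpty(html))
            return null;

        Match match = CharsetPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

The returned name can then be compared with the header charset using a case-insensitive comparison before deciding whether to decode the page again.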

Answer 3

Answered by Eddo

In case you don't want to download the page twice, I slightly modified Alex's code using the approach from "How do I put a WebResponse into a memory stream?". Here's the result:

public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);

    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();

        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }

    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();

    // Check real charset meta-tag in HTML
    int CharsetStart = strWebPage.IndexOf("charset=");
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
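        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        // IndexOfAny returns -1 when no terminator follows; keep the header charset in that case
        string RealCharset = CharsetEnd < 0
            ? Charset
            : strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart);

        // does the charset declared in the HTML differ from the server header?
        // (charset names are case-insensitive, so "utf-8" and "UTF-8" count as equal)
        if (!String.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))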
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);

            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);

            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }

    // dispose the first stream reader object
    sr.Close();

    return strWebPage;
}
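The buffering loop above can be shortened with Stream.CopyTo (available since .NET Framework 4). Once the body is held as a byte array it can be decoded as many times as needed. This is a minimal sketch; the class name ResponseBuffer and the method ReadAllBytes are hypothetical names used only for illustration.

```csharp
using System.IO;

public static class ResponseBuffer
{
    // Copies the whole stream into a byte array so the same bytes
    // can be decoded more than once with different encodings.
    public static byte[] ReadAllBytes(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }
}
```

Decoding the same bytes twice also shows why the "é" characters in the question come out wrong. The UTF-8 bytes for "é" are 0xC3 0xA9; Encoding.UTF8.GetString returns "é" for them, while Encoding.GetEncoding("ISO-8859-1").GetString returns the two characters "Ã©".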

Answer 4

Answered by Tony Zeng

I studied the same problem with the help of Wireshark, a great protocol analyser. I think there are some design shortcomings in the HttpWebResponse class. In fact, the whole message entity is downloaded the first time you invoke the GetResponse() method of the HttpWebRequest class, but the framework has no place to hold the data in the HttpWebResponse class or anywhere else, so you have to fetch the response stream a second time.

Answer 5

Answered by Etienne Coumont

There are still some problems when requesting the web page "www.google.fr" with a WebRequest.

I checked the raw request and response with Fiddler. The problem comes from Google's servers. The HTTP response headers are set to charset=ISO-8859-1 and the text itself is encoded with ISO-8859-1, while the HTML says charset=UTF-8. This is inconsistent and leads to encoding errors.

After many tests, I managed to find a workaround. Just add:

myHttpWebRequest.UserAgent = "Mozilla/5.0";

to your code, and the Google response will magically and entirely become UTF-8.

Answer 6

Answered by KinBread

This code downloads the page only once.

String FinalResult = "";
HttpWebRequest Request = (HttpWebRequest)System.Net.WebRequest.Create( URL );
HttpWebResponse Response = (HttpWebResponse)Request.GetResponse();
Stream ResponseStream = Response.GetResponseStream();
StreamReader Reader = new StreamReader( ResponseStream );

bool NeedEncodingCheck = true;

while( true )
{
    string NewLine = Reader.ReadLine(); // may not work for compressed (e.g. gzip) HTML
    if( NewLine == null )
    {
        break;
    }

    FinalResult += NewLine;
    FinalResult += Environment.NewLine;

    if( NeedEncodingCheck )
    {
        int Start = NewLine.IndexOf( "charset=" );
        if( Start > 0 )
        {
            Start += "charset=\"".Length; // skips "charset=" plus an opening quote, so this assumes a quoted value
            int End = NewLine.IndexOfAny( new[] { ' ', '\"', ';' }, Start );

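            // Caveat: the first StreamReader reads ahead into its own buffer, so bytes it has
            // already buffered beyond this line are not seen by the replacement reader.
            Reader = new StreamReader( ResponseStream, Encoding.GetEncoding(
                NewLine.Substring( Start, End - Start ) ) ); // replace Reader with one using the new encoding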

            NeedEncodingCheck = false;
        }
    }
}

Reader.Close();
Response.Close();

Answer 7

Answered by stephenr85

There are some good solutions here, but they all seem to be trying to parse the charset out of the content type string. Here's a solution using System.Net.Mime.ContentType, which should be more reliable, and shorter.

 var client = new System.Net.WebClient();
 var data = client.DownloadData(url);
 var encoding = System.Text.Encoding.Default;
 var contentType = new System.Net.Mime.ContentType(client.ResponseHeaders[HttpResponseHeader.ContentType]);
 if (!String.IsNullOrEmpty(contentType.CharSet))
 {
      encoding = System.Text.Encoding.GetEncoding(contentType.CharSet);
 }
 string result = encoding.GetString(data);
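One thing to watch in the snippet above: the ContentType constructor throws if the header value is null, empty, or malformed, which can happen when a server sends no Content-Type header. A guarded variant might look like the following sketch. The class name ContentTypeCharset and the method CharsetOf are hypothetical names for illustration.

```csharp
using System;
using System.Net.Mime;

public static class ContentTypeCharset
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is missing, malformed, or has no charset.
    public static string CharsetOf(string contentTypeHeader)
    {
        if (string.IsNullOrEmpty(contentTypeHeader))
            return null;

        try
        {
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            // The header value could not be parsed as a content type.
            return null;
        }
    }
}
```

With this helper, the value of client.ResponseHeaders[HttpResponseHeader.ContentType] can be passed in directly, and a null or empty result means the fallback encoding should be used.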
