Reading source code from a webpage in Java

Question

Asked by Ahmad Ali

I am trying to read the source code of a webpage. My Java code is:

import java.net.*;
import java.io.*;
import java.util.*;
import javax.swing.JOptionPane;

class Testing {

    public static void Connect() throws Exception {
        URL url = new URL("http://excite.com/education");
        URLConnection spoof = url.openConnection();

        // Send a browser-like User-Agent with the request.
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
        String strLine = "";

        // Print the response line by line.
        while ((strLine = in.readLine()) != null) {
            System.out.println(strLine);
        }

        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            Connect();
        } catch (Exception e) {
            // Exceptions are silently ignored here.
        }
    }
}

When I compile and run this code, it produces the following output:

?I?%&/m?{J?J??t?$?@??????iG#)?*??eVe]f@????{???{???;?N'????\fdl??J??!????~|?"~?$}?>???????4?????7N?????+??M?N???J?tZfM??G?j????R??!?9??>JgE??Ge[????????W???????8?????? ?|8? ?????????ho????0????|?:--?|?L?Uο????m?zt?n3??l\?w??O^f?G[?CG<?y6K??gM?rg???y?E?y????h?????X???l=??Z?/????(?^O?UU6?????&?6_?@yC}?p?y???lAH????zF#?V?6_??}??)?v=J+?$????G?Y?L?b???wS"?7?y^????Z?m???Y:????J<N_?Y=???U?f???,???y?Q2(J?P!??i????1&F0&?n???x?T??h?Qzw?+????n?)?h??K??2????8g???????A0 ???1I?%????Q?Z????{????????w????x????N???<d?S????%a|4?j??z???k?Bak??k-?c?z?g??z???l>????s^,??5??/B?{????]]????Y?????y{?_l?8g?k???b???"+|??(??M??^[???J?P??_?..???????x?Z?$??????E>????u???E~????{媘???f?e1??QZ,?????f??e?3J?b?^??4??????>??y??;??<?{?l??ZfW S@ {?]??1??Q?????n[?,t??????~?n?S?u#SL??n?^?????????EC??q?/?y???FE?tpm??????e&??oB???z9eY????????P??IK??????????w?N??;?;J?????;?/??5???M???rZ??q??]??C?d???F?nd???}???A5???M?5?.?:??/?_D???3????'?c?Z7??}??(OI),?i????{?<?w???????DZ?e????'q???eY]=???kj??????????\qhrRn???l?o-??.???k??_???oD8??GA?P?r??|$???Pv~Y?:?[q??sH?? <??C????^N?[ v(??S??l?c?C????3???E5&5?V?L?T??????oQr???/???#[f?5?5"????[???t?vm?\??.0?nh????a?WYM ^T?|\,????L?u????B???C?r?????? ???????'?%?{??)?);?fV?]??g,?>?C ?c2???p?4??}H???P??(?%j"?}?&?:?Oh\5I?l?氪??{?/?]?LB?l??2??I"??=??Y?|?>??n???????}?????~?[??'??O?? ??:/?)?Wz?3??lo?.5?k?&????H[ji?????b??????WWy}?5??Q?|f?????]?KjH5??}yNm?????g??????>??'o??泏??<???G?g???>->?xQM?????%<?|????u?.??3???[?[r????;???]4E??6[????]????1???*?8}??n?w????????????|????}|qo|?~u????w|?i?i???Z?`z??????Q}?u??!???w ?O???R9?)?~??g~?w6??{???wd?o??/Z?uUS???l??I^?????>??[?U1?o?_??J??}??@?@?U?/??/????i?7|CZT?(?2b~????c?W?c5'????EeF???0??T??{??W?2????/???O???YJj????K/???>??:'_l?

Every other URL I have tried returns the correct source code, but URLs under this directory ("excite.com/education") produce output like the above.

Can anyone please help?

Thanks in advance.

Answer 1

Accepted answer by Samuel Petrosyan

It works for me.

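import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.zip.GZIPInputStream;

private static String getWebPabeSource(String sURL) throws IOException {
        URL url = new URL(sURL);
        URLConnection urlCon = url.openConnection();
        BufferedReader in = null;

        // If the server says the body is gzip-compressed, decompress it before decoding.
        if (urlCon.getHeaderField("Content-Encoding") != null
                && urlCon.getHeaderField("Content-Encoding").equalsIgnoreCase("gzip")) {
            in = new BufferedReader(new InputStreamReader(new GZIPInputStream(
                    urlCon.getInputStream())));
        } else {
            in = new BufferedReader(new InputStreamReader(
                    urlCon.getInputStream()));
        }

        String inputLine;
        StringBuilder sb = new StringBuilder();

        // readLine() strips line terminators, so add one back per line.
        while ((inputLine = in.readLine()) != null)
            sb.append(inputLine).append('\n');
        in.close();

        return sb.toString();
}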
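
The garbled output in the question is consistent with a gzip-compressed response body being decoded directly as text. A minimal offline sketch of that effect follows, with no network access involved. The class `GzipRoundTrip` and its helpers `gzip` and `gunzip` are hypothetical names chosen for illustration, and UTF-8 is an assumed encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of the original answer.
class GzipRoundTrip {

    // Compress text the way a server sending "Content-Encoding: gzip" would.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Decompress the bytes, then decode them as UTF-8 text.
    static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("<html><body>Hello</body></html>");
        // Decoding the compressed bytes directly yields unreadable text.
        System.out.println(new String(packed, StandardCharsets.ISO_8859_1));
        // Decompressing first recovers the original markup.
        System.out.println(gunzip(packed));
    }
}
```

Running `main` prints one unreadable line followed by the original markup, which mirrors the difference between the question's code and the method above.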

Answer 2

Answer by Dropout

Try reading it this way:

private static String getUrlSource(String url) throws IOException {
        // The local variable needs a name different from the "url" parameter.
        URL pageUrl = new URL(url);
        URLConnection urlConn = pageUrl.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConn.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine);
        in.close();

        return a.toString();
}

Set the encoding to match the one used by the web page. Note this line:

BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConn.getInputStream(), "UTF-8"));
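
One way to pick the encoding rather than hard-coding it is to read the `charset` parameter of the response's `Content-Type` header. The following is only a sketch: `CharsetSniffer` and `fromContentType` are hypothetical names, and falling back to UTF-8 when no usable charset is present is an assumption, not something the answer specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the original answer.
class CharsetSniffer {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or UTF-8 (assumed default) otherwise.
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                // Case-insensitive check for a "charset=" prefix (8 characters).
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the default below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

With this helper, the reader could be built as `new InputStreamReader(urlConn.getInputStream(), CharsetSniffer.fromContentType(urlConn.getContentType()))`. Pages that declare their encoding only in an HTML `meta` tag are not covered by this sketch.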

Answer 3

Answer by Shashi

First, decompress the content with GZIPInputStream. Then wrap the decompressed stream in an InputStreamReader and read it with a BufferedReader.

The example below uses Apache HttpClient 4.1.1.

Maven dependency:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.1.1</version>
</dependency>

Sample code that reads gzip-compressed content:

package com.gzip.simple;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class GZIPFetcher {
    public static void main(String[] args) {
        try {

            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpGet getRequest = new HttpGet("http://excite.com/education");
            getRequest.addHeader("accept", "application/json");

            HttpResponse response = httpClient.execute(getRequest);

            if (response.getStatusLine().getStatusCode() != 200) {
                throw new RuntimeException("Failed : HTTP error code : "
                        + response.getStatusLine().getStatusCode());
            }

            InputStream instream = response.getEntity().getContent();

            // Check whether the content-encoding is gzip or not.
            Header contentEncoding = response
                    .getFirstHeader("Content-Encoding");

            if (contentEncoding != null
                    && contentEncoding.getValue().equalsIgnoreCase("gzip")) {
                instream = new GZIPInputStream(instream);
            }

            BufferedReader in = new BufferedReader(new InputStreamReader(
                    instream));

            String content;
            System.out.println("Output from Server .... \n");
            while ((content = in.readLine()) != null)
                System.out.println(content);

            httpClient.getConnectionManager().shutdown();

        } catch (ClientProtocolException e) {

            e.printStackTrace();

        } catch (IOException e) {

            e.printStackTrace();
        }

    }
}
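
The header check above depends on the server reporting `Content-Encoding` accurately. As a hedged alternative, the stream itself can be inspected: gzip data begins with the two bytes `0x1f 0x8b`. The sketch below peeks at those bytes with a `PushbackInputStream` and wraps the stream only when they match. `GzipSniffer` and `maybeGunzip` are hypothetical names, and this is a swapped-in detection technique, not what the answer's code does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper; not part of the original answer.
class GzipSniffer {

    // Returns a decompressing stream if the data starts with the gzip
    // magic bytes (0x1f 0x8b); otherwise returns the data unchanged.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int read = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (read < 2) {
            int n = in.read(head, read, 2 - read);
            if (n == -1) {
                break;
            }
            read += n;
        }
        // Put the peeked bytes back so the caller sees the full stream.
        if (read > 0) {
            in.unread(head, 0, read);
        }
        boolean gzip = read == 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
        return gzip ? new GZIPInputStream(in) : in;
    }
}
```

In the sample above, this would replace the `Content-Encoding` check with `instream = GzipSniffer.maybeGunzip(instream);`.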
