
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/19293235/

Time: 2020-08-12 15:46:19  Source: igfitidea

Reading source code from a webpage in Java

java html-content-extraction

Asked by Ahmad Ali

I am trying to read the source code of a webpage. My Java code is:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class Testing {
    public static void Connect() throws Exception {
        URL url = new URL("http://excite.com/education");
        URLConnection spoof = url.openConnection();

        // Send a browser-like User-Agent header
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()));
        String strLine = "";

        // Print the response line by line
        while ((strLine = in.readLine()) != null) {
            System.out.println(strLine);
        }

        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            Connect();
        } catch (Exception e) {
            // Report failures instead of swallowing them silently
            e.printStackTrace();
        }
    }
}

When I compile and run this code, it gives the following output:

?I?%&/m?{J?J??t?$?@??????iG#)?*??eVe]f@????{???{???;?N'????\fdl??J??!????~|?"~?$}?>???????4?????7N?????+??M?N???J?tZfM??G?j????R??!?9??>JgE??Ge[????????W???????8?????? ?|8? ?????????ho????0????|?:--?|?L?Uο????m?zt?n3??l\?w??O^f?G[?CG<?y6K??gM?rg???y?E?y????h?????X???l=??Z?/????(?^O?UU6?????&?6_?@yC}?p?y???lAH????zF#?V?6_??}??)?v=J+?$????G?Y?L?b???wS"?7?y^????Z?m???Y:????J<N_?Y=???U?f???,???y?Q2(J?P!??i????1&F0&?n???x?T??h?Qzw?+????n?)?h??K??2????8g???????A0 ???1I?%????Q?Z????{????????w????x????N???<d?S????%a|4?j??z???k?Bak??k-?c?z?g??z???l>????s^,??5??/B?{????]]????Y?????y{?_l?8g?k???b???"+|??(??M??^[???J?P??_?..???????x?Z?$??????E>????u???E~????{媘???f?e1??QZ,?????f??e?3J?b?^??4??????>??y??;??<?{?l??ZfW S@ {?]??1??Q?????n[?,t??????~?n?S?u#SL??n?^?????????EC??q?/?y???FE?tpm??????e&??oB???z9eY????????P??IK??????????w?N??;?;J?????;?/??5???M???rZ??q??]??C?d???F?nd???}???A5???M?5?.?:??/?_D???3????'?c?Z7??}??(OI),?i????{?<?w???????DZ?e????'q???eY]=???kj??????????\qhrRn???l?o-??.???k??_???oD8??GA?P?r??|$???Pv~Y?:?[q??sH?? <??C????^N?[ v(??S??l?c?C????3???E5&5?V?L?T??????oQr???/???#[f?5?5"????[???t?vm?\??.0?nh????a?WYM ^T?|\,????L?u????B???C?r?????? ???????'?%?{??)?);?fV?]??g,?>?C ?c2???p?4??}H???P??(?%j"?}?&?:?Oh\5I?l?氪??{?/?]?LB?l??2??I"??=??Y?|?>??n???????}?????~?[??'??O?? ??:/?)?Wz?3??lo?.5?k?&????H[ji?????b??????WWy}?5??Q?|f?????]?KjH5??}yNm?????g??????>??'o??泏??<???G?g???>->?xQM?????%<?|????u?.??3???[?[r????;???]4E??6[????]????1???*?8}??n?w????????????|????}|qo|?~u????w|?i?i???Z?`z??????Q}?u??!???w ?O???R9?)?~??g~?w6??{???wd?o??/Z?uUS???l??I^?????>??[?U1?o?_??J??}??@?@?U?/??/????i?7|CZT?(?2b~????c?W?c5'????EeF???0??T??{??W?2????/???O???YJj????K/???>??:'_l?


Every other URL returns the correct source code. Only URLs under this directory, i.e. "excite.com/education", produce this garbled output.

Can anyone please help? Thanks in advance.

Accepted answer by Samuel Petrosyan

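This works for me. It checks the Content-Encoding response header and, if the value is gzip, reads the body through a GZIPInputStream: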

// Requires: java.io.*, java.net.URL, java.net.URLConnection,
// java.util.zip.GZIPInputStream
private static String getWebPageSource(String sURL) throws IOException {
        URL url = new URL(sURL);
        URLConnection urlCon = url.openConnection();

        // Decompress the body if the server sent it gzip-encoded
        InputStream stream = urlCon.getInputStream();
        if ("gzip".equalsIgnoreCase(urlCon.getHeaderField("Content-Encoding"))) {
            stream = new GZIPInputStream(stream);
        }

        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                // readLine() strips line terminators, so add one back
                sb.append(inputLine).append('\n');
            }
        }

        return sb.toString();
}
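
The garbled output in the question is consistent with a gzip-compressed body being printed as if it were text. The following is a minimal, offline sketch of that effect, not part of the original answer: the class name GzipRoundTrip and its helper methods are made up for illustration, and the in-memory byte array stands in for a server response.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {

    // Compress text roughly the way a server does for Content-Encoding: gzip.
    // Streams are in memory, so an IOException here would be unexpected.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompress gzip bytes back into UTF-8 text.
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // True if the data starts with the two gzip magic bytes, 0x1f 0x8b.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    public static void main(String[] args) {
        String html = "<html><body>Hello</body></html>";
        byte[] compressed = gzip(html);

        // Decoding the compressed bytes directly as text does not give the page back.
        String garbled = new String(compressed, StandardCharsets.ISO_8859_1);
        System.out.println("Readable without decompression? " + garbled.equals(html));
        System.out.println("Looks gzipped? " + looksGzipped(compressed));
        System.out.println("After GZIPInputStream: " + gunzip(compressed));
    }
}
```

Checking the Content-Encoding header, as the answer does, is the usual way to decide whether to decompress. The magic-byte check here is only a diagnostic aid for confirming what the raw bytes are.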

Answer by Dropout

Try reading it this way:


private static String getUrlSource(String sUrl) throws IOException {
        // The parameter is named sUrl so it does not clash with the local URL variable
        URL url = new URL(sUrl);
        URLConnection urlConn = url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConn.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        while ((inputLine = in.readLine()) != null) {
            // readLine() strips line terminators, so add one back
            a.append(inputLine).append('\n');
        }
        in.close();

        return a.toString();
}

and set the charset to match the web page's encoding. Note this line:

BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConn.getInputStream(), "UTF-8"));
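
Rather than hard-coding "UTF-8", the charset can often be taken from the response's Content-Type header. The sketch below is one possible way to do that and is not from the original answer: the class name CharsetSniffer and the method charsetFromContentType are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetSniffer {

    // Extract the charset parameter from a Content-Type header value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the parameter is missing or names an unknown charset.
    static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the default below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html"));
    }
}
```

With a live connection, the result could be passed to the reader, for example new InputStreamReader(urlConn.getInputStream(), charsetFromContentType(urlConn.getContentType())). Pages that declare their encoding only in an HTML meta tag are not covered by this sketch.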

Answer by Shashi

First, decompress the content with GZIPInputStream. Then wrap the decompressed stream in an InputStreamReader and read it with a BufferedReader.

Use Apache HTTP Client 4.1.1


Maven dependency


<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.1.1</version>
</dependency>   

Sample code to read gzip content:

package com.gzip.simple;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class GZIPFetcher {
    public static void main(String[] args) {
        try {

            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpGet getRequest = new HttpGet("http://excite.com/education");
            getRequest.addHeader("accept", "application/json");

            HttpResponse response = httpClient.execute(getRequest);

            if (response.getStatusLine().getStatusCode() != 200) {
                throw new RuntimeException("Failed : HTTP error code : "
                        + response.getStatusLine().getStatusCode());
            }

            InputStream instream = response.getEntity().getContent();

            // Check whether the content-encoding is gzip or not.
            Header contentEncoding = response
                    .getFirstHeader("Content-Encoding");

            if (contentEncoding != null
                    && contentEncoding.getValue().equalsIgnoreCase("gzip")) {
                instream = new GZIPInputStream(instream);
            }

            BufferedReader in = new BufferedReader(new InputStreamReader(
                    instream));

            String content;
            System.out.println("Output from Server .... \n");
            while ((content = in.readLine()) != null)
                System.out.println(content);

            httpClient.getConnectionManager().shutdown();

        } catch (ClientProtocolException e) {

            e.printStackTrace();

        } catch (IOException e) {

            e.printStackTrace();
        }

    }
}