Get source of website in java
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same CC BY-SA license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/422970/
Asked by Adam Lerman
I would like to use Java to get the source of a (secure) website and then parse that website for the links in it. I have found how to connect to that URL, but how can I easily get just the source, preferably as a DOM Document, so that I can easily get the info I want?
Or is there a better way to connect to an https site, get the source (which I need to do to get a table of data... it's pretty simple)? Those links are files I am going to download.
I wish it were FTP, but these are files stored on my Tivo (I want to programmatically download them to my computer).
Accepted answer by Bernie Perez
You can go low-level and just request it with a socket. In Java it looks like:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.security.cert.X509Certificate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class SSLPageReader {
    // Arg[0] = hostname
    // Arg[1] = path, e.g. /index.html
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        SSLSocket sslsock = (SSLSocket) factory.createSocket(args[0], 443);
        SSLSession session = sslsock.getSession();

        // Verify that the server presented a certificate before talking to it.
        X509Certificate cert;
        try {
            cert = (X509Certificate) session.getPeerCertificates()[0];
        } catch (SSLPeerUnverifiedException e) {
            System.err.println(session.getPeerHost() + " did not present a valid cert.");
            return;
        }

        // Now use the secure socket just like a regular socket to read pages.
        PrintWriter out = new PrintWriter(sslsock.getOutputStream());
        out.write("GET " + args[1] + " HTTP/1.0\r\n\r\n");
        out.flush();

        BufferedReader in = new BufferedReader(new InputStreamReader(sslsock.getInputStream()));
        String line;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile(regExp, Pattern.CASE_INSENSITIVE);

        while ((line = in.readLine()) != null) {
            // Using Oscar's regex to pull out link targets.
            Matcher m = p.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
        sslsock.close();
    }
}
Answered by Peter Hilton
Try HttpUnit or HttpClient. Although the former is ostensibly for writing integration tests, it has a convenient API for programmatically iterating through a web page's links, with something like the following use of WebResponse.getLinks():
WebConversation wc = new WebConversation();
WebResponse resp = wc.getResponse("http://stackoverflow.com/questions/422970/");
WebLink[] links = resp.getLinks();
// Loop over array of links...
Answered by Luca Matteis
Answered by OscarRyz
You could probably get better results from Pete's or sktrdie's options. Here's an additional way if you would like to know how to do it "by hand".
I'm not very good at regex, so in this case it returns the last link in a line. Well, it's a start.
import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Links {
    public static void main(String[] args) throws IOException {
        URL url = new URL(args[0]);
        InputStream is = url.openConnection().getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));

        String line = null;
        String regExp = ".*<a href=\"(.*)\">.*";
        Pattern p = Pattern.compile(regExp, Pattern.CASE_INSENSITIVE);

        while ((line = reader.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
        reader.close();
    }
}
EDIT
Oops, I totally missed the "secure" part. Anyway, I couldn't help it, I had to write this sample :P
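The edit's sample is not preserved in this copy. As a hedged sketch of what the HTTPS version might look like (the class name SecureLinks and the extractLink helper are invented here for illustration): the java.net.URL code above works unchanged for https:// addresses, because openConnection() on an https URL returns a TLS-capable HttpsURLConnection.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SecureLinks {
    // Returns the href target if the line contains a link, or null (Oscar's regex).
    static String extractLink(String line) {
        Matcher m = Pattern.compile(".*<a href=\"(.*)\">.*", Pattern.CASE_INSENSITIVE)
                .matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        // Same stream-reading code as the plain-HTTP version; an https:// URL
        // gives an HttpsURLConnection that handles the TLS handshake for us.
        URL url = new URL(args[0]);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openConnection().getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            String link = extractLink(line);
            if (link != null) {
                System.out.println(link);
            }
        }
        reader.close();
    }
}
```

Run it as e.g. `java SecureLinks https://example.com/` to print the links it finds.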
Answered by matt b
Answered by Lena Schimmel
There are two meanings of source in a web context:
The HTML source: If you request a webpage by URL, you always get the HTML source code. In fact, there is nothing else that you could get from the URL. Webpages are always transmitted in source form; there is no such thing as a compiled webpage. And for what you are trying, this should be enough to fulfill your task.
Script source: If the webpage is dynamically generated, then it is coded in some server-side scripting language (like PHP, Ruby, JSP...). There also exists source code at this level. But over an HTTP connection you are not able to get this kind of source code. This is not a missing feature but entirely by design.
Parsing: That said, you will need to somehow parse the HTML code. If you just need the links, using a regex (as Oscar Reyes showed) will be the most practical approach, but you could also write a simple parser "manually". It would be slower and more code... but it works.
If you want to access the code on a more logical level, parsing it to a DOM would be the way to go. If the code is valid XHTML you can just parse it to an org.w3c.dom.Document and do anything with it. If it is at least valid HTML you might apply some tricks to convert it to XHTML (in some rare cases, replacing <br> with <br/> and changing the doctype is enough) and use it as XML.
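To make the DOM route concrete, here is a minimal sketch using only the JDK's JAXP API (the class name XhtmlLinks, the extractHrefs helper, and the sample markup are made up for illustration); it only works when the input is well-formed XHTML/XML:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XhtmlLinks {
    // Parses an XHTML string into a DOM Document and collects every <a> href.
    static List<String> extractHrefs(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            hrefs.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><a href=\"a.html\">one</a>"
                + "<a href=\"b.html\">two</a></body></html>";
        System.out.println(extractHrefs(xhtml));  // prints [a.html, b.html]
    }
}
```

The parser will throw on malformed markup, which is exactly the limitation described above.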
If it's not valid XML, you would need an HTML DOM parser. I have no idea whether such a thing exists for Java or how well it performs.
Answered by Nas Banov
There is an FTP server that can be installed on your Tivo to allow show downloads; see http://dvrpedia.com/MFS_FTP
The question is formulated differently (how to handle HTTP/HTML in Java), but at the end you mention that what you really want is to download shows. Tivo uses its own unique file system (MFS, Media File System), so it is not easy to mount the drive on another machine; instead it is easier to run an HTTP or FTP server on the Tivo and download from that.
Answered by optimus0127
Try using the jsoup library.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseHTML {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://www.wikipedia.org/").get();
        String text = doc.body().text();
        System.out.print(text);
    }
}
You can download the jsoup library here.
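Since the question is ultimately about extracting the links rather than the page text, jsoup's selector API handles that directly, even on messy HTML. A small sketch (the class name JsoupLinks and the extractHrefs helper are invented here; Jsoup.parse, Document.select with the standard a[href] selector, and Element.attr are regular jsoup calls):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinks {
    // Collects every href in an HTML string; jsoup tolerates invalid markup.
    static List<String> extractHrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"x.html\">x</a> <a href=\"y.html\">y</a></p>";
        System.out.println(extractHrefs(html));  // prints [x.html, y.html]
    }
}
```

For a live page, replacing Jsoup.parse(html) with Jsoup.connect(url).get() gives the same Document, and link.attr("abs:href") resolves relative links against the page URL.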