Java - Searching for data within a website

Question

Asked by AdianDes

I'm new to Java and having some problems.

The main idea is to connect to a website, collect information from it, and store it in an array.

What I want the program to do is search the website, find a keyword, and store what comes after the keyword.

On the front page of daniweb, along the bottom of the site, there is a section called "Tag Cloud" which is filled with tags (short words).

Tag Cloud: "I want to store what is written here"

My idea is to first read in the HTML of the website, then search that file for the keyword followed by the text using Scanner and StringTokenizer, and store the result as an array.

Is there a better or easier way?

Where do you suggest I look for some examples?

Here is what I have so far.

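import java.net.*;
import java.io.*;
import java.util.*;

public class URLReader {

    public static void main(String[] args) throws Exception {

        URL dweb = new URL("http://www.daniweb.com/");
        URLConnection dw = dweb.openConnection();

        // Copy the page's HTML into OutFile.txt, line by line.
        // try-with-resources closes both streams even if reading fails.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(dw.getInputStream()));
             PrintStream out = new PrintStream(new FileOutputStream("OutFile.txt"))) {
            System.out.println("connected to daniweb");
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                out.println(inputLine);
            }
            System.out.println("printed text to outfile");
        }

        // The original read the keyword from a GUI field (txtSearch) that is not
        // defined here; as a stand-in it is taken from the command line.
        String search = args.length > 0 ? args[0] : "Cloud:";
        List<String> found = new ArrayList<String>();

        try (Scanner scan = new Scanner(new File("OutFile.txt"))) {
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                StringTokenizer st = new StringTokenizer(line);
                // still working
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    // Strings must be compared with equals(), not ==.
                    if (word.equals(search)) {
                        // Store every token that follows the keyword on this line.
                        while (st.hasMoreTokens()) {
                            found.add(st.nextToken());
                        }
                    }
                }
            }
        } catch (IOException iox) {
            iox.printStackTrace();
        }

        System.out.println(found);
    }
}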

Any help at all would be very much appreciated!

Answer 1

Accepted answer by Jeff

I recommend jsoup. It will retrieve and parse the page for you.

On daniweb, each tag cloud link has the CSS class tagcloudlink. So you just need to tell jsoup to extract all text in tags that have the class tagcloudlink.

This is off the top of my head plus some help from the jsoup site; I haven't tested it but it should get you started:

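import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
List<String> tags = new ArrayList<String>();
Document doc = Jsoup.connect("http://daniweb.com/").get();  // fetch and parse the page
Elements taglinks = doc.select("a.tagcloudlink");           // <a> elements with class tagcloudlink
for (Element link : taglinks) {
    tags.add(link.text());                                   // visible text of each link
}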
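
If adding a library is not an option, the same extraction can be roughly sketched with only the standard library. This is not the jsoup approach recommended above: it swaps in a regular expression, which is fragile on real-world HTML. It assumes the class attribute is double-quoted and that the link body contains no nested tags. The class name TagCloudRegexSketch, the method extractTags, and the sample markup are all hypothetical and not taken from the real site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCloudRegexSketch {

    // Matches <a ... class="... tagcloudlink ..." ...>text</a> and captures the text.
    // Assumption: double-quoted class attribute, no nested tags inside the link.
    private static final Pattern TAG_LINK = Pattern.compile(
            "<a\\b[^>]*\\bclass=\"[^\"]*\\btagcloudlink\\b[^\"]*\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed link text of every matching anchor, in document order.
    static List<String> extractTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG_LINK.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical markup for illustration only.
        String html = "<div>Tag Cloud: "
                + "<a class=\"tagcloudlink\" href=\"/tags/java\">java</a> "
                + "<a href=\"/about\">about</a> "
                + "<a href=\"/tags/php\" class=\"small tagcloudlink\"> php </a></div>";
        System.out.println(extractTags(html)); // prints [java, php]
    }
}
```

The link without the tagcloudlink class is skipped because the pattern cannot cross the closing > of the opening tag. For anything beyond a quick experiment, a real HTML parser is the safer choice, since attribute order, quoting, and nesting vary between pages.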

Answer 2

Answered by Corv1nus

You could use HTML Parser for this. Another one I've used a lot and like is Jericho HTML Parser.
