Java - Searching For Data within a Website
Original question: http://stackoverflow.com/questions/3565780/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on StackOverflow.
Asked by AdianDes
I'm new to Java and having some problems.
The main idea is to connect to a website, collect information from it, and store it in an array.
What I want the program to do is search the website, find a keyword, and store what comes after the keyword.
On the front page of daniweb, along the bottom of the website, there is a section called "Tag Cloud" which is filled with tags / short words.
Tag Cloud: "I want to store what is written here"
My idea is to first read in the HTML of the website, then search that file for the keyword followed by the text using Scanner and StringTokenizer, then store the results as an array.
Is there a better / easier way?
Where do you suggest I look for some examples?
Here is what I have so far.
import java.io.*;
import java.net.*;
import java.util.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL dweb = new URL("http://www.daniweb.com/");
        URLConnection dw = dweb.openConnection();
        System.out.println("connected to daniweb");

        // Copy the page HTML to a local file, line by line.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(dw.getInputStream()));
             PrintStream out = new PrintStream(new FileOutputStream("OutFile.txt"))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                out.println(inputLine);
            }
            System.out.println("printed text to outfile");
        }

        // Placeholder: the original read the keyword from a GUI field (txtSearch) that was not shown.
        String search = "keyword";

        // Scan the saved file word by word, looking for the keyword.
        try (Scanner scan = new Scanner(new File("OutFile.txt"))) {
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                StringTokenizer st = new StringTokenizer(line);
                while (st.hasMoreTokens()) {
                    String word = st.nextToken();
                    // Strings must be compared with equals(), not ==.
                    if (word.equals(search)) {
                        // still working: store what comes after the keyword
                    }
                }
            }
        }
    }
}
Any help at all would be very much appreciated!
Accepted answer by Jeff
I recommend jsoup. It will retrieve and parse the page for you.
On daniweb, each tag cloud link has the CSS class tagcloudlink. So you just need to tell jsoup to extract all text in tags that have the class tagcloudlink.
This is off the top of my head plus some help from the jsoup site; I haven't tested it but it should get you started:
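import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

List<String> tags = new ArrayList<String>();
Document doc = Jsoup.connect("http://daniweb.com/").get();  // fetches and parses the page; throws IOException
Elements taglinks = doc.select("a.tagcloudlink");           // all <a> elements with class tagcloudlink
for (Element link : taglinks) {
    tags.add(link.text());
}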
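For comparison, here is a rough standard-library-only sketch of the keyword-search idea from the question, in case adding jsoup is not an option. It is not part of the original answer. The class name `TagCloudSketch`, the method `extractTagCloud`, and the sample markup are all made up for illustration, and it assumes each link is written with `class="tagcloudlink"` exactly and contains plain text only. A regular expression is much more fragile than a real HTML parser, so treat this as a fallback rather than a recommendation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCloudSketch {
    // Matches <a ... class="tagcloudlink" ...>text</a> and captures the link text.
    // Assumption: the class attribute is exactly "tagcloudlink" and the text has no nested tags.
    private static final Pattern TAG_LINK = Pattern.compile(
            "<a\\b[^>]*\\bclass=\"tagcloudlink\"[^>]*>([^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of every matching link, in document order.
    static List<String> extractTagCloud(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG_LINK.matcher(html);
        while (m.find()) {
            tags.add(m.group(1).trim());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not copied from the real site.
        String html = "<div>Tag Cloud: "
                + "<a class=\"tagcloudlink\" href=\"/tags/java\">java</a> "
                + "<a class=\"tagcloudlink\" href=\"/tags/php\">php</a> "
                + "<a class=\"other\" href=\"/about\">about</a></div>";
        System.out.println(extractTagCloud(html)); // prints [java, php]
    }
}
```

To run it against the live page, the `html` string could be built from the lines the question's code already reads with `BufferedReader`. The jsoup version above is still the more robust choice, since it copes with attribute order, quoting styles, and nested markup.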
Answer by Corv1nus
You could use HTML Parser for this. Another one I've used a lot and like is Jericho HTML Parser.

