database: How to collect data from a website
Notice: this page reproduces a popular Stack Overflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/8549910/
How to collect data from a website
Asked by Mr. MonoChrome
Preface: I have a broad, college-level knowledge of a handful of languages (C++, VB, C#, Java, and many web languages), so go with whichever you like.
I want to make an Android app that compares numbers, but to do that I need a database. I'm a one-man team, and the numbers get updated biweekly, so I want to grab them from a wiki that gets updated as well.
So my question is: how can I access information from a website using one of the languages above?
Answered by JP Beaudry
As I understand the problem: some entity generates a data set (i.e., numbers) every other week, and you need to download that data set for processing (e.g., sorting).
Ideally, the website maintaining the wiki would provide a service, such as a RESTful interface, for gathering the data easily. If that were the case, I'd go with any language that makes HTTP requests and responses easy to handle and makes your data manipulation easy. As a previous poster said, Java would work well.
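If the wiki did expose such a service, fetching and decoding the numbers might look like the minimal Python sketch below. The JSON shape (a flat object mapping names to numbers), the `User-Agent` string, and both function names are assumptions for illustration, not details from the question.

```python
import json
from urllib.request import Request, urlopen


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download the response body of `url` and decode it as UTF-8."""
    # Hypothetical client name; some sites reject requests without a User-Agent.
    request = Request(url, headers={"User-Agent": "number-compare-app/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_numbers(payload: str) -> dict[str, float]:
    """Turn a JSON object such as {"alpha": 1, "beta": "2.5"} into floats.

    The flat name-to-number layout is an assumption; a real service
    would document its own response format.
    """
    data = json.loads(payload)
    return {name: float(value) for name, value in data.items()}
```

With a real endpoint, `parse_numbers(fetch_text(url))` would return a dictionary that is ready to be stored or compared. Keeping the download and the decoding in separate functions lets the decoding be checked offline against a saved response.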
If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
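As a rough illustration of the first option, parsing the HTML yourself, here is a sketch that uses Python's standard-library `html.parser` rather than Jsoup, so it runs without third-party packages. It assumes the numbers sit in an ordinary HTML table; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> list[list[str]]:
    """Return the table cells found in `html` as a list of rows."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For a page fragment like `<table><tr><td>alpha</td><td>12</td></tr></table>`, `extract_rows` yields one row holding the two cell strings. A dedicated library such as Jsoup copes better with messy real-world markup, which is why the answer recommends it.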
Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.
Answered by pbojinov
http://jsoup.org/ is a great Java tool for accessing content on HTML pages.
Answered by Marcin
Consider https://scraperwiki.com/ - it's a site where users can contribute scrapers. It's free as long as you let your scraper be public. The results of your scraper are exposed as CSV and JSON.
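Once a scraper's results are available as CSV, loading them takes only a few lines. The sketch below assumes a header row and one name column plus one numeric column; the column names and the function name are made up for illustration and do not reflect any particular scraper's output.

```python
import csv
import io


def read_csv_numbers(text: str, key_column: str, value_column: str) -> dict[str, float]:
    """Map each row's key column to its value column, parsed as a float."""
    reader = csv.DictReader(io.StringIO(text))
    return {row[key_column]: float(row[value_column]) for row in reader}
```

For example, `read_csv_numbers("name,value\nalpha,1.5\n", "name", "value")` returns a one-entry dictionary for `alpha`.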
If you don't know what a "scraper" is, google "screen scraping" - it's a long and frustrating tradition for coders, who have dealt with the same problem you have since the beginning of networked computing.
Answered by Sudhir Bastakoti
You could check out: http://web-harvest.sourceforge.net/
Answered by Etienne Perot
For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.

