database: Large public datasets?

Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/381806/

Time: 2020-09-08 07:07:54  Source: igfitidea

Large public datasets?

database performance dataset benchmarking

Asked by

I am looking for some large public datasets, in particular:

  1. Large sample web server logs that have been anonymized.

  2. Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/

Answered by MrGomez

1. Large sample web server logs that have been anonymized.

These work to start with:

There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest-hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.

2. Datasets used for database performance benchmarking.

This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you're trying to find sets of data that you can use to test and benchmark various database systems in real time, using well-defined, normalized relational data that can be used as a set of test cases for determining the most efficient solution that meets your needs.

I don't agree with this approach. Instead of finding a litany of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can home in on a set of canned solutions that you can benchmark on the efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.

Wikipedia provides a terse article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you might use an agnostic data access interface like JDBC and JDBC Benchmark to determine the relative timings of each operation. From here, you can home in on a correct solution.
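As a rough illustration of timing each operation, here is a minimal sketch in Python. It is not JDBC Benchmark: it uses the standard-library sqlite3 module as a stand-in database, and the table name `items`, its columns, and the helpers `time_operation` and `run_benchmark` are all hypothetical names made up for this example.

```python
import sqlite3
import time


def time_operation(label, func, timings):
    """Run func once and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    func()
    timings[label] = time.perf_counter() - start


def run_benchmark(row_count=10_000):
    """Time a few basic operations against an in-memory SQLite table.

    Returns (timings, remaining_rows). The schema is an assumption
    chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
    )
    rows = [(i, "item-{}".format(i), i % 100) for i in range(row_count)]
    timings = {}

    def insert_rows():
        conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
        conn.commit()

    def search():
        conn.execute("SELECT COUNT(*) FROM items WHERE score = 42").fetchone()

    def create_index():
        conn.execute("CREATE INDEX idx_items_score ON items (score)")

    def sort_rows():
        conn.execute("SELECT id FROM items ORDER BY name").fetchall()

    def delete_rows():
        conn.execute("DELETE FROM items WHERE score < 50")
        conn.commit()

    time_operation("insert", insert_rows, timings)
    time_operation("search_unindexed", search, timings)
    time_operation("create_index", create_index, timings)
    time_operation("search_indexed", search, timings)
    time_operation("sort", sort_rows, timings)
    time_operation("delete", delete_rows, timings)

    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return timings, remaining


if __name__ == "__main__":
    results, left = run_benchmark()
    for label, seconds in results.items():
        print("{:<18} {:.6f} s".format(label, seconds))
    print("rows remaining:", left)
```

A single run like this is noisy; repeating each operation several times and comparing medians across candidate systems would give more trustworthy relative timings.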

In short, go to the research first to determine database guarantees. Once a set of candidate solutions has been identified, you can select amongst those by testing (or otherwise determining) the constant time performance of each desired operation.

Answered by caesar0301

Based on Quora answers and my personal collections from my studies, an awesome-public-datasets repository was created and is actively updated on GitHub:

Below is a snapshot version of this list. For the newest list, please visit GitHub:

This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.

Climate

Economics

Finance

Biology

Physics

Healthcare

  • EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
  • Gapminder: http://www.gapminder.org/data/
  • Medicare Data File: http://go.cms.gov/19xxPN4

GeoSpace

Transportation

Government

Data Challenges

Machine Learning

Natural Language

Image Processing

Time Series

Social Sciences

Complex Networks

Computer Networks

Data SEs

Public Domains

Complementary Collections

Answered by Gene De Lisa

Here are several. Have fun.

http://archive.ics.uci.edu/ml/

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://books.google.com/ngrams/

http://linkeddata.org/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://reddit.com/r/datasets

http://timetric.com/public-data/

http://www2.jpl.nasa.gov/srtm

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://www.factual.com/

http://www.freebase.com/

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.imdb.com/interfaces

http://dbpedia.org

Answered by Jason S

Just a thought:

Answered by Carter Medlin

Google Fusion Tables has a few.

http://tables.googlelabs.com/

Answered by kemiller2002

Well, for the web server logs, you could always just generate them in the format you need. If you are going to test code against them, they will have to be tailored to the fields you want to store/parse.
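To make the idea of generating logs concrete, here is a minimal Python sketch that emits synthetic lines in the Common Log Format. The request paths, status codes, size range, and the function name `generate_log_lines` are all assumptions picked for illustration; the IP addresses come from the 192.0.2.0/24 documentation range, so no real clients appear in the output.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical request paths and status codes; adjust to the fields you parse.
PATHS = ["/", "/index.html", "/about", "/api/items", "/static/site.css"]
STATUSES = [200, 200, 200, 304, 404, 500]


def generate_log_lines(count, seed=0):
    """Yield `count` synthetic Common Log Format lines.

    A seeded random generator keeps the output reproducible between runs.
    """
    rng = random.Random(seed)
    current = datetime(2020, 1, 1, tzinfo=timezone.utc)
    for _ in range(count):
        # Advance the clock by a small random gap between requests.
        current += timedelta(seconds=rng.randint(0, 30))
        ip = "192.0.2.{}".format(rng.randint(1, 254))
        timestamp = current.strftime("%d/%b/%Y:%H:%M:%S %z")
        path = rng.choice(PATHS)
        status = rng.choice(STATUSES)
        size = rng.randint(200, 50_000)
        yield '{} - - [{}] "GET {} HTTP/1.1" {} {}'.format(
            ip, timestamp, path, status, size
        )


if __name__ == "__main__":
    for line in generate_log_lines(5):
        print(line)
```

If your parser expects a different layout (for example, extra referrer and user-agent fields), only the final format string and the field generators need to change.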

For the datasets used for database performance benchmarking, you'll probably want to look at a tool that can generate data for you. Red Gate has a great one for not too much money.
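If a commercial generator is more than you need, a small script can produce test data too. The following Python sketch is not the Red Gate tool and does not imitate it; it simply writes reproducible synthetic rows to CSV, which most databases can bulk-load. The column names, the region list, and the functions `generate_rows` and `write_csv` are hypothetical.

```python
import csv
import io
import random

# Hypothetical schema, chosen only for illustration.
FIELDS = ["customer_id", "name", "region", "balance_cents"]
REGIONS = ["north", "south", "east", "west"]


def generate_rows(count, seed=0):
    """Yield `count` synthetic customer rows as dictionaries."""
    rng = random.Random(seed)
    for customer_id in range(1, count + 1):
        yield {
            "customer_id": customer_id,
            "name": "customer-{:06d}".format(customer_id),
            "region": rng.choice(REGIONS),
            "balance_cents": rng.randint(0, 1_000_000),
        }


def write_csv(rows, stream):
    """Write rows to a text stream as CSV and return the number written."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for row in rows:
        writer.writerow(row)
        written += 1
    return written


if __name__ == "__main__":
    buffer = io.StringIO()
    total = write_csv(generate_rows(1000), buffer)
    print("wrote", total, "rows")
```

Because rows are produced lazily, the same functions can stream millions of rows to a file on disk without holding them all in memory.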

Answered by viper

Datasets available here as well.

Answered by Rishi

Kaggle.com frequently has data mining challenges. The datasets cover a wide range of fields, from healthcare provider data to credit history information. Perhaps something there is what you're after.

Answered by Brian Risk

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.

Answered by zeroDivisible

Well, this one is new and there is a challenge behind it:

Million song dataset challenge
