database: Large public datasets?

Question

Asked by

I am looking for some large public datasets, in particular:

Large sample web server logs that have been anonymized.
Datasets used for database performance benchmarking.

Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/

Answer 1

Answered by MrGomez

1. Large sample web server logs that have been anonymized.

These work to start with:

UCI Machine Learning Repository

There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest-hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.

2. Datasets used for database performance benchmarking.

This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you're trying to find sets of data that you can use to test and benchmark various database systems in real time, using well-defined, normalized relational data that can be used as a set of test cases for determining the most efficient solution that meets your needs.

I don't agree with this approach. Instead of finding a litany of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can home in on a set of canned solutions that you can benchmark on the efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.

Wikipedia provides a terse article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you might use an agnostic data access interface like JDBC and JDBC Benchmark to determine the relative timings of each operation. From here, you can home in on a correct solution.

In short, go to the research first to determine database guarantees. Once a set of candidate solutions has been identified, you can select among those by testing (or otherwise determining) the constant time performance of each desired operation.
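
As a rough illustration of timing individual operations, here is a minimal sketch in Python. It uses the standard library's sqlite3 module with an in-memory database instead of JDBC, and the table name, column names, and row count are assumptions made up for this example. It is not a rigorous benchmark, only a way to see the relative cost of each step on one system.

```python
import sqlite3
import time


def time_operation(label, func):
    """Run func once and return (label, elapsed seconds)."""
    start = time.perf_counter()
    func()
    return label, time.perf_counter() - start


def run_benchmark(row_count=10_000):
    """Time insertion, indexing, search, and deletion on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, chosen only for illustration.
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)"
    )
    rows = [(i, f"item-{i}", i % 100) for i in range(row_count)]

    def insert():
        with conn:  # commits the transaction on success
            conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)

    def index():
        conn.execute("CREATE INDEX idx_score ON items (score)")

    def search():
        conn.execute("SELECT COUNT(*) FROM items WHERE score = 42").fetchone()

    def delete():
        with conn:
            conn.execute("DELETE FROM items WHERE score = 7")

    operations = [
        ("insert", insert),
        ("index", index),
        ("search", search),
        ("delete", delete),
    ]
    timings = dict(time_operation(label, func) for label, func in operations)
    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return timings, remaining


if __name__ == "__main__":
    timings, remaining = run_benchmark()
    for label, seconds in timings.items():
        print(f"{label:>8}: {seconds * 1000:.2f} ms")
    print("rows remaining:", remaining)
```

Running the same set of operations against each candidate system, with the same data, gives the relative timings mentioned above. Absolute numbers from a single run like this should be treated with caution.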

Answer 2

Answered by caesar0301

Based on Quora answers and my personal collections in my studies, an awesome-public-datasets repository was created on GitHub and is actively updated:

Below is a snapshot version of this list. For the latest list, please visit GitHub:

This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.

Climate

Australian Weather: http://www.bom.gov.au/climate/dwo/
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Global climate data since 1929: http://www.tutiempo.net/en/Climate
NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
NOAA climate datasets: http://ncdc.noaa.gov/data-access/quick-links
WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html

Economics

American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
Internet Product Code Database: http://www.upcdatabase.com/
World Bank: http://data.worldbank.org/indicator

Finance

CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/
OSU Financial data: http://fisher.osu.edu/fin/osudata.htm
Quandl: http://www.quandl.com/
St Louis Federal: http://research.stlouisfed.org/fred2/
Yahoo Finance: http://finance.yahoo.com/

Biology

CRCNS: http://crcns.org/data-sets
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu/
UniGene: http://www.ncbi.nlm.nih.gov/unigene

Physics

NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

Healthcare

EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
Gapminder: http://www.gapminder.org/data/
Medicare Data File: http://go.cms.gov/19xxPN4

GeoSpace

EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
Factual Global Location Data: http://www.factual.com/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/

Transportation

Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
OpenFlights (airport, airline and route data): http://openflights.org/data.html
RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm

Government

Archive-it: https://www.archive-it.org/explore?show=Collections
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
Chicago: https://data.cityofchicago.org/
FDA: https://open.fda.gov/index.html
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London Datastore, U.K: http://data.london.gov.uk/dataset
New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
NYC betanyc: http://betanyc.us/
NYC Open Data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Data sets: http://datasf.org/
The World Bank: http://wdronline.worldbank.org/
U.K. Government Data: http://data.gov.uk/data
U.S. Census Bureau: http://www.census.gov/data.html
U.S. Federal Government Agencies: http://www.data.gov/metric
U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
U.S. Open Government: http://www.data.gov/open-gov/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
United Nations: http://data.un.org/
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm

Data Challenges

Challenges in Machine Learning: http://www.chalearn.org/
ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
Kaggle Competition Data: http://www.kaggle.com/
KDD Cup by Tencent 2012: https://www.kddcup2012.org/
Netflix Prize: http://www.netflixprize.com/leaderboard
Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge

Machine Learning

eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
IMDb database: http://www.imdb.com/interfaces
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
Machine Learning Data Set Repository: http://mldata.org/
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets: http://datahub.io/dataset/movielens
RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Natural Language

40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
Hansards: http://www.isi.edu/natural-language/download/hansard/
Machine Translation: http://statmt.org/wmt11/translation-task.html#download
SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
WordNet: http://wordnet.princeton.edu/wordnet/download/

Image Processing

2GB of photos of cats: http://bit.do/UJZZ
Face Recognition Benchmark: http://www.face-rec.org/databases/
ImageNet: http://www.image-net.org/

Time Series

Time Series data Library: https://datamarket.com/data/list/?q=provider:tsdl
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/

Social Sciences

China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
CMU Enron Email: http://www.cs.cmu.edu/~enron/
Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
General Social Survey (GSS): http://www3.norc.org/GSS+Website/
GetGlue (users rating TV shows): http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz
GitHub Archive: http://www.githubarchive.org/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
Universities Worldwide: http://univ.cc/
UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
Youtube Graph (2007,2008): http://netsg.cs.sfu.ca/youtubedata/

Complex Networks

CrossRef DOI URLs: https://archive.org/details/doi-urls
DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
NBER Patent Citations: http://nber.org/patents/
NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
The Koblenz Network Collection: http://konect.uni-koblenz.de/
UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
UFL sparse matrix collection: http://www.cise.ufl.edu/research/sparse/matrices/
UNIMI Large Web Graph: http://law.di.unimi.it/datasets.php
WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html

Computer Networks

3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
53.5B Web clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
CAIDA Internet Datasets: http://www.caida.org/data/overview/
ClueWeb09: http://lemurproject.org/clueweb09/
ClueWeb12: http://lemurproject.org/clueweb12/
CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
Dartmouth CRAWDAD Wireless datasets: http://crawdad.cs.dartmouth.edu/
OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
UCSD Network Telescope: http://www.caida.org/projects/network_telescope/

Data SEs

Academic Torrents: http://academictorrents.com/
Datahub.io: http://datahub.io/dataset
DataMarket: https://datamarket.com/data/list/?q=all
Harvard Dataverse: http://thedata.harvard.edu/dvn/
Statista: http://www.statista.com/
Freebase: http://www.freebase.com/

Public Domains

Amazon: http://aws.amazon.com/datasets
Archive.org Datasets: https://archive.org/details/datasets
CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
Data360: http://www.data360.org/index.aspx
Datamob.org: http://datamob.org/datasets
Google: http://www.google.com/publicdata/directory
infochimps: http://www.infochimps.com/
KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
Numbrary: http://numbrary.com/
RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
StatSci.org: http://www.statsci.org/datasets.html
The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
UFO Reports: http://www.nuforc.org/webreports.html
Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php

Complementary Collections

DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
Inside-r: http://www.inside-r.org/howto/finding-data-internet
Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/

Answer 3

Answered by Gene De Lisa

Here are several. Have fun.

http://archive.ics.uci.edu/ml/

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1

http://crawdad.org/

http://data.austintexas.gov

http://data.cityofchicago.org

http://data.govloop.com

http://data.gov.uk/

http://data.medicare.gov

http://data.seattle.gov

http://data.sfgov.org

http://data.sunlightlabs.com

https://datamarket.azure.com/

http://ftp.ncbi.nih.gov/

http://gettingpastgo.socrata.com

http://books.google.com/ngrams/

http://linkeddata.org/

http://medihal.archives-ouvertes.fr

http://public.resource.org/

http://rechercheisidore.fr

http://reddit.com/r/datasets

http://timetric.com/public-data/

http://www2.jpl.nasa.gov/srtm

http://www.bls.gov/

http://www.crunchbase.com/

http://www.dartmouthatlas.org/

http://www.data.gov/

http://www.datakc.org

http://www.factual.com/

http://www.freebase.com/

http://www.infochimps.com

http://www.kaggle.com/

http://build.kiva.org/

http://www.imdb.com/interfaces

http://dbpedia.org

Answer 4

Answered by Jason S

Just a thought:

USGS Geographic Names database
USDA PLANTS checklist
Any one of the many state GIS repositories e.g. NH's GRANIT

Answer 5

Answered by Carter Medlin

Google Fusion Tables has a few.

http://tables.googlelabs.com/

Answer 6

Answered by kemiller2002

Well, for the web server logs you could always just generate them in the format you need. If you are going to test code against them, they will have to be tailored to the fields you want to store or parse anyway.
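
A minimal sketch of what generating such logs could look like, in Python. It assumes the Common Log Format as the target; the paths, status codes, starting date, and the helper name generate_log_lines are all made up for illustration and should be swapped for whatever fields you actually need to parse.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical request mix, chosen only for illustration.
PATHS = ["/", "/index.html", "/about", "/api/items", "/static/app.css"]
METHODS = ["GET", "GET", "GET", "POST"]
STATUSES = [200, 200, 200, 304, 404, 500]


def generate_log_lines(count, seed=0):
    """Return `count` synthetic access-log lines in Common Log Format."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    current = datetime(2024, 1, 1, tzinfo=timezone.utc)
    lines = []
    for _ in range(count):
        current += timedelta(seconds=rng.randint(0, 5))
        # 192.0.2.0/24 is reserved for documentation, so no real client is implied.
        ip = f"192.0.2.{rng.randint(1, 254)}"
        timestamp = current.strftime("%d/%b/%Y:%H:%M:%S %z")
        request = f"{rng.choice(METHODS)} {rng.choice(PATHS)} HTTP/1.1"
        status = rng.choice(STATUSES)
        size = rng.randint(200, 5000)
        lines.append(f'{ip} - - [{timestamp}] "{request}" {status} {size}')
    return lines


if __name__ == "__main__":
    for line in generate_log_lines(5):
        print(line)
```

Because the data is synthetic, nothing needs to be anonymized, and the volume can be scaled up simply by raising the count.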

For the datasets used for database performance benchmarking, you'll probably want to look at a tool that can generate data for you. Red Gate has a great one for not too much money.
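
If a commercial generator is more than you need, a small script can produce related tables of synthetic rows. The sketch below is only an illustration in Python: the customers/orders schema, the column names, and the helper names generate_tables and to_csv are assumptions, not taken from any particular tool.

```python
import csv
import io
import random


def generate_tables(customer_count, orders_per_customer, seed=0):
    """Build synthetic customers and orders linked by customer id."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    customers = [
        (cid, f"customer-{cid}", rng.choice(["NH", "CA", "TX", "NY"]))
        for cid in range(1, customer_count + 1)
    ]
    orders = []
    order_id = 1
    for cid, _name, _state in customers:
        for _ in range(orders_per_customer):
            amount_cents = rng.randint(100, 100_000)
            orders.append((order_id, cid, amount_cents))
            order_id += 1
    return customers, orders


def to_csv(header, rows):
    """Serialize rows to CSV text that a database bulk loader can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    customers, orders = generate_tables(customer_count=3, orders_per_customer=2)
    print(to_csv(["customer_id", "name", "state"], customers))
    print(to_csv(["order_id", "customer_id", "amount_cents"], orders))
```

Every order refers to an existing customer, so joins and foreign-key constraints can be exercised, and the two counts control how large the benchmark data gets.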

Answer 7

Answered by viper

Datasets are available here as well.

Answer 8

Answered by Rishi

Kaggle.com frequently has data mining challenges. The datasets cover a wide range of fields, from healthcare provider data to credit history information. Perhaps something there is what you're after.

Answer 9

Answered by Brian Risk

http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.

Answer 10

Answered by zeroDivisible

Well, this one is new and there is a challenge behind it:

Million song dataset challenge
