database: Large public datasets?
Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/381806/
Large public datasets?
Asked by
I am looking for some large public datasets, in particular:
Large sample web server logs that have been anonymized.
Datasets used for database performance benchmarking.
Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets at: http://aws.amazon.com/publicdatasets/
Answer by MrGomez
1. Large sample web server logs that have been anonymized.
These work to start with:
There are many, many more data sets available than these (see the gamut of other answers), but this is the lowest hanging fruit that meets your original criteria. As a bonus, they have a contact link if you have specific needs they may know of.
2. Datasets used for database performance benchmarking.
This sounds like a misnomer, because you're asking for empirical data sets that describe well-defined algorithmic problems. Specifically, it sounds like you're trying to find sets of data that you can use to test and benchmark various database systems in real time, using well-defined, normalized relational data that can be used as a set of test cases for determining the most efficient solution that meets your needs.
I don't agree with this approach. Instead of finding a litany of database systems and their canned implementations, it's far better to explore the algorithmic guarantees of these systems as your first port of call. Once you've determined the algorithmic constraints that meet your needs, you can hone in on a set of canned solutions that you can benchmark on efficiency of, for example, indexing, sorting, searching, insertion, deletion, and retrieval.
Wikipedia provides a terse article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you might use an agnostic data access interface like JDBC and JDBC Benchmark to determine the relative timings of each operation. From here, you can hone in on a correct solution.
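As a rough illustration of timing each operation, here is a minimal Python sketch. It is not JDBC or JDBC Benchmark: it swaps in Python's built-in sqlite3 module with an in-memory database purely as an assumption for the example, and the `items` table plus the `time_operation` and `run_benchmark` names are hypothetical. Single runs like this are noisy, so treat the numbers as relative hints at best.

```python
import sqlite3
import time


def time_operation(label, func, timings):
    """Run func once and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    func()
    timings[label] = time.perf_counter() - start


def run_benchmark(row_count=10_000):
    """Time insert, index build, lookup, sort, and delete on a toy table."""
    conn = sqlite3.connect(":memory:")  # assumption: in-memory SQLite
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, value INTEGER)")
    # Hypothetical test data: a deterministic scramble of the ids.
    rows = [(i, (i * 7919) % row_count) for i in range(row_count)]
    timings = {}

    def insert():
        conn.executemany("INSERT INTO items (id, value) VALUES (?, ?)", rows)
        conn.commit()

    def index():
        conn.execute("CREATE INDEX idx_value ON items (value)")

    def search():
        conn.execute("SELECT id FROM items WHERE value = ?", (42,)).fetchall()

    def sort():
        conn.execute("SELECT id FROM items ORDER BY value").fetchall()

    def delete():
        conn.execute("DELETE FROM items WHERE value < ?", (row_count // 2,))
        conn.commit()

    steps = [("insert", insert), ("index", index), ("search", search),
             ("sort", sort), ("delete", delete)]
    for label, func in steps:
        time_operation(label, func, timings)

    remaining = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    conn.close()
    return timings, remaining


if __name__ == "__main__":
    results, _ = run_benchmark()
    for name, seconds in results.items():
        print(f"{name:>6}: {seconds:.6f} s")
```

Running the same steps against each candidate system, with the same data, gives the relative timings per operation.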
In short, go to the research first to determine database guarantees. Once a set of candidate solutions has been identified, you can select amongst those by testing (or otherwise determining) the constant time performance of each desired operation.
Answer by caesar0301
Based on Quora answers and my personal collections from my studies, an awesome-public-datasets repository was created on GitHub and is actively updated:
Below is a snapshot version of this list. For the newest list, please visit GitHub:
This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. This list comes from https://github.com/caesar0301/awesome-public-datasets.
Climate
- Australian Weather: http://www.bom.gov.au/climate/dwo/
- Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
- Global climate data since 1929: http://www.tutiempo.net/en/Climate
- NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
- NOAA climate datasets: http://ncdc.noaa.gov/data-access/quick-links
- WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html
Economics
- American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
- EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
- Internet Product Code Database: http://www.upcdatabase.com/
- World bank: http://data.worldbank.org/indicator
Finance
- CBOE Futures Exchange: http://cfe.cboe.com/Data/
- Google Finance: https://www.google.com/finance
- Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
- NASDAQ: https://data.nasdaq.com/
- OANDA: http://www.oanda.com/
- OSU Financial data: http://fisher.osu.edu/fin/osudata.htm
- Quandl: http://www.quandl.com/
- St Louis Federal: http://research.stlouisfed.org/fred2/
- Yahoo Finance: http://finance.yahoo.com/
Biology
- CRCNS: http://crcns.org/data-sets
- Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
- Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
- MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
- NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
- Protein structure: http://www.infobiotic.net/PSPbenchmarks/
- Public Gene Data: http://www.pubgene.org/
- Stanford Microarray Data: http://smd.stanford.edu/
- UniGene: http://www.ncbi.nlm.nih.gov/unigene
Physics
Healthcare
- EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
- Gapminder: http://www.gapminder.org/data/
- Medicare Data File: http://go.cms.gov/19xxPN4
GeoSpace
- EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
- Factual Global Location Data: http://www.factual.com/
- Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Transportation
- Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
- Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
- Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
- Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
- Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
- NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
- OpenFlights (airport, airline and route data): http://openflights.org/data.html
- RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
- RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
- Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
- U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm
Government
- Archive-it: https://www.archive-it.org/explore?show=Collections
- Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
- Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
- Chicago: https://data.cityofchicago.org/
- FDA: https://open.fda.gov/index.html
- Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
- Guardian world governments: http://www.guardian.co.uk/world-government-data
- HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
- London Datastore, U.K: http://data.london.gov.uk/dataset
- New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
- NYC betanyc: http://betanyc.us/
- NYC Open Data: http://nycplatform.socrata.com/
- OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
- RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
- San Francisco Data sets: http://datasf.org/
- The World Bank: http://wdronline.worldbank.org/
- U.K. Government Data: http://data.gov.uk/data
- U.S. Census Bureau: http://www.census.gov/data.html
- U.S. Federal Government Agencies: http://www.data.gov/metric
- U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
- U.S. Open Government: http://www.data.gov/open-gov/
- UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
- United Nations: http://data.un.org/
- US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
Data Challenges
- Challenges in Machine Learning: http://www.chalearn.org/
- ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
- Kaggle Competition Data: http://www.kaggle.com/
- KDD Cup by Tencent 2012: https://www.kddcup2012.org/
- Netflix Prize: http://www.netflixprize.com/leaderboard
- Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge
Machine Learning
- eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
- IMDb database: http://www.imdb.com/interfaces
- Keel Repository: http://sci2s.ugr.es/keel/datasets.php
- Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
- Machine Learning Data Set Repository: http://mldata.org/
- Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
- More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
- MovieLens Data Sets: http://datahub.io/dataset/movielens
- RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
- Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
- SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
- UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
- University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
- Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
Natural Language
- 40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
- ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
- ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
- Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
- Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
- Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
- Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
- Hansards: http://www.isi.edu/natural-language/download/hansard/
- Machine Translation: http://statmt.org/wmt11/translation-task.html#download
- SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
- USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
- WordNet: http://wordnet.princeton.edu/wordnet/download/
Image Processing
- 2GB of photos of cats: http://bit.do/UJZZ
- Face Recognition Benchmark: http://www.face-rec.org/databases/
- ImageNet: http://www.image-net.org/
Time Series
- Time Series Data Library: https://datamarket.com/data/list/?q=provider:tsdl
- UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
Social Sciences
- China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
- CMU Enron Email: http://www.cs.cmu.edu/~enron/
- Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
- Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
- Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
- Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
- General Social Survey (GSS): http://www3.norc.org/GSS+Website/
- GetGlue (users rating TV shows): http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz
- GitHub Archive: http://www.githubarchive.org/
- ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
- Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
- PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
- Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
- SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
- Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
- Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
- UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
- UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
- UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
- Universities Worldwide: http://univ.cc/
- UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
- Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
- Youtube Graph (2007,2008): http://netsg.cs.sfu.ca/youtubedata/
Complex Networks
- CrossRef DOI URLs: https://archive.org/details/doi-urls
- DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
- NBER Patent Citations: http://nber.org/patents/
- NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
- Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
- PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
- Scopus Citation Database: http://www.elsevier.com/online-tools/scopus
- Stanford GraphBase (Steven Skiena): http://www3.cs.stonybrook.edu/~algorith/implement/graphbase/implement.shtml
- Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
- The Koblenz Network Collection: http://konect.uni-koblenz.de/
- UCI Network Data Repository: http://networkdata.ics.uci.edu/resources.php
- UFL sparse matrix collection: http://www.cise.ufl.edu/research/sparse/matrices/
- UNIMI Large Web Graph: http://law.di.unimi.it/datasets.php
- WSU Graph Database: http://www.eecs.wsu.edu/mgd/gdb.html
Computer Networks
- 3.5B Web Pages: http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us
- 53.5B Web clicks: http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
- CAIDA Internet Datasets: http://www.caida.org/data/overview/
- ClueWeb09: http://lemurproject.org/clueweb09/
- ClueWeb12: http://lemurproject.org/clueweb12/
- CommonCrawl Web Data: http://commoncrawl.org/the-data/get-started/
- Dartmouth CRAWDAD Wireless datasets: http://crawdad.cs.dartmouth.edu/
- OpenMobileData (MobiPerf): https://console.developers.google.com/storage/openmobiledata_public/
- UCSD Network Telescope: http://www.caida.org/projects/network_telescope/
Data SEs
- Academic Torrents: http://academictorrents.com/
- Datahub.io: http://datahub.io/dataset
- DataMarket: https://datamarket.com/data/list/?q=all
- Harvard Dataverse: http://thedata.harvard.edu/dvn/
- Statista: http://www.statista.com/
- Freebase: http://www.freebase.com/
Public Domains
- Amazon: http://aws.amazon.com/datasets
- Archive.org Datasets: https://archive.org/details/datasets
- CMU JASA data archive: http://lib.stat.cmu.edu/jasadata/
- CMU StatLab collections: http://lib.stat.cmu.edu/datasets/
- Data360: http://www.data360.org/index.aspx
- Datamob.org: http://datamob.org/datasets
- Google: http://www.google.com/publicdata/directory
- infochimps: http://www.infochimps.com/
- KDNuggets Data Collections: http://www.kdnuggets.com/datasets/index.html
- Numbrary: http://numbrary.com/
- RevolutionAnalytics Collection: http://www.revolutionanalytics.com/subscriptions/datasets/
- Sample R data sets: http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
- Stats4Stem R data sets: http://www.stats4stem.org/data-sets.html
- StatSci.org: http://www.statsci.org/datasets.html
- The Washington Post List: http://www.washingtonpost.com/wp-srv/metro/data/datapost.html
- UCLA SOCR data collection: http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data
- UFO Reports: http://www.nuforc.org/webreports.html
- Wikileaks 911 pager intercepts: http://911.wikileaks.org/files/index.html
- Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php
Complementary Collections
- DataWrangling: http://www.datawrangling.com/some-datasets-available-on-the-web
- Inside-r: http://www.inside-r.org/howto/finding-data-internet
- Quora: http://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
- RS Collection 100+ : http://rs.io/2014/05/29/list-of-data-sets.html
- StaTrek: http://hsiamin.com/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/
Answer by Gene De Lisa
Here are several. Have fun.
http://archive.ics.uci.edu/ml/
http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1
http://gettingpastgo.socrata.com
http://books.google.com/ngrams/
http://medihal.archives-ouvertes.fr
http://timetric.com/public-data/
http://www.dartmouthatlas.org/
http://www.imdb.com/interfaces
Answer by Jason S
Just a thought:
- USGS Geographic Names database
- USDA PLANTS checklist
- Any one of the many state GIS repositories e.g. NH's GRANIT
Answer by Carter Medlin
Google Fusion Tables has a few.
Answer by kemiller2002
Well, for the web server logs, you could always just generate them in the format you need. If you are going to test code against them, they will have to be tailored to the fields you want to store or parse anyway.
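Generating the logs yourself might look like the sketch below. It assumes the Apache-style Common Log Format; the paths, status codes, and the `generate_log_lines` helper are made up for illustration, and the client addresses come from the 192.0.2.0/24 documentation range, so nothing real needs anonymizing.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical request paths and status codes; adjust to the fields you parse.
PATHS = ["/", "/index.html", "/about", "/api/items", "/static/site.css"]
STATUSES = [200, 200, 200, 304, 404, 500]


def generate_log_lines(count, seed=0):
    """Return `count` synthetic access-log lines in Common Log Format."""
    rng = random.Random(seed)  # fixed seed makes the output repeatable
    when = datetime(2020, 1, 1, tzinfo=timezone.utc)
    lines = []
    for _ in range(count):
        # 192.0.2.0/24 is reserved for documentation, so no real client appears.
        ip = f"192.0.2.{rng.randint(1, 254)}"
        when += timedelta(seconds=rng.randint(0, 5))
        stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
        request = f"GET {rng.choice(PATHS)} HTTP/1.1"
        status = rng.choice(STATUSES)
        size = rng.randint(200, 5000)
        lines.append(f'{ip} - - [{stamp}] "{request}" {status} {size}')
    return lines


if __name__ == "__main__":
    for line in generate_log_lines(5):
        print(line)
```

If your parser expects extra fields such as referrer or user agent, append them to the format string in the same way.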
For the datasets used for database performance benchmarking, you'll probably want to look at a tool that can generate data for you. Red Gate has a great one for not too much money.
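If a commercial generator is more than you need, a few lines of script can also produce repeatable test rows. The sketch below is only an assumption about what such data might look like: a hypothetical `customers` table loaded into SQLite through Python's sqlite3 module. It is not how Red Gate's tool works.

```python
import random
import sqlite3

# Hypothetical category values for the generated rows.
REGIONS = ["north", "south", "east", "west"]


def build_test_table(conn, row_count, seed=0):
    """Create and fill a hypothetical `customers` table with synthetic rows."""
    rng = random.Random(seed)  # fixed seed so every run loads the same data
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, region TEXT, balance REAL)"
    )
    rows = (
        (i, f"customer_{i:06d}", rng.choice(REGIONS),
         round(rng.uniform(0, 10_000), 2))
        for i in range(1, row_count + 1)
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_test_table(connection, 1000)
    total = connection.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    print(f"loaded {total} rows")
```

Scale `row_count` up until the table is large enough to stress the operations you care about.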
Answer by Rishi
Kaggle.com frequently has data mining challenges. The datasets cover a wide range of fields, from healthcare provider data to credit history information. Perhaps something there is what you're after.
Answer by Brian Risk
http://Quandl.com has over 10 million data sets gleaned from all over the internet. The great thing about this resource is that it gives a single way to access all of the data. The site has a free Excel plug-in, and there are libraries for R, Python, Ruby, etc.
Answer by zeroDivisible
Well, this one is new and there is a challenge behind it: