Are there any example data sets for Python?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/16579407/

Date: 2020-08-18 23:04:56  Source: igfitidea

python dataset

Asked by a different ben

For quick testing, debugging, creating portable examples, and benchmarking, R has a large number of data sets available (in the base R datasets package). The command library(help="datasets") at the R prompt describes nearly 100 historical datasets, each of which has an associated description and metadata.

Is there anything like this for Python?

Accepted answer by Aziz Alto

You can use the rpy2 package to access all R datasets from Python.

Set up the interface:

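>>> from rpy2.robjects import r, pandas2ri
>>> def data(name):
...     # Convert the named R object to a pandas DataFrame.
...     # (rpy2 2.x API; rpy2 3.x renamed ri2py to rpy2py.)
...     return pandas2ri.ri2py(r[name])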

Then call data() with the name of any available dataset (just as in R):

>>> df = data('iris')
>>> df.describe()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

To see a list of the available datasets with a description of each:

>>> print(r.data())


Note: rpy2 requires an R installation with the R_HOME variable set, and pandas must be installed as well.


UPDATE:

I just created PyDataset, a simple module that makes loading a dataset from Python as easy as it is in R (it does not require an R installation, only pandas).

To start using it, install the module:

$ pip install pydataset

Then load any dataset you wish (currently around 757 datasets are available):

from pydataset import data

titanic = data('titanic')

Answered by a different ben

Following Joran's comment, I've since found the statsmodels module, which provides its own datasets package. The online documentation shows an example of how to import datasets available in R:

import statsmodels.api as sm

# Fetch the "Duncan" dataset from the R package "car" (needs network access).
duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
print(duncan_prestige.__doc__)

Answered by Ekta

PyMVPA is another module that provides easy access to datasets. See the link below.

>>> from mvpa2.tutorial_suite import *
>>> data = [[  1,  1, -1],
...         [  2,  0,  0],
...         [  3,  1,  1],
...         [  4,  0, -1]]
>>> ds = Dataset(data)
>>> ds.shape
(4, 3)
>>> len(ds)
4

Example from the link

http://www.pymvpa.org/tutorial_datasets.html

Answered by tmthydvnprt

There are also datasets available from the Scikit-Learn library.

from sklearn import datasets

There are multiple datasets within this package. Some of the toy datasets are listed here (note that load_boston has since been removed, as of scikit-learn 1.2):

load_boston()          Load and return the boston house-prices dataset (regression).
load_iris()            Load and return the iris dataset (classification).
load_diabetes()        Load and return the diabetes dataset (regression).
load_digits([n_class]) Load and return the digits dataset (classification).
load_linnerud()        Load and return the linnerud dataset (multivariate regression).
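
The set of loaders differs between scikit-learn versions, so it can help to ask the installed package what it offers. A minimal sketch, assuming scikit-learn is installed; filtering on the `load_` prefix is a convenience chosen here, not an official listing API:

```python
from sklearn import datasets

# Collect every public name in sklearn.datasets that starts with "load_".
# This covers the toy dataset loaders, but also helpers such as load_files.
loaders = sorted(name for name in dir(datasets) if name.startswith("load_"))

for name in loaders:
    print(name)
```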

Answered by sedeh

Concretely, using @tmthydvnprt's example:

from sklearn import datasets
iris = datasets.load_iris()

The actual data can then be accessed as iris.data.

http://scikit-learn.org/stable/datasets/

Running Python 3.5

Answered by joelostblom

I originally posted this over at the related question Sample Datasets in Pandas, but since it is relevant outside pandas I am including it here as well.

There are many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

Seaborn

The brilliant plotting package seaborn has several built-in sample data sets.

import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Pandas

If you do not want to import seaborn, but still want to access its sample data sets, you can read the seaborn sample data from its URL:

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Note that sns.load_dataset() modifies the column types of sample data sets that contain categorical columns, so the result might differ from what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
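
When reading a CSV directly, you can cast a text column to a categorical type yourself. A minimal sketch, assuming pandas is installed; the inline CSV is a three-row stand-in for a downloaded file, and whether the resulting dtype matches what sns.load_dataset() produces for a particular data set is something to check rather than assume:

```python
import io

import pandas as pd

# Three rows standing in for a downloaded CSV (inline so the sketch runs offline).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# Plain read: the text column is parsed as ordinary strings.
plain = pd.read_csv(io.StringIO(csv_text))

# Explicit dtype: the text column is parsed as a pandas categorical.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"species": "category"})

print(plain["species"].dtype)
print(typed["species"].dtype)
```

The same `dtype` argument works when the first argument to pd.read_csv() is a URL.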

R sample datasets

Since any dataset can be read via pd.read_csv(), it is possible to access all R's sample data sets by copying the URLs from this R data set repository.

Additional ways of loading the R sample data sets include statsmodels

import statsmodels.api as sm

iris = sm.datasets.get_rdataset('iris').data

and PyDataset

from pydataset import data

iris = data('iris')

scikit-learn

scikit-learn returns sample data as numpy arrays rather than a pandas data frame.

from sklearn.datasets import load_iris

iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
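
If you would rather work with a data frame, the four attributes listed above can be assembled into one. A minimal sketch, assuming pandas and scikit-learn are installed; the column name "species" is chosen here for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a DataFrame from the feature matrix and its column names.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets back to their species names as a categorical column.
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(iris_df.head())
```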

Quilt

Quilt is a dataset manager created to facilitate dataset management. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

# In your terminal
$ pip install quilt
$ quilt install uciml/iris

After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

import quilt.data.uciml.iris as ir

iris = ir.tables.iris()
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Quilt also supports dataset versioning and includes a short description of each dataset.
