Notice: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow URL: http://stackoverflow.com/questions/28417293/

Time: 2020-08-19 03:14:59  Source: igfitidea

Sample datasets in Pandas

Tags: python, pandas

Asked by canyon289

When using R it's handy to load "practice" datasets using

data(iris)

or

data(mtcars)

Is there something similar for Pandas? I know I can load using any other method, just curious if there's anything builtin.

Accepted answer by joelostblom

Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

Seaborn

The brilliant plotting package seaborn has several built-in sample data sets.

import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Pandas

If you do not want to import seaborn, but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so the result might differ from what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
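
To make the difference in column types concrete, here is a minimal sketch of restoring a categorical column after a plain pd.read_csv(). The inline table is a made-up stand-in (not real seaborn data), and the category order in day_order is an assumption chosen for illustration; it is not a description of what sns.load_dataset() does internally.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV fetched from a URL (illustrative values only).
csv_text = """day,size
Sun,2
Thur,4
Sat,3
Fri,2
"""

df = pd.read_csv(io.StringIO(csv_text))
# read_csv leaves text columns as plain strings, not categoricals.
print(df["day"].dtype)

# Assumed category order for this example; adjust to the data set in use.
day_order = ["Thur", "Fri", "Sat", "Sun"]
df["day"] = pd.Categorical(df["day"], categories=day_order)

# Sorting a categorical column follows the category order, not the alphabet.
sorted_days = df.sort_values("day")["day"].tolist()
print(sorted_days)
```

Checking df.dtypes after loading a data set both ways is a quick way to see whether any columns need this kind of conversion.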

R sample datasets

Since any dataset can be read via pd.read_csv(), it is possible to access all R's sample data sets by copying the URLs from this R data set repository.

Additional ways of loading the R sample data sets include statsmodels

import statsmodels.api as sm

iris = sm.datasets.get_rdataset('iris').data

and PyDataset

from pydataset import data

iris = data('iris')

scikit-learn

scikit-learn returns sample data as numpy arrays rather than a pandas data frame.

from sklearn.datasets import load_iris

iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
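
If a data frame is wanted anyway, the arrays described in the comments above can be assembled by hand. This is one possible sketch rather than the only way to do it; the column name "species" is my own choice and does not come from scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Numerical measurements, labelled with scikit-learn's own feature names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets to their names; "species" is an assumed column name.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())
```

Recent scikit-learn releases also accept load_iris(as_frame=True), which returns pandas objects directly and may be simpler when it is available.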

Quilt

Quilt is a dataset manager created to facilitate dataset management. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

# In your terminal
$ pip install quilt
$ quilt install uciml/iris

After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

import quilt.data.uciml.iris as ir

iris = ir.tables.iris()
   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Quilt also supports dataset versioning and includes a short description of each dataset.

Answered by unutbu

The rpy2 module is made for this:

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

r['iris'].head()

yields

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa
4           4.6          3.1           1.5          0.2  setosa
5           5.0          3.6           1.4          0.2  setosa


Up to pandas 0.19 you could use pandas' own rpy interface:

import pandas.rpy.common as rcom
iris = rcom.load_data('iris')
print(iris.head())

yields

   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           5.1          3.5           1.4          0.2  setosa
2           4.9          3.0           1.4          0.2  setosa
3           4.7          3.2           1.3          0.2  setosa
4           4.6          3.1           1.5          0.2  setosa
5           5.0          3.6           1.4          0.2  setosa


rpy2 also provides a way to convert R objects into Python objects:

import pandas as pd
import rpy2.robjects as ro
import rpy2.robjects.conversion as conversion
from rpy2.robjects import pandas2ri
pandas2ri.activate()

R = ro.r

# ri2py is the rpy2 2.x name; rpy2 3.x renamed it to rpy2py.
df = conversion.ri2py(R['mtcars'])
print(df.head())

yields

    mpg  cyl  disp   hp  drat     wt   qsec  vs  am  gear  carb
0  21.0    6   160  110  3.90  2.620  16.46   0   1     4     4
1  21.0    6   160  110  3.90  2.875  17.02   0   1     4     4
2  22.8    4   108   93  3.85  2.320  18.61   1   1     4     1
3  21.4    6   258  110  3.08  3.215  19.44   1   0     3     1
4  18.7    8   360  175  3.15  3.440  17.02   0   0     3     2

Answered by andrewwowens

Any publicly available .csv file can be loaded into pandas extremely quickly using its URL. Here is an example using the iris dataset originally from the UCI archive.

import pandas as pd

file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)
df.head()

The output is the first five rows of the .csv file you just loaded from the given URL:

>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

A memorable short URL for the same file is https://j.mp/iriscsv. This short URL will work only if it's typed and not if it's copy-pasted.
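
Since reading from a URL needs a network connection every time, one option is to keep a local copy after the first download. The sketch below is only an illustration: read_csv_cached and the cache path are hypothetical names of my own, not part of pandas.

```python
from pathlib import Path

import pandas as pd


def read_csv_cached(url, cache_path):
    """Hypothetical helper: read a CSV from cache_path if it exists,
    otherwise download it from url and save a local copy."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Local copy found, so no network access is needed.
        return pd.read_csv(cache_path)
    df = pd.read_csv(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(cache_path, index=False)
    return df


# Example usage (the cache location is an arbitrary choice):
# url = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
# df = read_csv_cached(url, "data/iris.csv")
```

Note that the cached copy goes through to_csv() and read_csv() again, so column types are inferred afresh on each load, just as they are when reading from the URL.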
