Sample datasets in Pandas
Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/28417293/
Asked by canyon289
When using R it's handy to load "practice" datasets using
data(iris)
or
data(mtcars)
Is there something similar for Pandas? I know I can load the data using any other method; I am just curious whether anything is built in.
Accepted answer by joelostblom
Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.
Seaborn
The brilliant plotting package seaborn has several built-in sample data sets.
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Pandas
If you do not want to import seaborn, but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so the result might differ when you read the file from the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
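The dtype difference described above can be sketched without network access. The rows below are made up for illustration and only stand in for a file such as tips.csv; the column names are assumptions as well.

```python
import io

import pandas as pd

# Made-up rows standing in for a downloaded CSV (illustration only).
csv_text = """total_bill,tip,day
16.99,1.01,Sun
10.34,1.66,Sun
21.01,3.50,Sat
"""

df = pd.read_csv(io.StringIO(csv_text))
# read_csv cannot know that `day` is categorical, so it loads plain strings.
print(df["day"].dtype)

# Convert explicitly to get a categorical column.
df["day"] = df["day"].astype("category")
print(df["day"].dtype)  # category
```

Passing dtype={"day": "category"} to pd.read_csv() should give the same result in a single step.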
R sample datasets
Since any dataset can be read via pd.read_csv(), it is possible to access all of R's sample data sets by copying the URLs from this R data set repository.
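As a rough sketch of that approach: the link to the repository did not survive on this page, so the base URL below is an assumption (the Rdatasets collection) and should be checked before use. The helper names rdataset_url and load_rdataset are hypothetical.

```python
import pandas as pd

# Assumed base URL of the Rdatasets CSV collection; verify before relying on it.
RDATASETS_BASE = "https://vincentarelbundock.github.io/Rdatasets/csv"


def rdataset_url(name, package="datasets"):
    """Build the CSV URL for data set `name` from the R package `package`."""
    return f"{RDATASETS_BASE}/{package}/{name}.csv"


def load_rdataset(name, package="datasets"):
    """Download one R sample data set as a DataFrame (needs network access)."""
    # Treat the first CSV column as the index; it is assumed to hold R's row names.
    return pd.read_csv(rdataset_url(name, package), index_col=0)


print(rdataset_url("mtcars"))
# mtcars = load_rdataset("mtcars")  # uncomment when online
```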
Additional ways of loading the R sample data sets include statsmodels
import statsmodels.api as sm
iris = sm.datasets.get_rdataset('iris').data
and PyDataset
from pydataset import data
iris = data('iris')
scikit-learn
scikit-learn returns sample data as numpy arrays rather than a pandas data frame.
from sklearn.datasets import load_iris
iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
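If a data frame is preferred, the arrays listed above can be combined by hand. This is a minimal sketch, not part of the original answer; the column name species is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Numerical measurements become columns named after `feature_names`.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets back to their names as a categorical column.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())
```

Recent scikit-learn versions also accept load_iris(as_frame=True), which returns pandas objects directly.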
Quilt
Quilt is a dataset manager created to facilitate dataset management. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:
# In your terminal
$ pip install quilt
$ quilt install uciml/iris
After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.
import quilt.data.uciml.iris as ir
iris = ir.tables.iris()
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Quilt also supports dataset versioning and includes a short description of each dataset.
Answer by unutbu
The rpy2 module is made for this:
from rpy2.robjects import r, pandas2ri
pandas2ri.activate()
r['iris'].head()
yields
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
Up to pandas 0.19 you could use pandas' own rpy interface:
import pandas.rpy.common as rcom
iris = rcom.load_data('iris')
print(iris.head())
yields
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
rpy2 also provides a way to convert R objects into Python objects:
import pandas as pd
import rpy2.robjects as ro
import rpy2.robjects.conversion as conversion
from rpy2.robjects import pandas2ri
pandas2ri.activate()
R = ro.r
df = conversion.ri2py(R['mtcars'])
print(df.head())
yields
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Answer by andrewwowens
Any publicly available .csv file can be loaded into pandas extremely quickly using its URL. Here is an example using the iris dataset originally from the UCI archive.
import pandas as pd
file_name = "https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv"
df = pd.read_csv(file_name)
df.head()
The output is the first five rows of the .csv file you just loaded from the given URL.
>>> df.head()
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
A memorable short URL for the same is https://j?.mp/iriscsv. This short URL will work only if it's typed and not if it's copy-pasted.