How to convert a Scikit-learn dataset to a Pandas dataset?

Notice: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/38105539/

Time: 2020-09-08 15:43:41  Source: igfitidea

Tags: dataset, scikit-learn, pandas

Asked by SANBI samples

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?

from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
print(type(data))
data1 = pd. # Is there a Pandas method to accomplish this?

Answered by TomDLT

Manually, you can use the pd.DataFrame constructor, giving it a numpy array (data) and a list of the column names (columns). To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...] (note the []):

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# save load_iris() sklearn dataset to iris
# if you'd like to check dataset type use: type(load_iris())
# if you'd like to view list of attributes use: dir(load_iris())
iris = load_iris()

# np.c_ concatenates arrays column-wise (along the second axis),
# here joining the iris['data'] and iris['target'] arrays.
# For the columns argument, append a one-element list to iris['feature_names'];
# the name is arbitrary (the original dataset would probably call it 'Species').
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])
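
One side effect of the np.c_ approach is worth knowing: the features are floats, so the concatenated array is all float and the class codes come out as 0.0, 1.0, 2.0. A minimal sketch, assuming you would rather keep integer codes, casts the column back afterwards (the name iris_df is only illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Concatenating float features with integer targets yields an all-float array.
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

# Assumption: integer class codes are wanted, so cast the target column back.
iris_df['target'] = iris_df['target'].astype(int)

print(iris_df.dtypes)
```

Assigning the target as a separate column instead of concatenating avoids the cast entirely.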

Answered by justin4480

from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()

This tutorial may be of interest: http://www.neural.cz/dataset-exploration-boston-house-pricing.html

Answered by Nilav Baran Ghosh

TomDLT's solution is not generic enough for all the datasets in scikit-learn. For example, it does not work for the Boston housing dataset. I propose a different, more universal solution that also does not need numpy.

from sklearn import datasets
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2, so this requires an older release.
boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data, columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)
df_boston.head()

As a general function:

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df_boston = sklearn_to_df(datasets.load_boston())
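
The function above relies on the Bunch exposing data, feature_names and a one-dimensional target. The following is a sketch of the same idea on a dataset that still ships with current scikit-learn releases. The helper name bunch_to_df and its target_name parameter are hypothetical, and wrapping feature_names in list() is a precaution for loaders that return the names as a numpy array (load_breast_cancer does):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def bunch_to_df(bunch, target_name='target'):
    """Hypothetical helper: build one DataFrame from a Bunch's data and target."""
    # list(...) accepts feature_names given either as a list or as a numpy array.
    df = pd.DataFrame(bunch.data, columns=list(bunch.feature_names))
    # Assumes a one-dimensional target; multi-output datasets need other handling.
    df[target_name] = bunch.target
    return df


cancer_df = bunch_to_df(load_breast_cancer())
print(cancer_df.shape)
```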

Answered by daguito81

As an alternative that I found much easier to wrap my head around:

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
df.head()

Basically, instead of concatenating from the get-go, make a DataFrame from the feature matrix and then add the target column under whatever name you like (here df['target']), taking the values from data['target'].

Answered by Victor Tong

It took me 2 hours to figure this out:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# iris.keys() lists the attributes available on the Bunch

df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

The last line recovers the species names as a categorical column in the DataFrame.

Answered by Mukul Aggarwal

This works for me.

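# Assumes np, pd and iris = load_iris() are defined as in the answers above.
# list(...) works whether feature_names is a list or a numpy array.
dataFrame = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=list(iris['feature_names']) + ['target'])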

Answered by Paul Rougieux

Otherwise use seaborn data sets, which are actual pandas data frames:

import seaborn
iris = seaborn.load_dataset("iris")
type(iris)
# <class 'pandas.core.frame.DataFrame'>

Compare with scikit-learn data sets:

from sklearn import datasets
iris = datasets.load_iris()
type(iris)
# <class 'sklearn.utils.Bunch'>
dir(iris)
# ['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

Answered by student

Another way to combine the features and the target variable is np.column_stack:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(np.column_stack((data.data, data.target)), columns = data.feature_names+['target'])
print(df.head())

Result:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0

If you need the string label for the target, you can convert target_names to a dictionary, use replace, and add a new column:

df['label'] = df.target.replace(dict(enumerate(data.target_names)))
print(df.head())

Result:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target   label
0                5.1               3.5                1.4               0.2     0.0  setosa
1                4.9               3.0                1.4               0.2     0.0  setosa
2                4.7               3.2                1.3               0.2     0.0  setosa
3                4.6               3.1                1.5               0.2     0.0  setosa
4                5.0               3.6                1.4               0.2     0.0  setosa
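
If the target is kept as integer codes instead of being stacked into a float array, the labels can also be looked up by indexing target_names directly. This is a sketch of an alternative, not part of the answer above; it assumes target_names is a numpy array, which is the case for load_iris:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # integer class codes

# target_names is a numpy array, so the integer codes can index it directly.
df['label'] = data.target_names[data.target]
print(df[['target', 'label']].head())
```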

Answered by dheinz

As of version 0.23, you can directly return a DataFrame using the as_frame argument. For example, loading the iris data set:

from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
df = iris.data

As I understand the provisional release notes, this works for the breast_cancer, diabetes, digits, iris, linnerud, wine and california_housing data sets.
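
Note that with as_frame=True, iris.data holds only the feature columns. A short sketch of the related attributes, assuming scikit-learn 0.23 or later with pandas installed (the variable names features, target and full are illustrative):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

features = iris.data   # DataFrame with the four feature columns
target = iris.target   # Series with the class codes
full = iris.frame      # DataFrame with the features and the target together

print(full.shape)
```

If the goal is one DataFrame with everything, iris.frame is likely what you want.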

Answered by Dhiraj Himani

Basically, what you need is the "data", which you already have in the scikit-learn Bunch; the only other thing you need is the "target" (the value to predict), which is also in the Bunch.

So you just need to combine these two to make the data complete:

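from sklearn.datasets import load_breast_cancer
import pandas as pd

cancer = load_breast_cancer()  # assumption: the answer does not show how `cancer` was loaded
data_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target_df = pd.DataFrame(cancer.target, columns=['target'])
final_df = data_df.join(target_df)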