How to convert a Scikit-learn dataset to a Pandas dataset?
Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
Original question: http://stackoverflow.com/questions/38105539/
Asked by SANBI samples
How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
print(type(data))
data1 = pd. # Is there a Pandas method to accomplish this?
Answered by TomDLT
Manually, you can use the pd.DataFrame constructor, giving a numpy array (data) and a list of the names of the columns (columns).
To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...] (note the []):
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
# save load_iris() sklearn dataset to iris
# if you'd like to check dataset type use: type(load_iris())
# if you'd like to view list of attributes use: dir(load_iris())
iris = load_iris()
# np.c_ is the numpy concatenate function
# which is used to concat iris['data'] and iris['target'] arrays
# for pandas column argument: concat iris['feature_names'] list
# and string list (in this case one string); you can make this anything you'd like..
# the original dataset would probably call this ['Species']
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])
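To see what np.c_ does here, the following minimal sketch uses small made-up arrays rather than the iris data: a 2-D feature matrix and a 1-D target vector are joined column-wise, so the target ends up as the last column. The array values and the column names f1 and f2 are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for iris['data'] (2-D) and iris['target'] (1-D); values are made up
features = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])
target = np.array([0, 0, 1])

# np.c_ treats the 1-D target as a column and appends it to the right of features
combined = np.c_[features, target]
print(combined.shape)   # (3, 3)

toy_df = pd.DataFrame(combined, columns=['f1', 'f2', 'target'])
print(toy_df.dtypes)    # every column is float64, including target
```

Because the result is a single numpy array, the integer target is upcast to float, which is why the target shows up as 0.0 rather than 0 in DataFrames built this way.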
Answered by justin4480
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()
This tutorial may be of interest: http://www.neural.cz/dataset-exploration-boston-house-pricing.html
Answered by Nilav Baran Ghosh
TomDLT's solution is not generic enough for all the datasets in scikit-learn. For example, it does not work for the Boston housing dataset. I propose a different solution which is more universal. There is also no need to use numpy.
from sklearn import datasets
import pandas as pd
boston_data = datasets.load_boston()
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)
df_boston.head()
As a general function:
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df_boston = sklearn_to_df(datasets.load_boston())
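Note that load_boston was removed in scikit-learn 1.2, so the calls to it fail on recent versions. A minimal sketch of the same helper applied to another bundled dataset, load_wine (chosen here only for illustration), could look like this:

```python
import pandas as pd
from sklearn.datasets import load_wine

def sklearn_to_df(sklearn_dataset):
    # Features become the columns; the target is appended as one extra column
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df_wine = sklearn_to_df(load_wine())
print(df_wine.shape)  # (178, 14)
```

The helper relies only on the data, feature_names and target attributes, so it should work for any loader whose Bunch exposes those three.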
Answered by daguito81
Just as an alternative that I could wrap my head around much more easily:
from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
df.head()
Basically, instead of concatenating from the get-go, just make a data frame from the feature matrix, then add the target column by assigning the dataset's target values to df['whatevername'].
Answered by Victor Tong
Took me 2 hours to figure this out:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
##iris.keys()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
The last line gets the species names back into the pandas DataFrame.
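As a small illustration of how pd.Categorical.from_codes maps integer codes to names, here is a sketch with made-up codes standing in for iris.target and a hand-written name list standing in for iris.target_names:

```python
import pandas as pd

# Made-up integer codes and class names, standing in for iris.target and iris.target_names
codes = [0, 2, 1, 0]
names = ['setosa', 'versicolor', 'virginica']

# Each code is used as a position in names
species = pd.Categorical.from_codes(codes, names)
print(list(species))  # ['setosa', 'virginica', 'versicolor', 'setosa']
```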
Answered by Mukul Aggarwal
This works for me.
# list(...) handles feature_names stored either as a list or as a numpy array
dataFrame = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=list(iris['feature_names']) + ['target'])
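The type of feature_names is not the same for every loader: as far as I know, it is a plain list for iris but a numpy array for breast_cancer, so .tolist() works for one and list concatenation with + works for the other. A minimal sketch that builds the column list in a way that should cope with both (the choice of these two loaders is only for illustration):

```python
from sklearn.datasets import load_iris, load_breast_cancer

for loader in (load_iris, load_breast_cancer):
    bunch = loader()
    # list() accepts both a plain list and a numpy array of names
    columns = list(bunch.feature_names) + ['target']
    print(type(bunch.feature_names).__name__, len(columns))
```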
Answered by Paul Rougieux
Otherwise, use seaborn data sets, which are actual pandas data frames:
import seaborn
iris = seaborn.load_dataset("iris")
type(iris)
# <class 'pandas.core.frame.DataFrame'>
Compare with scikit-learn data sets:
from sklearn import datasets
iris = datasets.load_iris()
type(iris)
# <class 'sklearn.utils.Bunch'>
dir(iris)
# ['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']
Answered by student
Another way to combine the features and target variables is to use np.column_stack:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(np.column_stack((data.data, data.target)), columns = data.feature_names+['target'])
print(df.head())
Result:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0
If you need the string label for the target, you can use replace, converting target_names to a dictionary, and add a new column:
df['label'] = df.target.replace(dict(enumerate(data.target_names)))
print(df.head())
Result:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target   label
0                5.1               3.5                1.4               0.2     0.0  setosa
1                4.9               3.0                1.4               0.2     0.0  setosa
2                4.7               3.2                1.3               0.2     0.0  setosa
3                4.6               3.1                1.5               0.2     0.0  setosa
4                5.0               3.6                1.4               0.2     0.0  setosa
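To make the dictionary step concrete, this sketch shows what dict(enumerate(...)) produces and how replace applies it. The name list and the target values are made-up stand-ins for data.target_names and df.target.

```python
import pandas as pd

# Made-up stand-in for data.target_names
target_names = ['setosa', 'versicolor', 'virginica']

# enumerate pairs each name with its position, giving a code-to-name mapping
mapping = dict(enumerate(target_names))
print(mapping)  # {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# Made-up float codes, as they appear in the target column after stacking
target = pd.Series([0.0, 2.0, 1.0])
labels = target.replace(mapping)
print(labels.tolist())
```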
Answered by dheinz
As of version 0.23, you can directly return a DataFrame using the as_frame argument. For example, loading the iris data set:
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
df = iris.data
As far as I understand from the provisional release notes, this works for the breast_cancer, diabetes, digits, iris, linnerud, wine and california_housing data sets.
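With as_frame=True, iris.data holds only the feature columns. A short sketch of the other attributes that are available in the same Bunch (this assumes scikit-learn 0.23 or later):

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X = iris.data      # DataFrame with the four feature columns
y = iris.target    # Series named 'target'
df = iris.frame    # DataFrame with the features and the target together
print(df.shape)    # (150, 5)
```

Using iris.frame avoids having to join the features and the target by hand.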
Answered by Dhiraj Himani
Basically, what you need is the "data", which you already have in the scikit-learn Bunch; you then need just the "target" (prediction), which is also in the Bunch.
So you just need to join these two to make the data complete:
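from sklearn.datasets import load_breast_cancer
import pandas as pd
cancer = load_breast_cancer()  # assumed: the answer does not show how cancer was loaded
data_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target_df = pd.DataFrame(cancer.target, columns=['target'])
final_df = data_df.join(target_df)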