pandas ValueError: Array contains NaN or infinity in _assert_all_finite during LinearSVC
Disclaimer: this page is a translated mirror of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license and attribute it to the original authors (not me) on Stack Overflow.
Original question: http://stackoverflow.com/questions/21390084/
ValueError: Array contains NaN or infinity in _assert_all_finite during LinearSVC
Asked by ekta
I was trying to classify the wine data set here - http://archive.ics.uci.edu/ml/datasets/Wine+Quality - using logistic regression (with method='bfgs' and the l1 norm) and got a singular-matrix error (raise LinAlgError('Singular matrix')), in spite of the matrix being full rank [which I tested using np.linalg.matrix_rank(data[train_cols].values)].
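For reference, here is the kind of rank check I mean - a toy sketch on synthetic data (not my actual wine data), just to show that np.linalg.matrix_rank flags a column that is a linear combination of others:
import numpy as np
X = np.random.rand(100, 3)
X = np.column_stack([X, X[:, 0] + 2 * X[:, 1]]) # 4th column is a linear combination of the first two
print(np.linalg.matrix_rank(X)) # prints 3, not 4 -> the matrix is rank-deficient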
That is how I came to the conclusion that some features might be linear combinations of others. To check this, I experimented with Grid search/LinearSVC - and I get the error below, along with my code and data set.
I can see that only 6/7 features are actually "independent" - which I infer by comparing the rows of x_train_new[0] and x_train (so I can tell which columns are redundant).
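(For the record, an equivalent way to read off which columns the l1 penalty kept - a small sketch that assumes the clf and x_train fitted in the code below, where clf.coef_ has shape (n_classes, n_features) - is:)
import numpy as np
kept = np.any(clf.coef_ != 0, axis=0) # a feature survives if any class gives it a nonzero l1 coefficient
redundant = [col for col, keep in zip(x_train.columns, kept) if not keep]
print("columns dropped by the l1 penalty:", redundant)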
# Train & test DATA CREATION
from sklearn.svm import LinearSVC
import numpy, random
import pandas as pd
df = pd.read_csv("https://github.com/ekta1007/Predicting_wine_quality/blob/master/wine_red_dataset.csv")
#,skiprows=0, sep=',')
# (note: read_csv needs the raw CSV URL here, not the GitHub HTML page, or it will not parse as CSV)
df=df.dropna(axis=1,how='any') # also tried how='all' - still get NaN errors as below
header=list(df.columns.values) # or df.columns
X = df[df.columns - [header[-1]]] # header[-1] = ['quality'] - this is to make the code generic enough
Y = df[header[-1]] # df['quality']
rows = random.sample(df.index, int(len(df)*0.7)) # indexing the rows that will be picked in the train set
x_train, y_train = X.ix[rows],Y.ix[rows] # Fetching the data frame using indexes
x_test,y_test = X.drop(rows),Y.drop(rows)
# Training the classifier using C-Support Vector Classification.
clf = LinearSVC(C=0.01, penalty="l1", dual=False) #,tol=0.0001,fit_intercept=True, intercept_scaling=1)
clf.fit(x_train, y_train)
x_train_new = clf.fit_transform(x_train, y_train) # in older scikit-learn this keeps only the features with nonzero l1 coefficients
#print x_train_new #works
clf.predict(x_test) # does NOT work and gives NaN errors for some x_tests
clf.score(x_test, y_test) # Does NOT work
clf.coef_ # Works, but I am not sure if this is OK given the huge number of NaN's - or do the coefs get impacted?
clf.predict(x_train)
552 NaN
209 NaN
427 NaN
288 NaN
175 NaN
427 NaN
748 7
552 NaN
429 NaN
[... and MORE]
Name: quality, Length: 1119
clf.predict(x_test)
76 NaN
287 NaN
420 7
812 NaN
443 7
420 7
430 NaN
373 5
624 5
[..and More]
Name: quality, Length: 480
The strange thing is that when I run clf.predict(x_train) I still see some NaN's - what am I doing wrong? After all, the model was trained on this very data, so this should NOT happen, right?
Following this thread, I also checked that there are no nulls in my csv file (though I relabeled "quality" to the labels 5 and 7 only, from range(3,10)): How to fix "NaN or infinity" issue for sparse matrix in python?
Also - here are the dtypes of x_test and y_test/y_train...
x_test
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 1596
Data columns:
alcohol 480 non-null values
chlorides 480 non-null values
citric acid 480 non-null values
density 480 non-null values
fixed acidity 480 non-null values
free sulfur dioxide 480 non-null values
pH 480 non-null values
residual sugar 480 non-null values
sulphates 480 non-null values
total sulfur dioxide 480 non-null values
volatile acidity 480 non-null values
dtypes: float64(11)
y_test
1 5
10 5
18 5
21 5
30 5
31 7
36 7
40 5
50 5
52 7
53 5
55 5
57 5
60 5
61 5
[..And MORE]
Name: quality, Length: 480
And finally...
clf.score(x_test, y_test)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
clf.score(x_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\base.py", line 279, in score
return accuracy_score(y, self.predict(X))
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 742, in accuracy_score
y_true, y_pred = check_arrays(y_true, y_pred)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 215, in check_arrays
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 18, in _assert_all_finite
ValueError: Array contains NaN or infinity.
# I also explicitly checked for NaN's like this:
for i in df.columns:
    df[i].isnull()
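(The loop above builds the null masks but never reports them; a more direct check that actually prints something - a small sketch, assuming every column is numeric, as it is here - would be:)
import numpy as np
print(df.isnull().sum()) # NaN count per column
print(np.isfinite(df.values).all()) # False if any NaN or inf is hiding anywhere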
Tip: please also mention whether my thought process of using LinearSVC is correct for my use case, or whether I should use Grid-search instead.
Disclaimer: parts of this code were built on suggestions from similar contexts on StackOverflow and miscellaneous sources - my real use case is just trying to assess whether this method is a good fit for my scenario. That's all.
Answered by ekta
This worked. The only thing I really had to change was to use x_test.values (and likewise for the rest of the pandas DataFrames: x_train, y_train, y_test). As pointed out, the root cause was an incompatibility between pandas DataFrames and scikit-learn (which expects numpy arrays).
#changing your Pandas Dataframe elegantly to work with scikit-learn by transformation to numpy arrays
>>> type(x_test)
<class 'pandas.core.frame.DataFrame'>
>>> type(x_test.values)
<type 'numpy.ndarray'>
This hack comes from this post http://python.dzone.com/articles/python-making-scikit-learn-and and from @AndreasMueller, who pointed out the inconsistency.
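Applied to the script in the question, the fix is just to hand scikit-learn numpy arrays instead of pandas objects - a minimal sketch reusing the variables defined above:
clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf.fit(x_train.values, y_train.values)
predictions = clf.predict(x_test.values) # no more NaN rows
print(clf.score(x_test.values, y_test.values))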

