Notice: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/48719937/

Date: 2020-09-14 05:09:23  Source: igfitidea

Getting TypeError: reduction operation 'argmax' not allowed for this dtype when trying to use idxmax()

python, python-3.x, pandas

Asked by cod3min3

When using the idxmax() function in Pandas, I keep receiving this error.

Traceback (most recent call last):
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/main.py", line 20, in <module>
    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/Classification.py", line 39, in print_kfold_scores
    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 1369, in idxmax
    i = nanops.nanargmax(_values_from_object(self), skipna=skipna)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/nanops.py", line 74, in _f
    raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
TypeError: reduction operation 'argmax' not allowed for this dtype

The Pandas version I am using is 0.22.0

main.py

import ExploratoryDataAnalysis as eda
import Preprocessing as processor
import Classification as classify
import pandas as pd


data_path = '/Users/username/college/year-4/fyp-credit-card-fraud/data/'

if __name__ == '__main__':
    df = pd.read_csv(data_path + 'creditcard.csv')
    # eda.init(df)
    # eda.check_null_values()
    # eda.view_data()
    # eda.check_target_classes()
    df = processor.noramlize(df)

    X_training, X_testing, y_training, y_testing, X_training_undersampled, X_testing_undersampled, \
    y_training_undersampled, y_testing_undersampled = processor.resample(df)

    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)

Classification.py

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, \
    roc_auc_score, roc_curve, recall_score, classification_report
import pandas as pd
import numpy as np


def print_kfold_scores(X_training, y_training):
    print('\nKFold\n')

    fold = KFold(len(y_training), 5, shuffle=False)

    c_param_range = [0.01, 0.1, 1, 10, 100]

    results = pd.DataFrame(index=range(len(c_param_range), 2), columns=['C_parameter', 'Mean recall score'])
    results['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('\n-------------------------------------------')

        recall_accs = []
        for iteration, indices in enumerate(fold, start=1):
            lr = LogisticRegression(C=c_param, penalty='l1')
            lr.fit(X_training.iloc[indices[0], :], y_training.iloc[indices[0], :].values.ravel())

            y_prediction_undersampled = lr.predict(X_training.iloc[indices[1], :].values)
            recall_acc = recall_score(y_training.iloc[indices[1], :].values, y_prediction_undersampled)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        results.ix[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('\nMean recall score ', np.mean(recall_accs))
        print('\n')

    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter'] # Error occurs on this line

    print('*****************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c_param)
    print('*****************************************************************')

    return best_c_param

The line that is causing the problem is this:

best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']

The output of the program is below:

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/username/College/year-4/fyp-credit-card-fraud/code/main.py
/Users/username/Library/Python/3.6/lib/python/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Dataset Ratios

Percentage of genuine transactions:  0.5
Percentage of fraudulent transactions 0.5
Total number of transactions in resampled data:  984


Whole Dataset Split

Number of transactions in training dataset:  199364
Number of transactions in testing dataset:  85443
Total number of transactions in dataset:  284807


Undersampled Dataset Split

Number of transactions in training dataset 688
Number of transactions in testing dataset:  296
Total number of transactions in dataset:  984

KFold

-------------------------------------------
C parameter:  0.01

-------------------------------------------
Iteration  1 : recall score =  0.931506849315
Iteration  2 : recall score =  0.917808219178
Iteration  3 : recall score =  1.0
Iteration  4 : recall score =  0.959459459459
Iteration  5 : recall score =  0.954545454545

Mean recall score  0.9526639965


-------------------------------------------
C parameter:  0.1

-------------------------------------------
Iteration  1 : recall score =  0.849315068493
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.915254237288
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.909090909091

Mean recall score  0.89652397189


-------------------------------------------
C parameter:  1

-------------------------------------------
Iteration  1 : recall score =  0.86301369863
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.983050847458
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.924242424242

Mean recall score  0.915853322981


-------------------------------------------
C parameter:  10

-------------------------------------------
Iteration  1 : recall score =  0.849315068493
Iteration  2 : recall score =  0.876712328767
Iteration  3 : recall score =  0.983050847458
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.939393939394

Mean recall score  0.918883626012


-------------------------------------------
C parameter:  100

-------------------------------------------
Iteration  1 : recall score =  0.86301369863
Iteration  2 : recall score =  0.876712328767
Iteration  3 : recall score =  0.983050847458
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.924242424242

Mean recall score  0.918593049009


Traceback (most recent call last):
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/main.py", line 20, in <module>
    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/Classification.py", line 39, in print_kfold_scores
    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 1369, in idxmax
    i = nanops.nanargmax(_values_from_object(self), skipna=skipna)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/nanops.py", line 74, in _f
    raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
TypeError: reduction operation 'argmax' not allowed for this dtype

Process finished with exit code 1

Answered by Allen

#best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

We should replace the line of code above.

The main problem:

1) The dtype of "Mean recall score" is object, so you can't call idxmax() on it.
2) You should convert "Mean recall score" from object to float.
3) You can use apply(pd.to_numeric, errors='coerce', axis=0) to do the conversion.

best_c = results_table  # note: this is the same object as results_table, not a copy
best_c.dtypes.eq(object)  # shows which columns have dtype object
new = best_c.columns[best_c.dtypes.eq(object)]  # names of the object-dtype columns
best_c[new] = best_c[new].apply(pd.to_numeric, errors='coerce', axis=0)  # convert those columns to numeric
best_c  # display the converted table
best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']  # C value with the highest mean recall
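For what it's worth, the conversion can be tried on a small stand-in for the results table. This is only a sketch: the table is rebuilt by hand from the mean recall scores printed in the question, the object dtype is forced explicitly (an assumption about what the question's code ends up with), and the name object_cols is mine.

```python
import pandas as pd

# Stand-in for the results table. The scores are the mean recall values printed
# in the question; dtype=object is forced here to mimic the failing column
# (an assumption, not taken from the original code).
results_table = pd.DataFrame({
    'C_parameter': [0.01, 0.1, 1, 10, 100],
    'Mean recall score': pd.Series(
        [0.9526639965, 0.89652397189, 0.915853322981,
         0.918883626012, 0.918593049009],
        dtype=object),
})

# Find the object-dtype columns and convert each one to a numeric dtype.
object_cols = results_table.columns[results_table.dtypes.eq(object)]
for col in object_cols:
    results_table[col] = pd.to_numeric(results_table[col], errors='coerce')

# idxmax() can now be used because the column is float64.
best_c = results_table.loc[results_table['Mean recall score'].idxmax(), 'C_parameter']
print(results_table.dtypes)
print(best_c)
```

With these scores the highest mean recall belongs to the first row, so best_c comes out as 0.01.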

Answered by Lucas Azevedo

The cell values are, by default, of a non-numeric type. argmin(), idxmin(), argmax() and other similar functions need the dtype to be numeric.
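As a rough illustration of where a non-numeric dtype can come from: a frame created from an index and column names only, with no data, gets object columns on the pandas versions I have used. The sketch below shows that and one possible way around it, passing dtype=float at creation time. The column name 'score' and the values are made up.

```python
import pandas as pd

# A frame built from an index and column names only holds NaN placeholders;
# on the pandas versions I have used, its columns come out as dtype object.
untyped = pd.DataFrame(index=range(3), columns=['score'])
print(untyped.dtypes)

# Passing dtype=float up front gives a float64 column instead.
typed = pd.DataFrame(index=range(3), columns=['score'], dtype=float)
for i, value in enumerate([0.2, 0.9, 0.4]):  # made-up scores
    typed.loc[i, 'score'] = value

print(typed.dtypes)
best_row = typed['score'].idxmax()  # label of the row holding the largest score
print(best_row)
```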

The easiest solution is to use pd.to_numeric() to convert your series (or columns) to numeric types. An example with a data frame df with a column 'a' would be:

df['a'] = pd.to_numeric(df['a'])
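A slightly fuller sketch of the same call, assuming a hypothetical column that holds numbers as strings plus one entry that cannot be parsed. With errors='coerce', pd.to_numeric() turns the unparseable entry into NaN instead of raising, and idxmax() then skips it.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, plus one junk entry.
df = pd.DataFrame({'a': ['0.2', '0.7', 'n/a']})

# errors='coerce' replaces anything that cannot be parsed with NaN.
df['a'] = pd.to_numeric(df['a'], errors='coerce')

print(df['a'].dtype)
best_row = df['a'].idxmax()  # NaN is skipped by default
print(best_row)
```

Without errors='coerce', the default behaviour is to raise on the unparseable entry, which may be preferable if silent NaN values would hide a data problem.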

A more complete answer on type casting on pandas can be found here.

Hope that helps :)

Answered by demongolem

If NaN values are present (and we can sort of see this from the stack trace), then even when you think you are working with a data frame of numerics, you could well have mixed types, in particular a string among the numerics. Let me give you 3 code examples: the first 2 work, while the last doesn't and is likely your case.

This represents all numeric data; it will work with idxmax:

the_dict = {}
the_dict['a'] = [0.1, 0.2, 0.5]
the_dict['b'] = [0.3, 0.4, 0.6]
the_dict['c'] = [0.25, 0.3, 0.9]
the_dict['d'] = [0.2, 0.1, 0.4]
the_df = pd.DataFrame(the_dict)

This represents a numeric NaN; it will also work with idxmax:

the_dict = {}
the_dict['a'] = [0.1, 0.2, 0.5]
the_dict['b'] = [0.3, 0.4, 0.6]
the_dict['c'] = [0.25, 0.3, 0.9]
the_dict['d'] = [0.2, 0.1, np.nan]  # np.nan works on all NumPy versions; the np.NaN alias was removed in NumPy 2.0
the_df = pd.DataFrame(the_dict)

This could be the exact problem reported by the OP. In any case, if we have mixed types in any fashion, we will get the error the OP reported:

the_dict = {}
the_dict['a'] = [0.1, 0.2, 0.5]
the_dict['b'] = [0.3, 0.4, 0.6]
the_dict['c'] = [0.25, 0.3, 0.9]
the_dict['d'] = [0.2, 0.1, 'NaN']
the_df = pd.DataFrame(the_dict)
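If you suspect this mixed-type case, one way to track down the offending cells is to look at the Python type behind each value. This is a minimal sketch using a trimmed-down version of the third frame; the names cell_types and string_cells are mine.

```python
import pandas as pd

the_df = pd.DataFrame({
    'a': [0.1, 0.2, 0.5],
    'd': [0.2, 0.1, 'NaN'],  # the string 'NaN', not a float NaN
})

# Column 'd' is reported as object because it mixes floats and a string.
print(the_df.dtypes)

# Show the Python type behind every cell of the suspicious column.
cell_types = the_df['d'].map(type)
print(cell_types)

# Keep only the cells that are strings.
string_cells = the_df['d'][the_df['d'].map(lambda v: isinstance(v, str))]
print(string_cells)
```

Once the stray strings are located, they can be fixed at the source or converted with pd.to_numeric().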

Answered by Hadij

In short, try this:

best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']

instead of

best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
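One caveat, sketched below with made-up data: astype(float) works when every value in the object column really is a number, but it raises ValueError if a stray string is present. In that situation pd.to_numeric(..., errors='coerce') may be the more forgiving choice. The series names and values here are hypothetical.

```python
import pandas as pd

# Object column that holds only numbers: astype(float) is enough.
scores = pd.Series([0.95, 0.89, 0.91], dtype=object)
best = scores.astype(float).idxmax()
print(best)

# Object column with a stray string: astype(float) raises ValueError.
dirty = pd.Series([0.95, 'oops', 0.91], dtype=object)
try:
    dirty.astype(float)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)

# pd.to_numeric with errors='coerce' turns the stray string into NaN instead.
coerced = pd.to_numeric(dirty, errors='coerce')
print(coerced.idxmax())
```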