I'm attempting to make the best use of the "Custom Code" option for hyperparameter optimization and have a few questions. For reference, here are the comments on how to write the custom function:
# - y_pred is a numpy ndarray with shape:
#     - (nb_records,) for regression problems and classification problems
#       where 'needs probas' (see below) is false
#       (for classification, the values are the numeric class indexes)
#     - (nb_records, nb_classes) for classification problems where 'needs probas' is true
With "needs_probas" set to true, I run into some problems.
if len(np.shape(y_pred)) == 2:
    # scoring the model
    ds['probas'] = y_pred[:,1]
else:
    # training the model
    ds['probas'] = y_pred[0:]
Hi,
You have indeed stumbled on an awkward behavior of the custom scoring handling: it passes the full output of predict_proba for the final scoring, but only the second column (the positive class) during hyperparameter search and k-fold. Your solution is essentially the best one can come up with.
You needn't worry about the threshold, as it is computed after scoring with your code: DSS will try different threshold values and call your scoring code with a single column corresponding to the positive class.
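For anyone landing on this thread, the workaround described above can be wrapped in a small helper. This is only a sketch: `positive_class_probas` is a hypothetical name, not part of DSS, and the metric inside `score` (negative mean squared error between 0/1 labels and probabilities) is purely illustrative and assumes a binary target encoded as 0/1.

```python
import numpy as np


def positive_class_probas(y_pred):
    """Return a 1-D array of positive-class probabilities (binary case).

    Hypothetical helper: accepts either the full (nb_records, 2) output
    or the single positive-class column of shape (nb_records,).
    """
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 2 and y_pred.shape[1] >= 2:
        # Full probability matrix: column 1 is the positive class.
        return y_pred[:, 1]
    # Already a single column (possibly shaped (nb_records, 1)).
    return y_pred.ravel()


def score(y_valid, y_pred):
    # Normalize the input first, so the metric works for both shapes.
    probas = positive_class_probas(y_pred)
    # Illustrative metric only; assumes y_valid holds 0/1 labels.
    return -float(np.mean((np.asarray(y_valid) - probas) ** 2))
```

With this in place, the body of the metric never needs to know which of the two shapes it was called with.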
Hi @rmoore,
This has been fixed in release 8.0.2. More precisely, the custom metric function can now correctly assume a y_pred shape of (N, 2) in the case of binary classification with needs_proba == True when performing a hyperparameter search.
Cheers
Hello, I am trying to create a custom metric that returns the precision score for the top 100 predictions (ranked by predicted probability). The code is below:
import pandas as pd
from sklearn.metrics import precision_score
def score(y_valid, y_pred):
    """
    Custom scoring function.
    Must return a float quantifying the estimator prediction quality.
      - y_valid is a pandas Series
      - y_pred is a numpy ndarray with shape:
          - (nb_records,) for regression problems and classification problems
            where 'needs probas' (see below) is false
            (for classification, the values are the numeric class indexes)
          - (nb_records, nb_classes) for classification problems where
            'needs probas' is true
      - [optional] X_valid is a dataframe with shape (nb_records, nb_input_features)
      - [optional] sample_weight is a numpy ndarray with shape (nb_records,)
                   NB: this option requires a variable set as "Sample weights"
    """
    scoring = pd.DataFrame()
    scoring['actual'] = y_valid
    scoring['probability'] = y_pred[:, 1]
    scoring = scoring.sort_values(by = 'probability', ascending = False)
    top_100 = scoring.iloc[:100]
    pr_score = precision_score(top_100['actual'], top_100['probability'])
    return pr_score
I am getting the following error:
File "<string>", line 23, in score
IndexError: too many indices for array
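This IndexError is what NumPy raises when `y_pred[:, 1]` is applied to a 1-D array, which matches the single-column case described earlier in the thread. Separately, `precision_score` expects class labels rather than probabilities as its second argument. A possible rewrite is sketched below; `precision_at_k` is a hypothetical helper name, and the sketch assumes a binary target whose positive class is encoded as 1.

```python
import numpy as np
import pandas as pd


def precision_at_k(y_valid, y_pred, k=100):
    """Share of actual positives among the k highest-probability records."""
    y_pred = np.asarray(y_pred)
    if y_pred.ndim == 2 and y_pred.shape[1] >= 2:
        # Full probability matrix: keep the positive-class column.
        probas = y_pred[:, 1]
    else:
        # Single column of positive-class probabilities.
        probas = y_pred.ravel()
    scoring = pd.DataFrame({
        "actual": np.asarray(y_valid),
        "probability": probas,
    })
    top_k = scoring.sort_values(by="probability", ascending=False).head(k)
    # Every record in the top k counts as predicted positive, so precision
    # is the fraction of those records that are actually positive
    # (assumption: the positive class is encoded as 1).
    return float((top_k["actual"] == 1).mean())


def score(y_valid, y_pred):
    return precision_at_k(y_valid, y_pred, k=100)
```

If there are fewer than `k` records, `head(k)` simply returns all of them, so the function still returns a value on small validation folds.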
Hello, where do you pass the needs_proba parameter?
Hi, after contacting Dataiku I found out that this feature will be available in Dataiku 11.
Hello, according to https://doc.dataiku.com/dss/latest/release_notes/index.html, it seems that Dataiku 11 has already been released. Correct me if I am mistaken.