
Custom Code Metric

Solved!
rmoore

I'm attempting to make the best use of the "Custom Code" option for hyperparameter optimization and have a few questions. For reference, here are the comments on how to write the custom function:

# - y_pred is a numpy ndarray with shape:
#      - (nb_records,) for regression problems and classification problems
#        where 'needs probas' (see below) is false
#        (for classification, the values are the numeric class indexes)
#      - (nb_records, nb_classes) for classification problems where 'needs probas' is true
  • No real issues when "needs_probas" is false. By appending y_pred as a column to X_valid, I'm able to see which rows were predicted as "True" for my binary classification problem (for a given threshold).

With "needs_probas" set to true, I run into some problems.

  • It appears that the shape of y_pred is different depending on whether the model is training or scoring. Here's the code I've implemented that seems to work around the problem (again for a binary classification problem). Should this be necessary, or am I missing something?
if len(np.shape(y_pred)) == 2:
    # scoring the model
    ds['probas'] = y_pred[:, 1]
else:
    # training the model
    ds['probas'] = y_pred[0:]
  • With "needs_probas" false, the scoring seems to be dependent on the threshold (a row's prediction will be "true" when the proba is above a threshold) and with "needs_probas" false, it appears that the threshold is not provided to the scoring function. Is this correct and expected or am I missing something? Maybe the "needs_threshold" property just isn't implemented in DSS for binary classifications? (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)

 

6 Replies
fchataigner2
Dataiker

Hi,

You did indeed stumble on a not-so-nice behavior of the custom scoring handling, which passes the full output of predict_proba when doing the final scoring, but only the second column (the positive class) when doing the hyperparameter search and k-fold. Your solution is essentially the best one can come up with.

You needn't worry about the threshold, as it is computed after scoring with your code: DSS will try different threshold values and call your scoring code with a single column corresponding to the positive class.
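For reference, a defensive way of handling both shapes inside the custom metric could look like the following. This is only a sketch: it assumes a binary classification problem, that the second column of predict_proba corresponds to the positive class, and it uses AUC purely as an example metric.

import numpy as np
from sklearn.metrics import roc_auc_score

def score(y_valid, y_pred):
    # 2-D (nb_records, nb_classes) during the final scoring,
    # 1-D (nb_records,) during hyperparameter search and k-fold
    if np.ndim(y_pred) == 2:
        probas = y_pred[:, 1]  # keep only the positive class
    else:
        probas = y_pred
    # example metric: AUC on the positive-class probabilities
    return roc_auc_score(y_valid, probas)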

MehdiH
Dataiker

Hi @rmoore,

This has been fixed in release 8.0.2:

More precisely, the custom metric function can now correctly assume a y_pred shape of (N, 2) in the case of binary classification with needs_proba == True, when performing a hyperparameter search.
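Concretely, on 8.0.2 and later a custom metric for binary classification with needs_proba can rely on that shape directly. A minimal sketch (average precision is used here purely as an example metric):

from sklearn.metrics import average_precision_score

def score(y_valid, y_pred):
    # with needs_proba == True, y_pred can be assumed to have shape (nb_records, 2)
    # both for the final scoring and for the hyperparameter search
    return average_precision_score(y_valid, y_pred[:, 1])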

Cheers

elnurmdov
Level 1

Hello, I am trying to create a custom metric that returns the precision score for the first 100 predictions. The code is below:

import pandas as pd
from sklearn.metrics import precision_score

def score(y_valid, y_pred):
    
    """
    Custom scoring function.
    Must return a float quantifying the estimator prediction quality.
      - y_valid is a pandas Series
      - y_pred is a numpy ndarray with shape:
           - (nb_records,) for regression problems and classification problems
             where 'needs probas' (see below) is false
             (for classification, the values are the numeric class indexes)
           - (nb_records, nb_classes) for classification problems where
             'needs probas' is true
      - [optional] X_valid is a dataframe with shape (nb_records, nb_input_features)
      - [optional] sample_weight is a numpy ndarray with shape (nb_records,)
                   NB: this option requires a variable set as "Sample weights"
    """

    scoring = pd.DataFrame()
    scoring['actual'] = y_valid 
    scoring['probability'] = y_pred[:, 1]
    scoring = scoring.sort_values(by = 'probability', ascending = False)
    top_100 = scoring.iloc[:100]
    pr_score = precision_score(top_100['actual'], top_100['probability'])
    return pr_score

    

I am getting the following error:

File "<string>", line 23, in score
IndexError: too many indices for array
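
If I understand the replies above correctly, y_pred is passed as a one-dimensional array during the hyperparameter search, which would explain the IndexError on y_pred[:, 1]. A guarded version of the same idea might look like this (just a sketch, assuming the target is encoded as 0/1 with 1 as the positive class, and treating the 100 highest-scored records as the predicted positives before calling precision_score):

import numpy as np
import pandas as pd
from sklearn.metrics import precision_score

def score(y_valid, y_pred):
    # keep only the positive-class probabilities, whatever the shape of y_pred
    probas = y_pred[:, 1] if np.ndim(y_pred) == 2 else y_pred

    scoring = pd.DataFrame({'actual': np.asarray(y_valid), 'probability': probas})
    top_100 = scoring.sort_values(by='probability', ascending=False).iloc[:100]

    # precision_score expects hard labels, so treat the top-100 records as predicted positive
    predicted = np.ones(len(top_100), dtype=int)
    return precision_score(top_100['actual'], predicted)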

 

elnurmdov
Level 1

Hello, where do you pass the needs_proba parameter?

Chelsea
Level 1

Hi, after contacting Dataiku I have found that this feature will be available in Dataiku 11.

elnurmdov
Level 1

Hello, according to https://doc.dataiku.com/dss/latest/release_notes/index.html it seems that Dataiku 11 has already been released. Correct me if I am mistaken.
