Hi, I'm trying to understand how to implement proper nested cross-validation using group k-fold (the data is non-i.i.d., so all rows for a subject must be in the same fold), if possible using a precalculated fold id column in the dataset.

Questions:

1- Although we have a custom k-fold option on the grid search (inner), we don't have a custom k-fold option for the train/test (outer) performance evaluation?

2- How can I change the custom code sample on the grid search custom k-fold option to allow for GroupKFold, e.g. sklearn.model_selection.GroupKFold?

3- For the outer train/test k-fold, how can I use a custom column with the fold id assignment?

 

thanks!

Rui

1 Answer

Best answer

Hello, 

Thanks for your input. Please find my answers below each question:

1- Although we have a custom k-fold option on the grid search (inner), we don't have a custom k-fold option for the train/test (outer) performance evaluation?

At the moment, the custom k-fold option is only available in the inner grid search for finding hyperparameters. Thanks for the suggestion, we will see if we can add this feature to the outer train/test k-fold in the future.

2- How can I change the custom code sample on the grid search custom k-fold option to allow for GroupKFold, e.g. sklearn.model_selection.GroupKFold?

You can find a GroupKFold example among the code samples on the custom CV code screen.

Note that at the moment it only works with integer columns that are passed as input to the model. We are looking to improve that in the future.


3- For the outer train/test k-fold, how can I use a custom column with the fold id assignment?

See question 1: custom k-fold on the train/test split is not supported at the moment. Thanks for the input, it would be an interesting feature indeed.

In general, if you want to configure your cross-validation strategy in a custom way that is not available in the visual interface, I suggest exporting one of the visual Machine Learning models as a Jupyter notebook and using it as a starting point to develop your own code.
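
As a starting point for that kind of custom code, here is a minimal sketch of nested cross-validation in plain scikit-learn. It is not what the exported notebook contains. It assumes a hypothetical precomputed fold_id array for the outer folds (via PredefinedSplit) and a hypothetical subject_id array for the inner GroupKFold; the synthetic data, the logistic regression and the parameter grid are only placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, PredefinedSplit

# Synthetic stand-in data: 20 subjects with 10 rows each (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
subject_id = np.repeat(np.arange(20), 10)
# Hypothetical precomputed fold column: all rows of a subject share one fold id.
fold_id = subject_id % 5

outer_cv = PredefinedSplit(test_fold=fold_id)  # outer folds come from the column
inner_cv = GroupKFold(n_splits=3)              # inner folds are grouped by subject
param_grid = {"C": [0.1, 1.0, 10.0]}

outer_scores = []
for train_idx, test_idx in outer_cv.split():
    # No subject may appear on both sides of the outer split.
    assert not set(subject_id[train_idx]) & set(subject_id[test_idx])

    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=inner_cv)
    # groups is forwarded to the inner GroupKFold splitter.
    search.fit(X[train_idx], y[train_idx], groups=subject_id[train_idx])
    # Score the refit best model on the held-out outer fold.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("outer fold accuracies:", np.round(outer_scores, 3))
print("mean accuracy:", float(np.mean(outer_scores)))
```

Because the groups are passed separately from X here, the subject and fold columns never have to be model features.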

Cheers,

Alexandre

 

Hi Alexandre, thanks for replying, looking into it.

One more question, sorry: how do I use the leave-one-out sample to do this by ClientId, for example? I want to cross-validate in a way that all records belonging to a client are in either the train or the test fold, but never both. Should the integer column contain fold ids or user ids?

Still unclear.
Thx!
Two cases:
1. Assuming you have several rows per ClientId:
To apply a leave-one-group-out strategy, you would use our code sample for DKULeaveOneGroupOut:
from dataiku.doctor.utils import crossval

# You need to select the column (of the design matrix) that is used to split the dataset
# This column is *after preprocessing* - so for example, categorical columns are not available
# anymore.

# To know the names of the columns after preprocessing, train a first model with regular crossval
# and find the names in the "Features" section of the model results.

# Note that the column will always be used for training

cv = crossval.DKULeaveOneGroupOut("<client_id>")
# Client_id needs to be an integer

Note that this is not ideal, since it means that the client_id will be fed to the model, so there is a slight risk of overfitting. We are looking to improve that in the future.
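
For comparison, in your own code outside the custom CV code screen (for example in an exported notebook), plain scikit-learn takes the group labels as a separate groups argument, so the client id does not have to be a model feature. A minimal sketch on toy data, where client_id and its values are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 3 hypothetical clients with 2 rows each.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
client_id = np.array([101, 101, 102, 102, 103, 103])

logo = LeaveOneGroupOut()
# One split per client: that client's rows form the test fold.
splits = list(logo.split(X, y, groups=client_id))

for train_idx, test_idx in splits:
    held_out = set(client_id[test_idx])
    # The held-out client never appears in the training rows.
    assert held_out.isdisjoint(client_id[train_idx])
    print("held-out client:", held_out, "test rows:", test_idx)
```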

2. Assuming there is only one row per client id:
Directly use sklearn.model_selection.LeaveOneOut in the custom CV code screen ;)
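
A minimal sketch of this second case, assuming the custom CV code screen expects the splitter in a variable named cv, as in the DKULeaveOneGroupOut sample above. The toy array is only there to show that each row becomes its own test fold:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Assumption: the custom CV screen reads the splitter from `cv`.
cv = LeaveOneOut()

# Illustration only: 4 rows give 4 splits, each holding out a single row.
X_demo = np.arange(8).reshape(4, 2)
splits = list(cv.split(X_demo))
for train_idx, test_idx in splits:
    print("train rows:", train_idx, "test row:", test_idx)
```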
ok checking it, thx Alexandre!