Nested Cross Validation, Group Fold, and support for column with fold id assignment?

Solved!
UserBird
Dataiker
Hi, I'm trying to understand how to implement proper nested cross-validation using group k-fold (the data is non-i.i.d., so all rows for a subject must be in the same fold), if possible using a precalculated fold id column in the dataset.

Questions:

1- Although there is a custom k-fold option for the grid search (inner loop), is there no custom k-fold option for the train/test (outer loop) performance evaluation?

2- How can I change the custom code sample in the grid search custom k-fold option to use GroupKFold, e.g. sklearn.model_selection.GroupKFold?

3- For the outer train/test k-fold, how can I use a custom column with fold id assignments?



thanks!

Rui
1 Solution
Alex_Combessie
Dataiker Alumni

Hello, 



Thanks for your input. Please find my answers below each question:



1- Although there is a custom k-fold option for the grid search (inner loop), is there no custom k-fold option for the train/test (outer loop) performance evaluation?

At the moment, the custom k-fold option is only available in the inner grid search used to find hyperparameters. Thanks for the suggestion, we will see if we can add this feature to the outer train/test k-fold in the future.



2- How can I change the custom code sample in the grid search custom k-fold option to use GroupKFold, e.g. sklearn.model_selection.GroupKFold?

You can find a GroupKFold example among the code samples on the custom CV code screen.





Note that at the moment it only works with integer columns which are passed as input to the model. We are looking to improve that in the future.
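For reference, here is a rough sketch of how GroupKFold behaves in plain scikit-learn, outside of the custom CV code screen. This is not the DSS code sample itself; the toy arrays and the `subject_id` column name are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 rows belonging to 4 subjects (3 rows each).
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 1] * 6)
# Hypothetical integer group column (one id per subject).
subject_id = np.repeat([101, 102, 103, 104], 3)

cv = GroupKFold(n_splits=4)
folds = list(cv.split(X, y, groups=subject_id))

for train_idx, test_idx in folds:
    train_subjects = set(subject_id[train_idx])
    test_subjects = set(subject_id[test_idx])
    # All rows of a subject land on the same side of the split.
    assert train_subjects.isdisjoint(test_subjects)
```

The groups are only used to build the splits here; whether they are also used as a model feature depends on what you put in `X`.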





3- For the outer train/test k-fold, how can I use a custom column with fold id assignments?

See question 1: custom k-fold on the train/test split is not supported at the moment. Thanks for the input, it would be an interesting feature indeed.



In general, if you want to configure your cross-validation strategy in a custom way that is not available in the visual interface, I suggest exporting one of the visual Machine Learning models as a Jupyter notebook and using it as a starting point to develop your own code.
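If you go the custom code route, a nested cross-validation with grouped folds could look roughly like the sketch below. It uses plain scikit-learn rather than the code generated by the notebook export, and it assumes a precomputed `fold_id` column for the outer loop and a `subject_id` column for the inner loop; both names, the synthetic data, and the choice of LogisticRegression are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, LeaveOneGroupOut

rng = np.random.RandomState(0)
n_subjects, rows_per_subject = 12, 5
subject_id = np.repeat(np.arange(n_subjects), rows_per_subject)
X = rng.normal(size=(subject_id.size, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=subject_id.size) > 0).astype(int)

# Hypothetical precomputed fold id column: all rows of a subject share one fold id.
fold_id = subject_id % 3

outer_cv = LeaveOneGroupOut()      # outer loop: one test fold per distinct fold id
inner_cv = GroupKFold(n_splits=4)  # inner loop: grouped folds for the grid search
param_grid = {"C": [0.1, 1.0, 10.0]}

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups=fold_id):
    # No subject may appear in both the outer train and test parts.
    assert set(subject_id[train_idx]).isdisjoint(subject_id[test_idx])
    search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=inner_cv)
    # The inner splitter receives the subject ids of the outer training rows only.
    search.fit(X[train_idx], y[train_idx], groups=subject_id[train_idx])
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("accuracy per outer fold:", np.round(outer_scores, 3))
```

Hyperparameters are tuned only on the outer training rows, so each outer score is measured on subjects the grid search never saw.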



Cheers,



Alexandre



 


8 Replies

UserBird
Dataiker
Author
Hi Alexandre, thanks for replying, looking into it

One more question, sorry: how do I use the leave-one-out sample to do this by ClientId, for example? I want to cross-validate so that all records belonging to a client are in either the train fold or the test fold, but never both. Should the integer column contain fold ids or user ids?

still unclear
thx!
Alex_Combessie
Dataiker Alumni
Two cases:
1. Assuming you have several rows per ClientId:
In order to apply a leave-one-out strategy, you would use our code sample for DKULeaveOneGroupOut:
from dataiku.doctor.utils import crossval

# You need to select the column (of the design matrix) that is used to split the dataset
# This column is *after preprocessing* - so for example, categorical columns are not available
# anymore.

# To know the names of the columns after preprocessing, train a first model with regular crossval
# and find the names in the "Features" section of the model results.

# Note that the column will always be used for training

cv = crossval.DKULeaveOneGroupOut("")
# Client_id needs to be an integer

Note that it is not ideal since it means that the client_id will be fed to the model. So there is a slight risk of overfitting. We are looking to improve that in the future.

2. Assuming there is only one row per client id:
Directly use sklearn.model_selection.LeaveOneOut in the custom CV code screen 😉
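To make case 1 more concrete, here is a rough sketch of the leave-one-client-out idea in plain scikit-learn, using sklearn.model_selection.LeaveOneGroupOut rather than the DKULeaveOneGroupOut helper. In this standalone form the client ids are passed as `groups` and kept out of the feature matrix. The `client_id` values and toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical client ids: several rows per client.
client_id = np.array([7, 7, 7, 12, 12, 31, 31, 31, 31])
# Features only; the client id is not a column of X.
X = np.arange(18, dtype=float).reshape(9, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X, y, groups=client_id):
    test_clients = np.unique(client_id[test_idx])
    held_out.append(test_clients.tolist())
    # The held-out client never appears in the training rows.
    assert not np.isin(client_id[train_idx], test_clients).any()
```

So the grouping column holds client ids, not fold ids: each distinct client becomes its own test fold.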
UserBird
Dataiker
Author
ok checking it, thx Alexandre!
omallet
Level 2
Hi Alexandre,
Do you have any idea when this custom cross-validation option will be available?
Thank you,
Oscar
Alex_Combessie
Dataiker Alumni
Do you mean the ability to code your own CV object for the "test" phase, as is already possible for the hyperparameter search phase?
omallet
Level 2
Yes, absolutely, it would be very convenient. Personally, I would love to be able to use GroupKFold cross-validation in the test phase.
Alex_Combessie
Dataiker Alumni
Thanks for the feedback, I have logged this to our product team.