Hi, I have a dataset with too few churners (90% are non-churners), so I want to split it into a train set and a test set, with a higher percentage of churners in the train set.
I tried to use the split recipe but I can't manage to get what I want (either I get the same proportion of churners in the train set, or I get only churners).

1 Answer

One way to achieve this is to split in "filters" mode and define the filter with a formula.

For example, to split into "train_set_with_more_churners" and "test_set_with_fewer_churners", use:

* A filter that sends rows into "train_set_with_more_churners", with a formula like:
     if (churner == 1, rand() < 0.8, rand() < 0.5)

* Send all other rows into "test_set_with_fewer_churners"

This way:
* About 80% of churners will be sent to the train set and about 20% to the test set
* About 50% of non-churners will be sent to the train set and about 50% to the test set
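
To see what this does to the class balance, here is a minimal Python sketch that mimics the formula outside of DSS. It is only an illustration, not what the split recipe runs internally: the function names and the generated rows are hypothetical, and the 10% churner rate is taken from the question. With those rates you would expect roughly 0.08 / 0.53 ≈ 15% churners in the train set and 0.02 / 0.47 ≈ 4% in the test set.

```python
import random


def split_with_unequal_rates(rows, p_churner=0.8, p_non_churner=0.5, seed=0):
    """Send each row to train with a probability that depends on its class.

    Mirrors the formula: if (churner == 1, rand() < 0.8, rand() < 0.5).
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        p = p_churner if row["churner"] == 1 else p_non_churner
        if rng.random() < p:
            train.append(row)
        else:
            test.append(row)
    return train, test


def churn_rate(rows):
    """Fraction of rows whose 'churner' value is 1."""
    return sum(r["churner"] for r in rows) / len(rows)


# Hypothetical data: 10% churners, as described in the question.
rows = [{"churner": 1 if i % 10 == 0 else 0} for i in range(100_000)]
train, test = split_with_unequal_rates(rows)
print("train churn rate:", round(churn_rate(train), 3))
print("test churn rate:", round(churn_rate(test), 3))
```

Because the split is random, the proportions are approximate and vary from run to run unless the seed is fixed.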


If you have enough data and can afford to waste some, you can also use a sampling recipe in "class rebalancing" mode (but that will subsample, so you will lose some non-churners).
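
As a rough idea of what rebalancing by subsampling means, here is a small Python sketch. It is an assumption-laden illustration, not the sampling recipe's actual implementation: the function name is hypothetical, and it assumes churners are the minority class and simply keeps as many non-churners as there are churners.

```python
import random


def undersample_majority(rows, label="churner", seed=0):
    """Keep every churner and a random subset of non-churners of the same size."""
    rng = random.Random(seed)
    churners = [r for r in rows if r[label] == 1]
    non_churners = [r for r in rows if r[label] != 1]
    # Non-churners that are not drawn here are discarded: this is the "wasted" data.
    kept = rng.sample(non_churners, min(len(churners), len(non_churners)))
    balanced = churners + kept
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 10% churners.
rows = [{"churner": 1 if i % 10 == 0 else 0} for i in range(1_000)]
balanced = undersample_majority(rows)
print(len(balanced), sum(r["churner"] for r in balanced))
```
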
Thanks for the advice!
I finally found an option to rebalance the sample before training the model. However, I don't know what percentage of churners I have in my data sample. Is there a way to find out, so that I know whether I need to retrain my model or not?
Hi Miguel, you can go to the dataset view, click on the column header where you have this information, and select 'Analyze' > categorical.
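
If you would rather compute the same figure in code, a minimal Python sketch could look like this. The function name and the sample labels are hypothetical; in practice you would pass in the values of your own churn column.

```python
from collections import Counter


def class_percentages(values):
    """Return the percentage of rows for each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100.0 * count / total for value, count in counts.items()}


# Hypothetical labels: 1 churner out of 10 rows.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(class_percentages(labels))
```
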

©Dataiku 2012-2018 - Privacy Policy