How to split dataset based on the value of a column and define the number of the output datasets

stephl
Level 2
How to split dataset based on the value of a column and define the number of the output datasets

Hello, I have a dataset with millions rows in the format as below, and want to split it into two datasets A and B. Is it possible to do by using the visual recipes of Dataiku (without coding)?

Dataset A: data with only "target_flg" = 1

Dataset B: data with only "target_flg" = 0, but instead of exporting all rows where "target_flg" = 0, I want to exact the data by random and ensure the numbers of rows of Dataset B is the exact 9 times of Dataset A.

(In the original dataset, the rows of "target_flg" = 0 is over 9 times of "target_flg" = 1)

target_flgCol_ACol_BCol_C
1AAABBBDDD
0THJJGGDDS
0EEEDDDJGG
0WWDJGGDDD
1DDDDDDBBB
1EEEBBBEEE
0BBBJGGAAA
0AAAAAAJGG

 

I've tried to use the Split Recipe but it seems there is no options to define the numbers of output datasets based on a particular dataset.

 

Ultimately, I want to join Dataset A and B together to build a ML model.

 

 

0 Kudos
4 Replies
Turribeach

You can either split by percentage or you can split by value, but you can't combine the two together. Having said that you are saying that "In the original dataset, the rows of "target_flg" = 0 is over 9 times of "target_flg" = 1" so using a Split Recipe and splitting by value "target_flg" = 1 for Dataset A and "target_flg" = 0 for Dataset B will leave Dataset B being 9 times bigger than Dataset A, which is what you want right? So I don't really see what your problem is.

stephl
Level 2
Author

Thanks @Turribeach .

I would like the number of rows of Dataset B to be the EXACT 9 times of Dataset A... 

It seems that is impossible without coding....?

0 Kudos
Turribeach

The Splitting method "Randomly dispatch data" allows you to set a ratio for the split. So if you set 10% for one dataset the other will be 90% which is 9 times bigger. 

There might be a way to get your both requirements after all. If you select the Splitting method "Dispatch percentiles of sorted data" you can set a sort order and also a ratio for the split. So if you set the sort to be target_flg descending and set it to be 10% you will get 10% in Table A with target_flg = 1 and the remaining 90% will be in Table B. Make sure you create a new dataset for the remaining data (Output => Other dataset...). 

stephl
Level 2
Author

Thanks so much, @Turribeach 🙂

I tried it out.

Since the "target_flg = 1" is not always exact the 10% of the whole dataset, with this method, it seems Dataset A inevitably contains "target_flg = 0"....

 

Or is there any way to split the data to meet this requirement when building the model?

0 Kudos