Coming soon: We’re working on a brand new, revamped Community experience. Want to receive updates? Sign up now!

0 votes
I have a dataset too big to fit in memory, so I want to down sample it.
But the two classes to predict are unbalanced: there are many more 0's than 1's in the target column.
by

1 Answer

0 votes
Best answer

We do not have a direct recipe to do so.
The fastest way is probably as a SQL recipe. For instance in Hive:
 

SELECT * FROM train_set WHERE target = 1
UNION ALL
SELECT * FROM (SELECT * FROM train_set WHERE target = 0 ORDER BY rand() LIMIT 1000000) foo

 

by
1,339 questions
1,365 answers
1,557 comments
11,916 users

©Dataiku 2012-2018 - Privacy Policy