Is it somehow possible to oversample my dataset?

For example, I have the following records and target variable (features | target):

1 2 3 | 5
2 2 3 | 6
1 1 1 | 1
3 2 2 | 5

I want to duplicate row #3 (once or several times) so that my dataset looks as follows:
1 2 3 | 5
2 2 3 | 6
1 1 1 | 1
1 1 1 | 1
3 2 2 | 5

How can I do this?
Thank you in advance!
asked by Sergey
It's not very clear from your example which criterion you want to oversample on. Is it based on the counts in the target variable?
Strictly speaking, yes. It is based on the counts in the target. When I have many records whose target lies between 5 and 6 and only a small number of records outside this range, I need to oversample the training set to make it more balanced. This matters especially when I need to find the reasons that lead to rare target values (in this case, 1).

1 Answer

DSS does not have a built-in oversampling mechanism.

DSS has a "class rebalancing" sampling method. You could use it, either for the Explore / Prepare view, as dataset sampling in machine learning, or in a dedicated sampling recipe that will give you more balanced data.

However, this "class rebalancing" sampling method only undersamples, it never oversamples. It is also best suited for columns with reasonably low cardinality.

At the moment, if you want to oversample some rows, the best way is to use a Python recipe (assuming your dataset fits in memory), or otherwise a PySpark/SparkR/Spark-Scala recipe.
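
As a rough illustration of the Python route, here is a minimal pandas sketch that duplicates rows of rare target values until each value appears at least a given number of times. The function name `oversample_rare_targets`, the `min_count` parameter, and the column names `a`, `b`, `c`, `target` are all made up for this example; in a real Python recipe the DataFrame would come from the recipe's input dataset and be written to its output dataset, which is not shown here. It also assumes the target takes a small number of distinct values; a continuous target would need to be binned first.

```python
import pandas as pd


def oversample_rare_targets(df, target_col, min_count, random_state=0):
    """Return df plus extra copies of rows whose target value is rare.

    Hypothetical helper: every distinct target value ends up with at
    least `min_count` rows. Original rows are kept unchanged; the extra
    rows are drawn with replacement from the rare group.
    """
    parts = [df]
    for _, group in df.groupby(target_col):
        shortfall = min_count - len(group)
        if shortfall > 0:
            # Sample with replacement so a group of size 1 can still grow.
            extra = group.sample(n=shortfall, replace=True,
                                 random_state=random_state)
            parts.append(extra)
    return pd.concat(parts, ignore_index=True)


# The four records from the question (column names are assumptions).
df = pd.DataFrame({
    "a": [1, 2, 1, 3],
    "b": [2, 2, 1, 2],
    "c": [3, 3, 1, 2],
    "target": [5, 6, 1, 5],
})

balanced = oversample_rare_targets(df, "target", min_count=2)
print(balanced)
```

With `min_count=2`, the single rows with targets 1 and 6 are each duplicated once, while target 5 (already present twice) is left alone. Note that the duplicates are appended at the end rather than placed next to the original row, and that oversampling should be applied to the training set only, so that duplicated rows do not leak into the test set.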