Is there a way to conditionally delete duplicates based on some key?

UserBird · ‎11-27-2015

Hello,

I have some duplicated rows on some key (in this case a phone number). At the very least, I wanted to flag all but the most recent. In an ideal situation, I wanted to flag the rest based on several conditions.

Is this possible in DSS out of the box? If not, what would be the appropriate steps to take?

Thank you for your support.

PS: I'm fairly new to DSS hence the question.

AdrienL · ‎11-27-2015

Hi,

One way is the window recipe:

In the window definition, you partition by phone number and order by descending date

In the aggregation, you select Row number in the top "compute rank" part

In the post-filter, you add a filter to Keep only the rows that satisfies the condition rownumber == 1

In the post filter, you can add other conditions at will. In the output tab, you can remove the rownumber column if yo don't need it afterwards.

View solution in original post

AdrienL · ‎11-27-2015

Hi,

One way is the window recipe:

In the window definition, you partition by phone number and order by descending date

In the aggregation, you select Row number in the top "compute rank" part

In the post-filter, you add a filter to Keep only the rows that satisfies the condition rownumber == 1

In the post filter, you can add other conditions at will. In the output tab, you can remove the rownumber column if yo don't need it afterwards.

UserBird · ‎11-28-2015

Thanks. I'll give this a try later today

UserBird · ‎11-28-2015

Hi Adrian,

How do I get rid of the rank column in the output section? I'm currently using DSS 2.2.

Thanks

AdrienL · ‎11-28-2015

Err, my bad, I thought you could unselect some columns. This is not currently possible directly in the recipe.

You can either convert it to an SQL recipe and alter it directly if you're the coding type, else you'll have to append a preparation recipe where you delete that column.

Or, if you work directly on the resulting dataset with an Analysis (see DSS concepts: http://learn.dataiku.com/howto/getting-started/dss-concepts/universes-and-concepts.html ) (e.g. to build a model, make charts, etc.), dropping the column could be the first step in you analysis' preparation script.

UserBird · ‎11-28-2015

I had a feeling :). Thanks again.

jrouquie · ‎11-27-2015

Depending on your use case, there are several possibilities. For a start, I would use a visual “group recipe”. It allows to keep only the line with the most recent timestamp.

- group by phone number

- in the recipe settings, on the line “phone number”, click “last, ordered by: <column containing time>”:

If you need to keep all lines, once you have selected the most recent phone number into a new dataset, you could also join this new with the original one: any line of the original dataset with a match in the new dataset is a correct phone number, any other line is to be flagged.

If you need to keep all lines, a window recipe should also do the trick, but this is more advanced.

UserBird · ‎11-28-2015

Thanks. I'll give this a try later today

Is there a way to conditionally delete duplicates based on some key?

Is there a way to conditionally delete duplicates based on some key?

Labels

Visual recipes

Sign up to take part

Is there a way to conditionally delete duplicates based on some key?

Is there a way to conditionally delete duplicates based on some key?

Labels

Visual recipes