Is there a way to conditionally delete duplicates based on some key?

Solved!
UserBird
Dataiker
Is there a way to conditionally delete duplicates based on some key?
Hello,

I have some duplicated rows on some key (in this case a phone number). At the very least, I wanted to flag all but the most recent. In an ideal situation, I wanted to flag the rest based on several conditions.

Is this possible in DSS out of the box? If not, what would be the appropriate steps to take?

Thank you for your support.

PS: I'm fairly new to DSS hence the question.
1 Solution
AdrienL
Dataiker

Hi,



One way is the window recipe:




  • In the window definition, you partition by phone number and order by descending date

  • In the aggregation, you select Row number in the top "compute rank" part

  • In the post-filter, you add a filter to Keep only the rows that satisfies the condition rownumber == 1



In the post filter, you can add other conditions at will. In the output tab, you can remove the rownumber column if yo don't need it afterwards.

View solution in original post

7 Replies
AdrienL
Dataiker

Hi,



One way is the window recipe:




  • In the window definition, you partition by phone number and order by descending date

  • In the aggregation, you select Row number in the top "compute rank" part

  • In the post-filter, you add a filter to Keep only the rows that satisfies the condition rownumber == 1



In the post filter, you can add other conditions at will. In the output tab, you can remove the rownumber column if yo don't need it afterwards.

UserBird
Dataiker
Author
Thanks. I'll give this a try later today
0 Kudos
UserBird
Dataiker
Author
Hi Adrian,

How do I get rid of the rank column in the output section? I'm currently using DSS 2.2.

Thanks
0 Kudos
AdrienL
Dataiker
Err, my bad, I thought you could unselect some columns. This is not currently possible directly in the recipe.

You can either convert it to an SQL recipe and alter it directly if you're the coding type, else you'll have to append a preparation recipe where you delete that column.

Or, if you work directly on the resulting dataset with an Analysis (see DSS concepts: http://learn.dataiku.com/howto/getting-started/dss-concepts/universes-and-concepts.html ) (e.g. to build a model, make charts, etc.), dropping the column could be the first step in you analysis' preparation script.
0 Kudos
UserBird
Dataiker
Author
I had a feeling :). Thanks again.
0 Kudos
jrouquie
Dataiker Alumni

Depending on your use case, there are several possibilities. For a start, I would use a visual โ€œgroup recipeโ€. It allows to keep only the line with the most recent timestamp.



- group by phone number

- in the recipe settings, on the line โ€œphone numberโ€, click โ€œlast, ordered by: <column containing time>โ€:



 



 



If you need to keep all lines, once you have selected the most recent phone number into a new dataset, you could also join this new with the original one: any line of the original dataset with a match in the new dataset is a correct phone number, any other line is to be flagged.



 



 



If you need to keep all lines, a window recipe should also do the trick, but this is more advanced.

UserBird
Dataiker
Author
Thanks. I'll give this a try later today
0 Kudos

Labels

?
Labels (1)
A banner prompting to get Dataiku