+1 vote
Hello,

I have some duplicated rows on some key (in this case a phone number). At the very least, I wanted to flag all but the most recent. In an ideal situation, I wanted to flag the rest based on several conditions.

Is this possible in DSS out of the box? If not, what would be the appropriate steps to take?

Thank you for your support.

PS: I'm fairly new to DSS hence the question.
by

2 Answers

+3 votes
Best answer

Hi,

One way is the window recipe:

  • In the window definition, you partition by phone number and order by descending date
  • In the aggregation, you select Row number in the top "compute rank" part
  • In the post-filter, you add a filter to Keep only the rows that satisfies the condition rownumber == 1

In the post filter, you can add other conditions at will. In the output tab, you can remove the rownumber column if yo don't need it afterwards.

by
selected by
Thanks. I'll give this a try later today
Hi Adrian,

How do I get rid of the rank column in the output section? I'm currently using DSS 2.2.

Thanks
Err, my bad, I thought you could unselect some columns. This is not currently possible directly in the recipe.

You can either convert it to an SQL recipe and alter it directly if you're the coding type, else you'll have to append a preparation recipe where you delete that column.

Or, if you work directly on the resulting dataset with an Analysis (see DSS concepts: http://learn.dataiku.com/howto/getting-started/dss-concepts/universes-and-concepts.html ) (e.g. to build a model, make charts, etc.), dropping the column could be the first step in you analysis' preparation script.
I had a feeling :). Thanks again.
+1 vote

Depending on your use case, there are several possibilities. For a start, I would use a visual “group recipe”. It allows to keep only the line with the most recent timestamp.

- group by phone number
- in the recipe settings, on the line “phone number”, click “last, ordered by: <column containing time>”:

 

 

If you need to keep all lines, once you have selected the most recent phone number into a new dataset, you could also join this new with the original one: any line of the original dataset with a match in the new dataset is a correct phone number, any other line is to be flagged.

 

 

If you need to keep all lines, a window recipe should also do the trick, but this is more advanced.

by
Thanks. I'll give this a try later today
1,188 questions
1,220 answers
1,363 comments
11,751 users

©Dataiku 2012-2018 - Privacy Policy