Hello,
I have some rows that are duplicated on a key (in this case a phone number). At the very least, I want to flag all but the most recent row. Ideally, I would flag the rest based on several conditions.
Is this possible in DSS out of the box? If not, what would be the appropriate steps to take?
Thank you for your support.
PS: I'm fairly new to DSS hence the question.
Hi,
You can count duplicate phone numbers using a visual Group By recipe, then join the original dataset with the output of the Group By recipe.
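To make the count-then-join logic concrete, here is a minimal pandas sketch of the same idea. It only illustrates what the two visual recipes would compute. The column names (phone_number, name, phone_count, is_duplicated) and the sample rows are made up for the example.

```python
import pandas as pd

# Hypothetical sample data; in DSS this would be your input dataset.
df = pd.DataFrame({
    "phone_number": ["555-0100", "555-0100", "555-0199"],
    "name": ["A", "B", "C"],
})

# Equivalent of the Group By step: count rows per phone number.
counts = df.groupby("phone_number").size().reset_index(name="phone_count")

# Equivalent of the Join step: attach the count to every original row.
flagged = df.merge(counts, on="phone_number", how="left")

# A row belongs to a duplicate group when its key appears more than once.
flagged["is_duplicated"] = flagged["phone_count"] > 1
print(flagged)
```

Note that this flags every row in a duplicate group, including the one you may want to keep.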
You can also use a Python recipe with pandas duplicated(), for example:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
If you want to drop rows based on duplicates in a single column you can use drop_duplicates()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
df.drop_duplicates(subset=['phone_number'], keep='last')
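If you want to flag rather than drop, a sketch along these lines might work. It assumes there is a timestamp column that tells you which row is the most recent; the names updated_at and is_older_duplicate are hypothetical, as is the sample data. In a DSS Python recipe the DataFrame would come from your input dataset rather than being built by hand.

```python
import pandas as pd

# Hypothetical sample data with an assumed "updated_at" timestamp column.
df = pd.DataFrame({
    "phone_number": ["555-0100", "555-0100", "555-0199", "555-0100"],
    "updated_at": pd.to_datetime(
        ["2023-01-05", "2023-03-01", "2023-02-10", "2023-02-15"]
    ),
})

# Sort oldest to newest so the last occurrence of each key is the most recent.
df = df.sort_values("updated_at")

# duplicated(keep="last") marks every occurrence except the last one per key,
# which here means every row except the most recent for each phone number.
df["is_older_duplicate"] = df.duplicated(subset=["phone_number"], keep="last")

# Restore the original row order.
df = df.sort_index()
print(df)
```

Rows where is_older_duplicate is True are the ones drop_duplicates(keep='last') would have removed after the same sort.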
Not sure what you mean by flagging the rest by several conditions; can you elaborate a bit? There are several visual processors to flag rows:
As you dig further into this question, one thing you will likely discover is that duplicates can be tricky. You can definitely reduce the number of duplicates in your dataset, but it is often not possible to find and remove all of them. So it helps to have reasonable expectations about how complete the results will be.
In the case of phone numbers, there are often several correct ways to write the same number: with or without extensions, with or without international dialing codes, and with or without long-distance prefixes.
One way to improve your match rate is to standardize the fields before looking for duplicates. Dataiku does not have a tool built in to do this directly. However, Dataiku DSS does allow the use of libraries from languages like Python and R. So a more advanced approach to the problem I suspect you are trying to solve is to standardize the phone numbers before looking for dupes. There are lots of ways to do this, but the approach I'll often use is a library that someone else has written for this purpose. For phone numbers, something like the phonenumbers library in Python can help: https://pypi.org/project/phonenumbers/ .
Your subject references “my iPhone”. I'm wondering how the iPhone is involved in your question about duplicates. Can you share a bit more about that if it is important?