Sign up to take part
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
Registered users can ask their own questions, contribute to discussions, and be part of the Community!
I have a dataset. Four of the columns are Location Number, Employee ID, Date, and Symbol. I would like to create a 5th column that contains counts of how many times a row with matching data in all four columns appears. For example, there are 5 rows with Location Number=308, Employee ID=37, Date=03/23/2007 and Symbol=TP. The 5th column therefore is "5".
Any suggestions on how I could go about doing this?
Thanks in advance!
@davidkubica1 I think you should be able to accomplish this using a GroupBy Recipe followed by a join Recipe.
1. Group your dataset with the visual group by recipe on columns: Location Number, Employee ID, Date, and Symbol. Use "count" as the aggregation metric. This should result in a dataset that has the number of times each combination of those values occurs (with one row per unique combination of your group by keys in the output).
2. Left join the output of the GroupBy with your original dataset.
I think that should give you what you're after.
I'm not 100% knowledgeable about how the group by/joins work when using dates. Depending on if your original date column is an actual date (or a date represented as a string) you may have to convert those to string before step 1 and then can convert back to Date after step 2.
-Tim
great point about dates. Before doing the group by you may need to trim any time component from any date time stamps you are using in the group by.
Welcome to the Dataiku community.
I'm not at my desktop computer at the moment, so I canโt send along any pictures or anything.
However, you might check out the group by visual recipe.
There is an option where you can select the 4 columns you want to โgroup byโ.
Then select the check box that will tell you the number of time the set of 4 columns appears.
Run the recipe. This will give you fewer rows of data than was in your original dataset. One for each unique set of 4 columns and the count of those occurrences.
If you need to have all of the original rows with this added as an extra column. You should be able to use a join recipient to connect this summary count back to each of the original records.
There may be other ways to get this done more efficiently. Others please feel free to jump in with other ideas.
Please let us know how you get on with this and welcome to the community.