Missing ID in partitioned group by

Solved!
suard_raphaelle
Level 1

Hi,



I've got a partitioned dataset with IDs in one column.

The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions.



I then want to group this dataset and sum the transaction column.

Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?



For instance, let's take the example below



partition 1

ID | Transaction
1 | 100
2 | 200

partition 2

ID | Transaction
2 | 300



Is the result the following table?



ID | Sum(transaction)
1 | 100
2 | 500



The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions do not show up in the output table (i.e. the output of the Group recipe). So I'm a bit worried that I might have lost some data along the way.



Thank you very much for your help!



Best regards,

1 Solution
Alex_Combessie
Dataiker Alumni

Hello,



All visual recipes in DSS operate on the full data, so without sampling. The sampling we apply when you visualize a dataset or work on a visual "Prepare" recipe is only for you to be able to prototype and understand your data quickly. But when you actually run a recipe, it is applied to the full data.
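To make the sampling point concrete, here is a minimal sketch in plain Python (not DSS code). The rows and the sample size are made up for illustration; the only point is that a "first records" sample can miss IDs that exist in the full data.

```python
# Hypothetical (ID, transaction) rows; ID 3 only appears late in the data.
rows = [(1, 100), (2, 200), (2, 300), (1, 50), (3, 75)]

# Assumed sample size for this toy example (real samples are much larger).
SAMPLE_SIZE = 3

# IDs visible when looking only at the first records.
sample_ids = {row_id for row_id, _ in rows[:SAMPLE_SIZE]}

# IDs present in the full data, which is what a recipe run processes.
full_ids = {row_id for row_id, _ in rows}

# IDs that exist in the data but are invisible in the sample.
ids_outside_sample = full_ids - sample_ids
print("IDs not visible in the sample:", ids_outside_sample)
```

So an ID that you see in the sample of one dataset but not in the sample of another is not, by itself, a sign that data was lost.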



Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.
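As a rough illustration of the expected result, the example from the question can be reproduced in plain Python. This is only a sketch of the logic, not what DSS runs internally; the dictionary layout and the helper name sum_by_id are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two partitions in the question,
# stored as lists of (ID, transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id(rows):
    """Sum the transaction amounts per ID over an iterable of (ID, amount) rows."""
    totals = defaultdict(int)
    for row_id, amount in rows:
        totals[row_id] += amount
    return dict(totals)


# Group once over every row of every partition (no sampling).
all_rows = [row for rows in partitions.values() for row in rows]
totals = sum_by_id(all_rows)

# IDs present in the input but absent from the grouped output.
missing_ids = {row_id for row_id, _ in all_rows} - set(totals)

print(totals)
print("missing IDs:", missing_ids)
```

A similar check (compare the distinct IDs of the input with those of the output) can be run on the real datasets if you want to confirm that nothing was dropped.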





Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.





Note that if you instead wanted to sum the transactions by ID separately for each partition, you could do that with a partitioned output and the "Equals" partition dependency setting in the same Input/Output tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.
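For comparison, the per-partition variant can be sketched in plain Python as well. Again, this only mirrors the logic on the example data from the question; the names are hypothetical and this is not DSS code.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two partitions in the question,
# stored as lists of (ID, transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id(rows):
    """Sum the transaction amounts per ID over an iterable of (ID, amount) rows."""
    totals = defaultdict(int)
    for row_id, amount in rows:
        totals[row_id] += amount
    return dict(totals)


# One grouped result per input partition: each output partition is
# computed only from the input partition with the same name.
per_partition = {name: sum_by_id(rows) for name, rows in partitions.items()}

for name, totals in per_partition.items():
    print(name, totals)
```

Here ID 2 appears once per partition with its own subtotal (200 and 300) instead of a single total of 500, and ID 1 is absent from the second partition's result simply because it has no rows there.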



Hope it helps,



Alex

suard_raphaelle
Level 1
Author
Great, thanks a lot for this prompt answer 🙂