
Hi,

I've got a partitioned dataset with IDs in one column.
The dataset registers some transactions: it may well be that some IDs do not appear in all the transactions.

I then want to group this dataset and sum the transaction column.
Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?

For instance, let's take the example below

partition 1

ID | Transaction
1 | 100
2 | 200

partition 2

ID | Transaction
2 | 300

Is the result the following table?

ID | Sum(transaction)
1 | 100
2 | 500

The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions do not show up in the output table (i.e. the output of the Group recipe). So I'm a bit worried I might have lost some data along the way.

Thank you very much for your help!

Best regards,


1 Answer

Best answer

Hello,

All visual recipes in DSS operate on the full data, without sampling. The sampling we apply when you visualize a dataset or work on a visual "Prepare" recipe is only there so that you can prototype and understand your data quickly. When you actually run a recipe, it is applied to the full data.
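To illustrate why an ID can seem to be missing when you only look at the first records, here is a minimal sketch in plain Python. It is not DSS code: `full_output`, `SAMPLE_SIZE` and the values are made up for the illustration, and the sample size is arbitrary.

```python
from itertools import islice

# Hypothetical full output of a group-by: one (ID, sum) row per ID.
full_output = [(row_id, row_id * 10) for row_id in range(1, 1001)]

# A preview that only keeps the first records (arbitrary sample size).
SAMPLE_SIZE = 100
preview = list(islice(full_output, SAMPLE_SIZE))

ids_in_preview = {row_id for row_id, _ in preview}
ids_in_full = {row_id for row_id, _ in full_output}

# ID 950 is absent from the preview but present in the full data.
print(950 in ids_in_preview)  # False
print(950 in ids_in_full)     # True
```

In other words, not seeing an ID in a "first records" sample does not by itself mean the ID was dropped from the full dataset.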

Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.
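To make the expected result concrete, here is a minimal sketch in plain Python of a group-by sum computed over all partitions, using the example from the question. It is not DSS code: the `partitions` dictionary and the `sum_by_id` helper are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for the partitioned dataset from the question:
# partition name -> list of (ID, transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id(rows):
    """Sum the transaction amounts per ID over an iterable of (ID, amount) rows."""
    totals = defaultdict(int)
    for row_id, amount in rows:
        totals[row_id] += amount
    return dict(totals)


# Read every row of every partition (no sampling), then group by ID.
all_rows = [row for rows in partitions.values() for row in rows]
result = sum_by_id(all_rows)
print(result)  # {1: 100, 2: 500}
```

Every ID that appears in at least one partition ends up in the result, including ID 1, which only appears in the first partition.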

Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.

Note that if you wanted to compute the sum of transactions by ID separately on each partition instead, you could do that with a partitioned output and the "Equals" partition dependency setting in the same Input/Output tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.
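For comparison, here is a rough sketch in plain Python of what a per-partition sum would produce on the same example. Again, this is not DSS code, and `partitions` and `sum_by_id_per_partition` are hypothetical names for illustration only.

```python
from collections import defaultdict

# Hypothetical stand-in for the partitioned dataset from the question.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id_per_partition(partitions):
    """Compute the sum of transactions by ID independently for each partition."""
    output = {}
    for name, rows in partitions.items():
        totals = defaultdict(int)
        for row_id, amount in rows:
            totals[row_id] += amount
        output[name] = dict(totals)
    return output


per_partition = sum_by_id_per_partition(partitions)
print(per_partition)
# {'partition 1': {1: 100, 2: 200}, 'partition 2': {2: 300}}
```

Here ID 2 keeps a separate total in each partition (200 and 300) instead of a single total of 500, which is why the non-partitioned output is the one that matches the table in the question.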

Hope it helps,

Alex

Great, thanks a lot for this prompt answer :)