Missing ID in partitioned group by

suard_raphaelle · 03-13-2018

Hi,

I've got a partitioned dataset with IDs in one column.

The dataset records transactions, and it may well be that some IDs do not appear in every partition.

I then want to group this dataset and sum the transaction column.

Could you please confirm that when I do so, I'm not going to "lose" any of the IDs along the way?

For instance, let's take the example below

partition 1

ID | Transaction
1 | 100
2 | 200

partition 2

ID | Transaction
2 | 300

Is the result the following table?

ID | Sum(transaction)
1 | 100
2 | 500

The reason I'm asking is that when I sample the partitions to take a look at them, I always take the first records. However, some of the IDs that I can see in some partitions do not show up in the output table (i.e. the output of the Group recipe). So I'm a bit worried I might have lost some data along the way.

Thank you very much for your help!

Best regards,

Alex_Combessie · 03-13-2018

Hello,

All visual recipes in DSS operate on the full data, so without sampling. The sampling we apply when you visualize a dataset or work on a visual "Prepare" recipe is only for you to be able to prototype and understand your data quickly. But when you actually run a recipe, it is applied to the full data.
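To illustrate why a preview can hide IDs that are still present in the full data, here is a minimal plain-Python sketch. It is not DSS code: the rows, the `first_records` helper, and the sample size are all hypothetical, chosen only to mimic a "first records" sample.

```python
# Plain-Python illustration only; this is not how DSS is implemented.
# Hypothetical rows (ID, transaction), listed in storage order.
rows = [(1, 100), (2, 200), (2, 300), (3, 50)]


def first_records(rows, n):
    """Mimic a 'first records' preview sample of size n."""
    return rows[:n]


# IDs visible when looking only at the first two records.
sample_ids = {row_id for row_id, _ in first_records(rows, 2)}

# IDs visible when the full data is processed, as a recipe run does.
all_ids = {row_id for row_id, _ in rows}

# IDs that exist in the data but never show up in the preview.
hidden_ids = all_ids - sample_ids
print(hidden_ids)
```

An ID in `hidden_ids` has not been lost; it simply falls outside the sampled rows.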

Hence, if you create a "Group" recipe, taking the sum of "transactions" by "ID", it will perform what you want. From your example, I understand that you want the output of this group recipe to be "non-partitioned". Make sure you select this option when creating the output dataset.

Then, make sure the partition dependency setting is "All available" in the Input/Output tab of your Group recipe.
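As a rough sketch of what a non-partitioned output with "All available" amounts to, the plain-Python snippet below pools the rows of every partition from the example in the question and then sums by ID. The `partitions` dict and the `sum_by_id` helper are hypothetical names for illustration, not part of DSS.

```python
from collections import defaultdict

# The two partitions from the example in the question, as (ID, transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id(rows):
    """Sum the transaction amounts for each ID."""
    totals = defaultdict(int)
    for row_id, amount in rows:
        totals[row_id] += amount
    return dict(totals)


# Pool the rows of all partitions first, then group once.
all_rows = [row for rows in partitions.values() for row in rows]
result = sum_by_id(all_rows)
print(result)  # {1: 100, 2: 500}
```

Every ID that appears in at least one partition ends up in the result, which matches the table expected in the question.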

Note that if you wanted to compute the sum of transactions by ID separately on each partition, you could do that with a partitioned output and the "Equals" partition dependency setting in the same Input/Output tab. If you plan to work with partitions in DSS, I encourage you to read: https://doc.dataiku.com/dss/latest/concepts/index.html.
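For contrast, the per-partition variant can be sketched in plain Python as one grouping per partition, with no pooling. Again, this only illustrates the idea; the names `partitions`, `sum_by_id`, and `per_partition` are hypothetical and nothing here calls DSS.

```python
from collections import defaultdict

# Same example data as in the question, as (ID, transaction) rows.
partitions = {
    "partition 1": [(1, 100), (2, 200)],
    "partition 2": [(2, 300)],
}


def sum_by_id(rows):
    """Sum the transaction amounts for each ID."""
    totals = defaultdict(int)
    for row_id, amount in rows:
        totals[row_id] += amount
    return dict(totals)


# Each output partition is computed only from the matching input partition.
per_partition = {name: sum_by_id(rows) for name, rows in partitions.items()}
print(per_partition)
```

Here ID 2 keeps separate totals of 200 and 300, and ID 1 appears only in the output for partition 1.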

Hope it helps,

Alex

suard_raphaelle · 03-13-2018

Great, thanks a lot for this prompt answer 🙂

Labels

Datasets

Partitioning

Sampling
