Building discrete time-partitioned datasets

JBR · ‎09-07-2017

Hello,

I have an input dataset built from files that are created weekly. Each file is named after the day it was created:

2017-01-01.csv

2017-01-08.csv

2017-01-15.csv

etc.

I have therefore partitioned this dataset with time-based partitioning, using a "day" period.

QUESTION 1: When I apply a recipe to this dataset and run it, it tries to build every day in the selected range, i.e. 2017-01-01 runs successfully, but 2017-01-02, 2017-01-03, ..., 2017-01-07 fail. It's not a major problem, since the recipe keeps running until the end, but because the global status of the run is "failed", it's not optimal for scheduling and reporting. Is there a way around this?

QUESTION 2: Since my initial dataset is quite heavy and grows every week, what I want to do is build it in full once, and then have a weekly scheduler build only the newly created partitions and add them to my output dataset. Having read the documentation, my understanding is that the way to do this would be to build all my recipes and partitions first (to create my initial up-to-date dataset), and then to edit each recipe to run only for a "D-7" time range, with the "append instead of overwrite" option checked. This is not my preferred option, since a global rebuild of the data (for instance if a recipe is modified) would require re-editing all the recipes to restart the whole process. Is there a different way to do this?

Thanks in advance, and sorry for the long post. 🙂

Julien

AlexT · ‎07-30-2023

Hi @JBR,

1) If your partitions have "gaps", you can use the "Missing partitions as empty" option.

2) Indeed, if there are major changes in the schema or recipes, you would need to rebuild all partitions.
With the option from 1, it should ignore any gaps in your time-based partitions. I don't understand exactly why you want to append. Each partition should be built independently, so if your flow is partitioned, any new partitions you build are inherently added to the output.

To keep the most recent days up to date, you would simply set all partition dependencies to equal and, in the scenario, build PREVIOUS_DAY and CURRENT_DAY, running it daily for example. If you need to build the last 7 days in a single run, you would need to use variables that you set, or use code.
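As a rough illustration of the "use code" route, here is a minimal Python sketch that only computes the day-partition identifiers for a 7-day window and, optionally, keeps the days that match the weekly file cadence from the question. The function names are hypothetical, no Dataiku API is called, and passing the result to a scenario as a comma-separated list (for example through a variable) is an assumption you should check against the documentation for your version.

```python
from datetime import date, timedelta


def last_n_day_partitions(today: date, n: int = 7) -> list[str]:
    """Return the YYYY-MM-DD identifiers for the n days ending at `today`, oldest first."""
    return [(today - timedelta(days=offset)).isoformat() for offset in range(n - 1, -1, -1)]


def existing_weekly_partitions(first_file_day: date, candidates: list[str]) -> list[str]:
    """Keep only the days that fall on the weekly cadence (every 7 days from the first file)."""
    return [
        day for day in candidates
        if (date.fromisoformat(day) - first_file_day).days % 7 == 0
    ]


if __name__ == "__main__":
    # Example dates taken from the file names in the question.
    window = last_n_day_partitions(date(2017, 1, 15), n=7)
    # Assumption: a comma-separated string is how the list would be handed to the build step.
    print(",".join(window))
    print(existing_weekly_partitions(date(2017, 1, 1), window))
```

With weekly files, filtering the window this way yields at most one partition per run, which may be an alternative to building the empty days at all.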

Thanks,

Labels

Datasets

Partitioning
