I have a sync recipe with one partitioned dataset in input and one partitioned dataset in output. Partitioning is by hour.

The input dataset receives new data continuously. Today I build the output manually by selecting the new dates, using the "append instead of overwrite" option.

This is obviously not optimal, as it involves manual intervention.

What would be a solution to sync only the partitions from the input that are not yet in the output? (other than job scheduling, which could be too costly)
Hi Alexandre,

The "append instead of overwrite" option has been requested by some clients, but it is meant for very specific uses. In particular, it means that if you run a recipe twice while the input has not changed, the output of the recipe will appear twice in the output dataset. This is probably not what you need.

What would be too costly in scheduling a job in Administration → scheduler?
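
To illustrate the caveat about "append instead of overwrite", here is a minimal sketch in plain Python. It does not use DSS at all: the `build_partition` helper, the dictionary standing in for the output dataset, and the partition identifier `2018-01-01-13` are all hypothetical and only model the behaviour described above.

```python
def build_partition(store, partition, rows, append=False):
    """Write rows into store[partition].

    With append=True the existing rows are kept and the new ones are added
    after them; with append=False the partition is replaced.
    """
    if append:
        store.setdefault(partition, []).extend(rows)
    else:
        store[partition] = list(rows)
    return store


rows = ["event-a", "event-b"]  # the input has not changed between runs

# Append mode: running the same build twice duplicates the rows.
appended = {}
build_partition(appended, "2018-01-01-13", rows, append=True)
build_partition(appended, "2018-01-01-13", rows, append=True)

# Overwrite mode: running the same build twice leaves one copy.
overwritten = {}
build_partition(overwritten, "2018-01-01-13", rows)
build_partition(overwritten, "2018-01-01-13", rows)
```

In this toy model the appended partition ends up with four rows and the overwritten one with two, which is why appending is only safe when each partition is built exactly once.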
The "append instead of overwrite" option is really useful in my "streaming" use case. It would be great to be able to select automatically which partitions to sync. Is this possible?

Scheduling could be costly because it is actually not just a sync but a preparation recipe, which runs very slowly on my server. I will try it on a test dataset, but it is not my preferred option.
So you have a preparation recipe that is costly to run. But I don't understand how running this recipe from the DSS scheduler could be more costly than launching it manually. (Note that you can specify which partitions to build in the scheduler.)
Actually I already have 5 scheduled sync recipes, so the idea would be to add 5 scheduled preparation recipes. I was afraid it would be too much for my server, but I just tried and it seems to work fine. Thanks!
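
For readers who still want to select the missing partitions automatically, the selection itself is a set difference between the input and output partition lists. The sketch below is plain Python and makes several assumptions: the function name `partitions_to_build` is made up for illustration, the hourly identifiers are assumed to be zero-padded strings of the form `YYYY-MM-DD-HH`, and the two lists are hard-coded rather than read from DSS.

```python
def partitions_to_build(input_partitions, output_partitions):
    """Return the partitions present in the input but not in the output.

    Identifiers are assumed to be zero-padded 'YYYY-MM-DD-HH' strings, so
    sorting them as text also sorts them chronologically.
    """
    missing = set(input_partitions) - set(output_partitions)
    return sorted(missing)


# Hypothetical partition lists; in practice they would come from the datasets.
input_partitions = ["2018-01-01-10", "2018-01-01-11", "2018-01-01-12"]
output_partitions = ["2018-01-01-10"]

todo = partitions_to_build(input_partitions, output_partitions)
print(todo)
```

Because each missing partition is built exactly once, this approach would also work with overwrite mode and avoids the duplication risk of appending. How the lists are obtained and how the build is triggered depend on your DSS version; the DSS Python API may expose a partition listing call on datasets, but check the documentation for your release before relying on it.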
