
Solved!
dss-slb
Level 2
What is the maximum number of scenarios allowed in a project?

Hi all,

  Questions:

  1. Is there any limit on the number of scenarios allowed in a project?
  2. Is there any limit on how many scenarios can run simultaneously?

  Thanks 

   

7 Replies
SarinaS
Dataiker

Hi @dss-slb,

There isn't a limit to how many scenarios you can create within a project. In terms of concurrency, this will depend on the activity concurrency limits defined under Administration > Resources Control:
https://doc.dataiku.com/dss/latest/flow/limits.html  

Note that an individual scenario can only have one run in progress at a time (i.e. the same scenario cannot be run concurrently). 
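If you ever need to check this from code, here is a rough sketch using the Python API client (the host, API key, project key and scenario id below are placeholders) that looks at the latest run before triggering a new one:

    import dataikuapi

    # Placeholder host, API key, project key and scenario id -- adjust for your instance
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    scenario = client.get_project("MY_PROJECT").get_scenario("MY_SCENARIO")

    # A scenario only supports one run at a time, so look at its most recent run first
    last_runs = scenario.get_last_runs(limit=1, only_finished_runs=False)
    if last_runs and last_runs[0].running:
        print("Scenario already has a run in progress, not triggering another one")
    else:
        run = scenario.run_and_wait()
        print("Run finished with outcome: %s" % run.outcome)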

Thank you,
Sarina 

dss-slb
Level 2
Author

Hi @SarinaS 

  Thanks for the clarification.

I have a couple of follow-up questions on scenarios.

If I duplicate a scenario, can I run the original and the duplicated scenario concurrently?

Is there an API in the Dataiku Python SDK to copy/duplicate a scenario?

  Thanks


Turribeach

@dss-slb wrote:

If I duplicate a scenario, can I run the original and the duplicated scenario concurrently?


It depends on what the scenario does. Same flow, no. Have a read of this response I recently wrote, since it covers many of the questions you may have: https://community.dataiku.com/t5/Using-Dataiku/Help-with-Application-setup/m-p/42787/highlight/true#...


@dss-slb wrote:

Is there an API in the Dataiku Python SDK to copy/duplicate a scenario?

There is no copy/duplicate API but you can easily get the settings of one scenario:

https://developer.dataiku.com/latest/api-reference/python/scenarios.html#dataikuapi.dss.scenario.DSS...

and pass it using the definition= parameter in create_scenario:

https://developer.dataiku.com/latest/api-reference/python/projects.html#dataikuapi.dss.project.DSSPr...
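Putting those two calls together, a minimal sketch could look like this (the host, API key, project key and scenario names are placeholders):

    import dataikuapi

    # Placeholder host, API key and identifiers -- adjust for your instance
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    project = client.get_project("MY_PROJECT")

    # Read the full JSON definition of the existing scenario
    original = project.get_scenario("ORIGINAL_SCENARIO")
    definition = original.get_settings().get_raw()

    # Let DSS assign a new id to the copy rather than reusing the original's
    definition.pop("id", None)

    # Create the copy from the same definition (step-based scenarios have type "step_based")
    copy = project.create_scenario("Copy of original scenario",
                                   definition.get("type", "step_based"),
                                   definition=definition)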

 

 

dss-slb
Level 2
Author

@Turribeach ,

  Thanks for the very detailed information, which answered several of my questions.

As you mentioned, 2 scenarios cannot be run concurrently if they are using the same flow.

I would like to know: suppose I have defined a flow in Zone A and then copied the zone to Zone B (with the "copy zone content" option checked). Now I define a scenario, SC_A, on the flow in Zone A, and another scenario, SC_B, on the flow in Zone B (which is a copy of Zone A).

Can I run both scenarios, SC_A and SC_B, concurrently?

Thanks in advance for your time and advice.


Turribeach

Can you please detail what your requirement is, rather than how you think you can achieve it?

dss-slb
Level 2
Author

@Turribeach 

I have an ETL workflow created in my Dataiku project's Flow: it has a source dataset, some intermediate datasets for data transformation, and finally writes to a target dataset. I am going to create a scenario to build the target dataset, triggered by an event when new data arrives at the source dataset. The connection settings for the target dataset are defined as scenario variables.

The requirements for my workflow are the following:

1. The workflow can write to a single target dataset only.

2. The workflow must be able to run on different target datasets and be run concurrently as soon as new source data is available.

3. It should be contained in a single Dataiku project and the project will be deployed to a new Dataiku instance using a Python script.


Turribeach

@dss-slb wrote:

@Turribeach 

I have an ETL workflow created in my Dataiku project's Flow: it has a source dataset, some intermediate datasets for data transformation, and finally writes to a target dataset. I am going to create a scenario to build the target dataset, triggered by an event when new data arrives at the source dataset. The connection settings for the target dataset are defined as scenario variables.

Most of what you have above are not requirements but descriptions of how you implemented them. From the above I can determine that you have input datasets of the same schema in different databases which need to be processed in a similar way, is this correct?


@dss-slb wrote:

1. The workflow can write to a single target dataset only.

This is not a requirement. In fact the next "requirement" contradicts it since it says:


@dss-slb wrote:

2. The workflow must be able to run on different target datasets and be run concurrently as soon as new source data is available.

The first part is a requirement, but the concurrent part is not a requirement, it's an implementation approach. Your requirement in this case could be that new data needs to be processed within 5 minutes of being available. If a non-concurrent solution achieves this, why make it more complicated than it needs to be? Why do you need concurrency?


@dss-slb wrote:

3. It should be contained in a single Dataiku project and the project will be deployed to a new Dataiku instance using a Python script.


Neither of these two statements is a requirement. Why does it need to be a single project? What requirement does that achieve? Deployment is not really something that should be covered as a requirement. The usual approach for Dataiku projects is to deploy to an Automation node via a project bundle. This can be automated with a Python script. 

Based on the additional information you have provided, I believe at this point you should be looking at Dataiku Applications. Dataiku Applications allow you to re-use flows within multiple projects and process data in a similar way. Any changes in the Dataiku Application that the projects are using will result in changes in the child projects. 
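As a rough sketch of what that could look like with the Python client (the host, API key, application id, target names and variable name below are placeholders, and it assumes your template project has been converted into a Dataiku Application), you would create one application instance per target and point it at its own connection, for example through project variables:

    import dataikuapi

    # Placeholder host, API key, application id and targets -- adjust for your setup
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    app = client.get_app("ETL_TEMPLATE")

    # One child project (application instance) per target dataset/database
    for target in ["TARGET_A", "TARGET_B"]:
        app.create_instance("ETL_" + target, "ETL for " + target, wait=True)
        instance = client.get_project("ETL_" + target)

        # Point each instance at its own target, e.g. via a project variable
        variables = instance.get_variables()
        variables["standard"]["target_connection"] = target
        instance.set_variables(variables)

Each instance's scenarios then run independently of the others, so each one can be triggered as soon as its own source data arrives.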

 

 
