0 votes

Please see the workflow above. I would like to apply the lower part of the workflow, which is currently applied to only one file, to all files in a directory, and then concatenate all the results. How can I do this with Dataiku?

asked by
I was thinking of writing a Python script that reads all the files into one big file and then simply replaces the testdata_stream dataset with this huge file.
If this is possible, I guess it means all your files have the same schema?
In that case you can also use a stack recipe and split your results at the end.

If you have a large number of files and don't want to see them all in your flow, the Python recipe is the way to go, with a folder as input!
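For illustration, here is a minimal sketch of such a recipe, assuming the managed folder input is named input_files, the files are CSVs sharing one schema, and the output dataset is named testdata_stream (all names are placeholders to adapt to your own flow):

```python
# Minimal sketch of a Python recipe: concatenate every CSV in a managed
# folder into a single output dataset. Folder and dataset names are
# placeholders -- adapt them to your own flow.
import dataiku
import pandas as pd

input_folder = dataiku.Folder("input_files")         # managed folder input
output_dataset = dataiku.Dataset("testdata_stream")  # dataset output

frames = []
for path in input_folder.list_paths_in_partition():
    # Each file is assumed to be a CSV with the same schema
    with input_folder.get_download_stream(path) as stream:
        frames.append(pd.read_csv(stream))

# Stack everything and write it out with the inferred schema
output_dataset.write_with_schema(pd.concat(frames, ignore_index=True))
```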
What if I want to use the lower part of the workflow as a prediction API endpoint? How can I bundle all of this into the endpoint without having to rewrite everything in pure Python?
You can only deploy models to an API endpoint, but that includes the preparation script of the visual analysis (Lab) that was used to create the model. So if you can merge all the steps into a single analysis, you could deploy it, though scoring may take a bit longer because of the prepare steps.

The real question is probably about what you will want to query the API with. If the data needs preparation, you need to include all preparation in the deployed model.


Note that if you're looking to refresh parts of the flow, e.g. on a daily basis, what you need is an automation node.
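For reference, once the model (with its preparation steps) is deployed on an API node, a single row can be scored through the Python client roughly like this; the host, service id, endpoint id and feature names below are placeholders, not values from your project:

```python
# Rough sketch of querying a prediction endpoint on the API node with one
# record. Host, service id, endpoint id and feature names are hypothetical.
from dataikuapi import APINodeClient

client = APINodeClient("https://my-api-node:12000", "my_service")
record = {"feature_a": 42, "feature_b": "some_value"}  # one row of the static input file

prediction = client.predict_record("my_endpoint", record)
print(prediction)  # returned structure contains the prediction and probabilities
```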
Yes, I definitely need the preprocessing steps, since I want to query the streaming API with a row of the current static input file.

So how can I merge "all steps into a single analysis"?
So I guess an automation node would be another option: whenever a new file appears in the filesystem, trigger the whole prediction flow. What is the best way to implement this with DSS?
Is it this link:
https://doc.dataiku.com/dss/latest/bundles/index.html
or this link:
https://doc.dataiku.com/dss/latest/scenarios/index.html
that you are referring to?
And in that scenario, how do I make clear to the automation node that it should use the newly available data instead of the input data of the development workflow?

Does this mean that it is not possible with the free edition of DSS?
Triggering a scenario based on periodic file change is indeed a use case for the automation node!
You would use a simple scenario that rebuilds datasets, with a trigger on dataset change.

You are right that these features are part of the enterprise edition only...
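As a complement to the dataset-change trigger, a scenario can also be started programmatically through the public API, for instance from an external script that notices new files on the filesystem. A rough sketch, in which the host, API key, project key, scenario id and watched directory are all placeholders:

```python
# Hypothetical external watcher that starts a DSS scenario whenever new
# files appear in a directory. Host, API key, project key, scenario id
# and the watched path are all placeholders.
import os
import time
import dataikuapi

WATCH_DIR = "/data/incoming"
client = dataikuapi.DSSClient("https://dss-automation-node:11200", "YOUR_API_KEY")
scenario = client.get_project("MY_PROJECT").get_scenario("REBUILD_AND_SCORE")

seen = set(os.listdir(WATCH_DIR))
while True:
    current = set(os.listdir(WATCH_DIR))
    if current - seen:               # at least one new file arrived
        scenario.run_and_wait()      # rebuild the flow on the latest data
        seen = current
    time.sleep(60)                   # poll once a minute
```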

1 Answer

0 votes

Summing up the above discussion:

  • You can only deploy models (and the accompanying analysis script) to an API node. This seems to be what you want.
  • To deploy a whole flow in production, you need to use an automation node.
  • This enables automatic rebuilding of the flow, including rebuilding only on the latest data.

All these features are part of the enterprise edition only.

answered by