
Spark tasks in a single Spark job

If I run a single Spark recipe, it follows the block-size rule, i.e. it creates one task per 128 MB block.
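To make the block-size rule concrete, here is a rough sketch of the arithmetic. The function name and the 10 GiB input size are made up for illustration, and the real split count also depends on the file format and on the input split settings.

```python
import math

# Assumption: one input task per 128 MiB block, as described in the question.
DEFAULT_BLOCK_BYTES = 128 * 1024 * 1024


def estimated_input_tasks(input_bytes: int, block_bytes: int = DEFAULT_BLOCK_BYTES) -> int:
    """Estimate the number of input tasks: one per block, rounding up."""
    if input_bytes <= 0:
        return 0
    return math.ceil(input_bytes / block_bytes)


# Hypothetical 10 GiB dataset
print(estimated_input_tasks(10 * 1024**3))
```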

But if I run the same Spark job in a Spark pipeline, it runs only 8 or 9 tasks (never more), no matter how big a cluster I choose. I noted this from the Spark UI. We have a 20-node cluster, but the Spark pipeline uses only 2 nodes, whereas the same job run without the pipeline uses the whole cluster.

Spark pipeline (Spark UI): [image]

Spark single recipe (Spark UI): [image]

As the images above show, recipes run in a Spark pipeline execute only 8 or 9 tasks, while a normal Spark recipe uses the whole cluster, according to data size and block size.


1 Answer

Best answer

From your screenshots, we can see that your Spark stage is properly parallelizable even in pipeline mode, since it has 102 tasks. Scheduling tasks within a stage is not something that DSS has a say in; it's handled by Spark and YARN. It is very possible that your cluster or queue had some restriction at that time.
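To illustrate why a stage with 102 tasks can still show only about 8 running at once, here is a minimal sketch of how concurrency follows from the executors actually granted. The executor and core counts are hypothetical, not read from your job; the setting names in the comments are standard Spark properties.

```python
import math


def task_slots(num_executors: int, cores_per_executor: int, cpus_per_task: int = 1) -> int:
    """Tasks that can run at the same time.

    cores_per_executor corresponds to spark.executor.cores and
    cpus_per_task to spark.task.cpus.
    """
    return num_executors * (cores_per_executor // cpus_per_task)


def waves(num_tasks: int, slots: int) -> int:
    """Number of scheduling rounds needed to run all tasks of a stage."""
    if slots <= 0:
        raise ValueError("no task slots available")
    return math.ceil(num_tasks / slots)


# Hypothetical: only 2 executors with 4 cores each were granted
slots = task_slots(num_executors=2, cores_per_executor=4)
print(slots, waves(102, slots))
```

With these assumed numbers, the 102 tasks are all still executed, just 8 at a time.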

Please also note that in a pipeline, DSS will use the Spark configuration of the "latest" task of the pipeline. If the two runs don't use the same number of executors, that could explain the difference. You should have a look at the executors page of your Spark UI to see if you have the same number of executors.
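One way to compare the two runs is to copy the Spark properties from the Environment tab of each Spark UI and diff the executor-related ones. The sketch below assumes you have pasted them into two dicts; the values shown are invented, and the helper name is hypothetical.

```python
# Standard Spark property names that influence how many executors a job gets.
EXECUTOR_KEYS = (
    "spark.executor.instances",
    "spark.executor.cores",
    "spark.executor.memory",
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.maxExecutors",
)


def diff_executor_settings(conf_a: dict, conf_b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for executor settings that differ."""
    diffs = {}
    for key in EXECUTOR_KEYS:
        a, b = conf_a.get(key), conf_b.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Invented example values, not taken from the job in the question
single_recipe_conf = {"spark.executor.instances": "20", "spark.executor.cores": "4"}
pipeline_conf = {"spark.executor.instances": "2", "spark.executor.cores": "4"}

print(diff_executor_settings(single_recipe_conf, pipeline_conf))
```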

Also, if you have dynamic allocation enabled, behavior can be less predictable, especially on such extremely short jobs.
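As a rough illustration of why short jobs suffer here: with dynamic allocation, Spark requests executors in rounds while tasks are backlogged, and the number requested grows exponentially per round (1, then 2, 4, 8, ...). The simplified model below assumes the backlog persists for the whole job and that the cluster grants every request instantly, which a real YARN queue will not do. The function name and the job durations are hypothetical.

```python
def executors_requested(
    job_seconds: float,
    max_executors: int = 20,        # cf. spark.dynamicAllocation.maxExecutors
    backlog_timeout: float = 1.0,   # cf. spark.dynamicAllocation.schedulerBacklogTimeout
    sustained_timeout: float = 1.0, # cf. spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
    initial: int = 0,
) -> int:
    """Simplified model of how many executors have been requested by job end."""
    total = initial
    to_add = 1                      # the first round asks for one executor
    t = backlog_timeout             # time of the first request
    while t <= job_seconds and total < max_executors:
        total = min(max_executors, total + to_add)
        to_add *= 2                 # each later round doubles the request
        t += sustained_timeout
    return total


# Hypothetical job durations, in seconds
for seconds in (0.5, 3, 10):
    print(seconds, executors_requested(seconds))
```

Under these assumptions a job lasting a few seconds ends before the request count gets anywhere near the size of the cluster.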

Please note that this community Q&A is more suited to generic questions rather than support questions about particular jobs. You can use the support portal for support questions about particular jobs.

©Dataiku 2012-2018 - Privacy Policy