How can I avoid a spark.driver.maxResultSize error when running a Visual Analysis

jkonieczny
Level 2
I am attempting to train an ML model using a Visual Analysis with the Spark backend. However, the job fails with the following message:



[10:01:27] [INFO] [dku.utils]  - [2018/11/29-10:01:27.734] [task-result-getter-3] [ERROR] [org.apache.spark.scheduler.TaskSetManager]  - Total size of serialized results of 714 tasks (2.7 GB) is bigger than spark.driver.maxResultSize (2.0 GB)



This must mean that the job is collecting results into the driver process, but I am not sure what exactly it is collecting.  Can I configure the Visual Analysis to not collect any results?  Is there a way other than increasing spark.driver.maxResultSize to resolve this issue?

3 Replies
Alex_Combessie
Dataiker Alumni
Hi,

Spark MLlib is a distributed ML library that requires more technical tuning than other methods. In this case, fortunately, Spark MLlib tells you which parameter to tune. Since the Visual Analysis needs to collect results from MLlib into the driver to analyze model performance, I advise following this recommendation and increasing spark.driver.maxResultSize progressively, 1 GB at a time.
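As a sketch of what that looks like (assuming you can edit the Spark configuration used by your MLlib training; the exact place to set this varies by deployment), the property can be raised like this:

```
# Spark properties (e.g. spark-defaults.conf or a named Spark
# configuration). The failed run needed ~2.7 GB, so start just
# above that and increase 1 GB at a time if it still fails:
spark.driver.maxResultSize  3g
```

The same property can also be passed on the command line with `spark-submit --conf spark.driver.maxResultSize=3g`. Note that the driver process itself must have enough memory to hold these results, so you may need to raise spark.driver.memory accordingly.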

Note that if your training set fits in your server's memory, I would recommend using the scikit-learn/XGBoost backend instead. It does not require the advanced tuning of MLlib and usually performs better (as more algorithms are available).

Hope it helps,

Alex
jkonieczny
Level 2
Author
Thank you Alex, that helps. One follow-up question: does selecting the "Skip expensive reports" option reduce the amount of data collected by the Visual Analysis?
Alex_Combessie
Dataiker Alumni
Indeed, you can try that, but note that it will disable some of the model performance screens.