
I am attempting to train an ML model using a Visual Analysis with Spark. However, the job fails with the following message:

[10:01:27] [INFO] [dku.utils]  - [2018/11/29-10:01:27.734] [task-result-getter-3] [ERROR] [org.apache.spark.scheduler.TaskSetManager]  - Total size of serialized results of 714 tasks (2.7 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

This must mean that the job is collecting results into the driver process, but I am not sure exactly what it is collecting. Can I configure the Visual Analysis not to collect any results? Is there a way to resolve this issue other than increasing spark.driver.maxResultSize?


1 Answer


Spark MLlib is a distributed ML library that requires more technical tuning than other methods. In this case, fortunately, Spark MLlib tells you which parameter to tune. Since the Visual Analysis needs to collect results from MLlib in order to analyze the model's performance, I advise following this recommendation and increasing spark.driver.maxResultSize progressively, 1 GB at a time.
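As a sketch of what that tuning looks like outside of Dataiku (the file name and the 3g value are illustrative assumptions, chosen because the error reports 2.7 GB of serialized results against a 2.0 GB limit):

```shell
# Hypothetical plain-Spark example: raise the driver result-size limit
# one step above the reported 2.7 GB of serialized results.
spark-submit \
  --conf spark.driver.maxResultSize=3g \
  my_training_job.py

# Equivalently, in spark-defaults.conf:
#   spark.driver.maxResultSize  3g
```

If 3g still fails, keep increasing by 1 GB at a time, while making sure the driver process itself has enough memory (spark.driver.memory) to hold the collected results.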

Note that if your training set fits in your server's memory, I would recommend using scikit-learn/XGBoost instead. It does not require the advanced tuning of MLlib and usually performs better (as more algorithms are available).
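To illustrate the in-memory alternative, here is a minimal scikit-learn sketch (the dataset and model choice are illustrative, not part of the original thread; it assumes scikit-learn is installed):

```python
# Minimal in-memory training run with scikit-learn: the whole dataset
# lives in driver/server memory, so no Spark result collection is involved.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

This trades distributed scalability for simplicity: it only works when the data fits in memory, which is exactly the situation described above.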

Hope it helps,

Thank you Alex, that helps.  One follow-up question: does selecting the option "Skip expensive reports" reduce the amount of data collected by the Visual Analysis?
Indeed, you can try that, but note that it would disable some model performance screens.
