I am attempting to train an ML model using a Visual Analysis and Spark.  However, the job fails with the following message:

[10:01:27] [INFO] [dku.utils]  - [2018/11/29-10:01:27.734] [task-result-getter-3] [ERROR] [org.apache.spark.scheduler.TaskSetManager]  - Total size of serialized results of 714 tasks (2.7 GB) is bigger than spark.driver.maxResultSize (2.0 GB)

This must mean that the job is collecting results into the driver process, but I am not sure what exactly it is collecting.  Can I configure the Visual Analysis to not collect any results?  Is there a way other than increasing spark.driver.maxResultSize to resolve this issue?


1 Answer

Spark MLlib is a distributed ML library that requires more technical tuning than other methods. Fortunately, in this case the Spark error message tells you which parameter to tune. Because Visual Analysis needs to collect results from MLlib to analyse model performance, I advise following this recommendation and increasing spark.driver.maxResultSize progressively, 1 GB at a time.
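
To make the "1 GB at a time" advice concrete, here is a minimal Python sketch that reads the limit out of the error line from the question and proposes the next value to try. The helper name next_max_result_size and the spark_conf_override dictionary are hypothetical, made up for illustration; only the property name spark.driver.maxResultSize and the log text come from the thread. Where you actually enter the override depends on your Spark configuration setup.

```python
import math
import re

# Error line copied from the question (log prefix omitted).
LOG_LINE = (
    "Total size of serialized results of 714 tasks (2.7 GB) is bigger than "
    "spark.driver.maxResultSize (2.0 GB)"
)

# Matches the collected size and the current limit, both reported in GB.
_PATTERN = re.compile(
    r"serialized results of \d+ tasks \((?P<got>[\d.]+) GB\) is bigger than "
    r"spark\.driver\.maxResultSize \((?P<limit>[\d.]+) GB\)"
)


def next_max_result_size(log_line, step_gb=1):
    """Hypothetical helper: current limit plus one step, as a Spark size string."""
    match = _PATTERN.search(log_line)
    if match is None:
        raise ValueError("not a spark.driver.maxResultSize error line")
    limit_gb = float(match.group("limit"))
    # Round the current limit up to a whole GB, then add one step.
    return f"{math.ceil(limit_gb) + step_gb}g"


# Illustrative override to apply for the next training attempt.
spark_conf_override = {"spark.driver.maxResultSize": next_max_result_size(LOG_LINE)}
print(spark_conf_override)
```

If the job fails again with the same message, rerun the helper on the new error line to get the next step. Keep in mind that the driver needs enough memory to hold whatever limit you set.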

Note that if your training set fits in your server's memory, I would recommend using scikit-learn/XGBoost instead. It does not require the advanced tuning of MLlib and usually performs better (as more algorithms are available).

Hope it helps,

Thank you Alex, that helps.  One follow-up question: does selecting the option "Skip expensive reports" reduce the amount of data collected by the Visual Analysis?
Indeed, you can try that, but note that it will disable some model performance screens.