DSS Spark, via pyspark?
Hi, I'm trying to integrate DSS with Spark, but maybe I'm not quite understanding something. We have a server with PySpark on it, and the idea is for DSS to connect to that; however, the DSS documentation seems to say that Spark is expected to be local to the DSS server.
Am I missing something?
Apr 20, 2016
In all cases, Spark needs to be installed on the DSS server. That does not mean any Spark daemon has to run there, only that the Spark code is present.
Then, depending on the "spark.master" job submission parameter, executors can run either locally (the whole Spark job runs in a single JVM launched by the DSS backend, so it is not really distributed computing, but apart from that it works the same) or on a Hadoop cluster (the Spark job is driven from a JVM launched by the backend, while all the real data processing is done in JVMs running on the Hadoop worker nodes).
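To make the two modes concrete, here is what the corresponding spark.master settings could look like in Spark's conf/spark-defaults.conf. This is a sketch, not a DSS-specific configuration: the exact values depend on your Spark version and cluster setup.

```
# Local mode: all executors run inside a single JVM launched by the backend
spark.master    local[*]

# Hadoop/YARN mode: the driver runs locally, executors run on worker nodes
# (on Spark 1.x the value is "yarn-client"; on Spark 2.x+ it is simply "yarn")
# spark.master  yarn-client
```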
Of course, for the second mode to be possible, the DSS machine must have access to the Hadoop cluster, and the local Spark installation must be configured so that it knows how to contact that cluster (i.e. it must have the Hadoop client code in its classpath, as well as the Hadoop configuration files).
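Concretely, pointing the local Spark installation at the cluster usually amounts to telling Spark where the Hadoop client configuration lives, e.g. in Spark's conf/spark-env.sh. The path below is an illustrative assumption; adjust it to your installation.

```
# Tell Spark where to find the Hadoop client configuration files
# (core-site.xml, hdfs-site.xml, yarn-site.xml)
export HADOOP_CONF_DIR=/etc/hadoop/conf
```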
A third mode, not often used in production, would indeed be to have a standalone Spark cluster up and running somewhere, accessible to the DSS server. Again, it should just be a matter of correctly configuring the spark.master variable so that Spark jobs launched by DSS can run tasks on that cluster.
It's not a mode that we really test, but there is no reason it should not work.
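In that standalone mode, spark.master would point at the standalone master's URL, along these lines (host name is a placeholder; 7077 is the default standalone master port):

```
# Standalone cluster: point at the Spark master's URL
spark.master    spark://spark-master.example.com:7077
```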
May 4, 2016