0 votes
Hi there, I am looking for a guide or scripts for this specific scenario.

I would also like to know whether this scenario is feasible.

A Dataiku DSS Docker container is running on a Cloudera CDH host. The Cloudera cluster requires a Kerberos ticket for authentication.

After reading the available material, I think the best approach would be to install the Hadoop binaries and map or copy the Hadoop and Kerberos configuration into the running DSS container. Will this work?

Are there any existing scripts or materials on how to do this properly (for Cloudera CDH)?

1 Answer

+2 votes
Best answer
Hi Rui

The approach you propose is indeed the correct one.

For Kerberos to work, you need to install the Kerberos client package inside the Docker image and mount the krb5.conf configuration file.

For Hadoop, you will need to install the CDH client packages and mount the various configuration directories (/etc/{hadoop,hive,spark,...}/conf). Beware that, depending on the way your CDH cluster is set up, you may have a number of symlink indirections in there.
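
As a rough illustration of the mounting step, here is a minimal sketch that resolves the symlink chain of each configuration directory on the host and prints the matching read-only bind-mount flags for docker run. The function name conf_mount_flags is hypothetical, and it assumes GNU readlink is available on the host; adapt the list of directories to what your cluster actually deploys.

```shell
# Print one docker "-v" flag per configuration directory, mounted read-only.
# The host side of each mount is the fully resolved path, so the container
# does not depend on intermediate symlinks that only exist on the host.
conf_mount_flags() {
    local dir target
    for dir in "$@"; do
        # readlink -f follows every level of symlink and prints the final path
        target=$(readlink -f "$dir") || continue
        # skip anything that does not resolve to an existing directory
        [ -d "$target" ] || continue
        printf -- '-v %s:%s:ro ' "$target" "$dir"
    done
}

# Possible usage (image name left as a placeholder):
#   docker run $(conf_mount_flags /etc/hadoop/conf /etc/hive/conf /etc/spark/conf) \
#       -v /etc/krb5.conf:/etc/krb5.conf:ro <your-dss-image>
```

The container then sees the files at the usual /etc/.../conf locations, while the host side points at the real directory behind the symlinks.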

Another difficulty related to Spark is that the Spark workers (running on the cluster nodes) need to be able to connect back to the Spark driver (running on the DSS host), which imposes extra constraints on the way the container network is configured.
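
One possible way to handle this, sketched under assumptions: pin the driver's advertised hostname and ports through Spark configuration properties so that the same ports can be published by Docker. The function name driver_net_conf and the hostname and port values in the usage comment are hypothetical; spark.driver.bindAddress and spark.driver.blockManager.port are assumed to be supported by your Spark version (they appeared in Spark 2.1).

```shell
# Print spark-submit --conf flags that pin the driver's network settings.
#   $1: hostname or IP that the cluster nodes can use to reach the driver
#   $2: driver RPC port
#   $3: driver block manager port
driver_net_conf() {
    # address advertised to the executors
    printf -- '--conf spark.driver.host=%s ' "$1"
    # listen on all interfaces inside the container
    printf -- '--conf spark.driver.bindAddress=0.0.0.0 '
    printf -- '--conf spark.driver.port=%s ' "$2"
    printf -- '--conf spark.driver.blockManager.port=%s\n' "$3"
}

# Possible usage, with the same ports published on the container
# (for example docker run -p 40000:40000 -p 40001:40001 ...):
#   spark-submit --master yarn $(driver_net_conf dss-host.example.com 40000 40001) ...
```

The same properties can be placed in spark-defaults.conf instead of being passed on the command line.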

All in all, it is a workable setup, though you need some understanding of the internals of Hadoop to configure it correctly. We have already done it a few times but do not have readily exportable materials for it. Do not hesitate to come back to us if you run into difficulties.

Patrice Bertin
Hi Patrice, thanks! An additional question: what if I go the other way around and try to add the DSS container instance as a Hadoop node through the Cloudera Manager interface (so that everything is installed automatically)? Do you think this could work? Do you have any experience doing this with a Docker container (as DSS is not on a "real" host)?
In practice this should probably work, but I have never attempted this approach. I doubt it is simpler, as Cloudera is quite strict in checking the configuration of managed hosts and is designed more for managing static hosts. It might be worth a try though.
You will then definitely need the container to be reachable from outside with a "normal" network stack (no NAT, and a globally known hostname).
OK Patrice, agreed, useful info, thanks!
Hi Patrice, I am getting close but am blocked on an issue. Perhaps you can help.
Kerberos and hadoop dfs -ls are working properly. I installed Spark and can submit jobs and see them on the cluster. I changed some Spark ports, allowed them through Docker, and checked that the Spark cluster can connect back to the Spark driver in the DSS Docker container.
For example, this test works properly:
cd /usr/local/spark/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar

But when I try to run the DSS Spark setup, I get the error below.

I also checked that python is not available by default in bash (as either the root or the dataiku user).

Any ideas? What am I missing?

dataiku@<hostname>:~/dss$ ./bin/dssadmin install-spark-integration
[+] Saving installation log to /home/dataiku/dss/run/install.log
*** Error detecting SPARK_HOME using spark-submit
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:91)
        at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 11 more
Hi Rui,

First of all, I would prefer to continue this thread in a support ticket at support.dataiku.com, as it is getting very specific to your particular setup and might still need a few more rounds of back-and-forth.

At first glance, you are missing the Python subsystem that Spark uses to submit Python files (as in: spark-submit file.py). DSS does exactly this with a small test Python file as part of the install-spark-integration script. Here, spark-submit fails because it cannot find python itself.

That should be easy to reproduce outside DSS. To fix it, you should probably install Python, or fix the Spark config so that it properly locates the Python installation that you intend it to use.
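
A minimal sketch of the second option, assuming an interpreter is installed under a name other than python: locate it on the PATH and point Spark at it through the PYSPARK_PYTHON environment variable, which spark-submit consults before falling back to the bare name "python". The function name find_python and the list of candidate names are hypothetical.

```shell
# Print the full path of the first Python interpreter found on the PATH.
# Returns non-zero if none of the candidate names is available.
find_python() {
    local candidate
    for candidate in python python3 python2; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    return 1
}

# Tell Spark which interpreter to launch, or report that none was found.
if py=$(find_python); then
    export PYSPARK_PYTHON="$py"
else
    echo "no Python interpreter found on PATH; install one first" >&2
fi
```

Running this as the dataiku user before ./bin/dssadmin install-spark-integration, or setting the variable in spark-env.sh, should show whether the error comes only from the missing "python" name.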

Patrice Bertin
Hi Patrice, I created the ticket, thanks. After setting up Python I was able to proceed, but now remote Spark execution seems to be looking for the container's "virtual" hostname, which won't work from outside. More info is on the ticket.