How to connect to Hadoop?

UserBird · ‎06-08-2015

We want to install DSS on a server that is not part of the Hadoop cluster.
Which “client libraries” are needed for the HDFS connection?

jrouquie · ‎06-08-2015

The easiest way, when possible, is to add the DSS server to the set of servers managed by Cloudera Manager (Hosts / Add new host to cluster...) and configure an "HDFS gateway" role for it. This ensures that the proper client packages and configuration files are installed, and that they are updated whenever the cluster configuration changes.

But DSS can also be installed on a server that is not part of the Hadoop cluster, as you want. This server needs client access to the cluster (and thus network access to all cluster nodes). It does not need to host any cluster role, such as a datanode.

This normally involves installing:

the Hadoop client libraries (Java jars) suitable for your Hadoop distribution (we do not package them as they are largely distribution-specific),

the Hadoop configuration files (containing host:port and other parameters) so that client processes (including DSS) can find and connect to the cluster.

This should be done for several Hadoop components: at least the HDFS and YARN/MapReduce subsystems, and optionally Hive / Pig / Impala if you plan to use these with DSS.

The precise way to do this depends on your Hadoop distribution, but should normally be documented in an “installing a client machine” section of the distribution manual. It normally boils down to:

configuring the distribution's package repository

installing a bunch of OS-level packages (which gets you all the jar files)

downloading the configuration files from the cluster management graphical interface and dropping them at the required location. (In the Cloudera Manager interface, this is HDFS page → "Actions" menu → "Download client configuration".)

The last two steps should be repeated whenever the cluster version or configuration is updated.
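
As a rough illustration of the last step, here is a small sketch that checks whether a directory of downloaded client configuration contains the two files an HDFS client normally reads (core-site.xml and hdfs-site.xml). HADOOP_CONF_DIR is the usual Hadoop environment variable. The function name and the /etc/hadoop/conf fallback are assumptions, so adjust them for your distribution.

```shell
# Hypothetical helper: print the standard HDFS client config files that are
# missing from a configuration directory (prints nothing if all are present).
# $1: directory to check. Falls back to $HADOOP_CONF_DIR, then to
#     /etc/hadoop/conf (an assumed default, not guaranteed on every distribution).
missing_client_configs() {
  conf_dir=${1:-${HADOOP_CONF_DIR:-/etc/hadoop/conf}}
  for f in core-site.xml hdfs-site.xml; do
    [ -f "$conf_dir/$f" ] || echo "$f"
  done
}
```

Calling missing_client_configs with no argument after dropping the files in place should print nothing. Any file name it prints still has to be copied over.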

If you are using Cloudera Manager, you can instead add the DSS server to the set of servers managed by the cluster manager (which installs the Hadoop packages) and use Cloudera Manager to deploy “gateway” or “proxy” roles on it. This keeps these resources (JARs and configuration files) up to date as the cluster evolves. You'll typically need gateway roles for HDFS, YARN or MapReduce, and Hive.

If you plan to store datasets managed by DSS in HDFS, set up a writable HDFS home directory for the dataiku user account (typically: /user/dataiku/). Also, Hive integration requires a dedicated writable Hive database.
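
For the home directory, a minimal sketch could look like the following. It only prints the two standard hdfs dfs commands (mkdir and chown) so you can review them first. The function name is made up for this example, and the commands are assumed to be run by an account with HDFS superuser rights.

```shell
# Hypothetical helper: print (do not run) the commands that create a
# writable HDFS home directory for the DSS account.
# $1: Linux account used by DSS (defaults to "dataiku", as in the text above).
# Assumption: the printed commands are later run by an HDFS superuser.
hdfs_home_commands() {
  user=${1:-dataiku}
  echo "hdfs dfs -mkdir -p /user/$user"
  echo "hdfs dfs -chown $user /user/$user"
}
```

Once the printed commands have been run, hdfs dfs -ls from the DSS account should succeed without error.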

To test Hadoop connectivity, check that the following commands work from the Linux user account used by DSS:


hadoop version
hdfs dfs -ls /    # To test that the HDFS client configuration works
hdfs dfs -ls      # To test DSS's HDFS home directory
hive -e "show databases;"  # To test Hive connectivity (optional)
yarn node -list   # Lists the cluster executor nodes
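
If you want to run these checks in one go, something like the sketch below may help. It wraps each command and prints OK or FAIL instead of the full output. The function names are made up for this example, and the Hive check is left out because it is optional.

```shell
# Hypothetical wrapper: run one command quietly and report OK or FAIL.
# Usage: run_check <command> [args...]
run_check() {
  if "$@" >/dev/null 2>&1; then
    echo "OK: $*"
  else
    echo "FAIL: $*"
    return 1
  fi
}

# Run the mandatory checks from the list above.
# The return status is the number of checks that failed (0 means all passed).
check_hadoop_connectivity() {
  failures=0
  run_check hadoop version  || failures=$((failures + 1))
  run_check hdfs dfs -ls /  || failures=$((failures + 1))
  run_check hdfs dfs -ls    || failures=$((failures + 1))
  run_check yarn node -list || failures=$((failures + 1))
  return "$failures"
}
```

Run check_hadoop_connectivity from the Linux account used by DSS. A missing client package shows up as FAIL, because a command that cannot be found also counts as a failure.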

The Hadoop connection is detected automatically during installation. If you configured it after installing DSS, you need to let DSS detect it by running the following, where DATADIR is the DSS data directory (use only the line that matches your DSS version):


DATADIR/bin/dss stop
DATADIR/bin/post-install  # for versions up to 2.0
DATADIR/bin/dssadmin install-hadoop-integration # for versions 2.1 and above
DATADIR/bin/dss start
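
The version switch can also be scripted. The sketch below only prints the command to run between the stop and the start, based on the two cases given above (up to 2.0, and 2.1 and above). The function name and the way the version string is parsed are assumptions for illustration.

```shell
# Hypothetical helper: print the Hadoop detection command for a DSS version.
# $1: DSS data directory, $2: DSS version such as "2.0" or "2.1.3".
hadoop_detect_command() {
  datadir=$1
  version=$2
  major=${version%%.*}
  case "$version" in
    *.*) rest=${version#*.}; minor=${rest%%.*} ;;
    *)   minor=0 ;;   # no minor part given, treat as ".0"
  esac
  if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 1 ]; }; then
    echo "$datadir/bin/dssadmin install-hadoop-integration"   # 2.1 and above
  else
    echo "$datadir/bin/post-install"                          # up to 2.0
  fi
}
```

For example, hadoop_detect_command /path/to/datadir 2.0 prints the post-install line, and any version from 2.1 onward prints the dssadmin line.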

See also

http://doc.dataiku.com/dss/latest/installation/hadoop.html

