I installed Spark in a notebook environment. On creating the new pyspark notebook I get the following starter code:
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)
The issue is that I'm on Spark 3.2.1, and since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets. So I create a Spark session as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()  # "local[1]" stands in for the cluster master URL
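To keep the starter code working, I then wrap the session's SparkContext in an SQLContext (a sketch; SQLContext is deprecated in Spark 3.x but still available):

from pyspark.sql import SQLContext

# get_dataframe() is documented against an SQLContext, so derive one
# from the session's underlying SparkContext
sqlContext = SQLContext(spark.sparkContext)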
Therefore, running the following line gives me an error:
df = dkuspark.get_dataframe(sqlContext, dataset)
Error:
Py4JJavaError: An error occurred while calling o32.classForName. : java.lang.ClassNotFoundException: com.dataiku.dip.spark.StdDataikuSparkContext
Hi,
the spark-submit arguments aren't passing the needed JARs to Spark, which means you probably haven't done the integration of Spark with DSS (see https://doc.dataiku.com/dss/latest/spark/installation.html ). On a related note, make sure you don't install pyspark as a package in your code env, since that should be handled by the install-spark-integration script.
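A quick way to check from the notebook is to print the Spark configuration and look for the Dataiku JARs (a diagnostic sketch; which property carries them depends on your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()
# With a working DSS integration, the Dataiku JARs should appear in
# one of the jar/classpath properties passed by spark-submit
for key in ("spark.jars", "spark.driver.extraClassPath", "spark.executor.extraClassPath"):
    print(key, "=", conf.get(key, "<not set>"))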
Hi,
I did the Spark integration with DSS, and I am creating a Spark session as mentioned above. I need the updated DSS code to import data as a Spark DataFrame. I've read the documentation, but I can't seem to find the answer.
Once you have your Spark SQLContext object, you can simply:
import dataiku
import dataiku.spark as dkuspark

# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("mydataset")
# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
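And for a Spark 3.x session like the one you created, a minimal end-to-end sketch (assuming the install-spark-integration script has been run, so the Dataiku JARs are on the classpath):

import dataiku
import dataiku.spark as dkuspark
from pyspark.sql import SparkSession, SQLContext

# Inside DSS, getOrCreate() picks up the spark-submit configuration
# injected by the integration; avoid forcing master("local[1]") here
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

mydataset = dataiku.Dataset("mydataset")
df = dkuspark.get_dataframe(sqlContext, mydataset)
df.printSchema()  # quick sanity check that the schema came through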