I installed Spark in a notebook environment. On creating the new pyspark notebook I get the following starter code:
import dataiku
import dataiku.spark as dkuspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

dataset = dataiku.Dataset("name_of_the_dataset")
df = dkuspark.get_dataframe(sqlContext, dataset)
The issue is that I'm on Spark 3.2.1, and since Spark 2.0, SparkSession has been the entry point for programming with DataFrames and Datasets. So I create a Spark session as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()  # "local[1]" stands in for the cluster master URL
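To keep the starter code working, I then wrap the session's SparkContext in an SQLContext (a sketch; SQLContext is deprecated in Spark 3.x but still available):

from pyspark.sql import SQLContext

# get_dataframe() is documented against an SQLContext, so derive one
# from the session's underlying SparkContext
sqlContext = SQLContext(spark.sparkContext)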
Therefore, running the following line gives me an error:
df = dkuspark.get_dataframe(sqlContext, dataset)
Error:
Py4JJavaError: An error occurred while calling o32.classForName. : java.lang.ClassNotFoundException: com.dataiku.dip.spark.StdDataikuSparkContext
Hi,
the spark-submit arguments aren't passing the needed JARs to Spark, which means you probably haven't done the integration of Spark with DSS (see https://doc.dataiku.com/dss/latest/spark/installation.html ). On a related note, make sure you don't install pyspark as a package in your code env, since that should be handled by the install-spark-integration script.
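A quick way to check from the notebook is to print the Spark configuration and look for the Dataiku JARs (a diagnostic sketch; which property carries them depends on your setup):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()
# With a working DSS integration, the Dataiku JARs should appear in
# one of the jar/classpath properties passed by spark-submit
for key in ("spark.jars", "spark.driver.extraClassPath", "spark.executor.extraClassPath"):
    print(key, "=", conf.get(key, "<not set>"))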
Hi,
I did the Spark integration with DSS, and I am creating a Spark session as mentioned above. I need the updated DSS code to import data as a Spark DataFrame. I've read the documentation, but I can't seem to find the answer.
Once you have your Spark SQLContext object, you can simply:
import dataiku
import dataiku.spark as dkuspark

# Example: Read the descriptor of a Dataiku dataset
mydataset = dataiku.Dataset("mydataset")
# And read it as a Spark dataframe
df = dkuspark.get_dataframe(sqlContext, mydataset)
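And for a Spark 3.x session like the one you created, a minimal end-to-end sketch (assuming the install-spark-integration script has been run, so the Dataiku JARs are on the classpath):

import dataiku
import dataiku.spark as dkuspark
from pyspark.sql import SparkSession, SQLContext

# Inside DSS, getOrCreate() picks up the spark-submit configuration
# injected by the integration; avoid forcing master("local[1]") here
spark = SparkSession.builder.getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

mydataset = dataiku.Dataset("mydataset")
df = dkuspark.get_dataframe(sqlContext, mydataset)
df.printSchema()  # quick sanity check that the schema came through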