+1 vote
In Dataiku's default PySpark recipes, dataiku.spark's get_dataframe takes a sqlContext and returns a Spark dataframe. SQLContext has been a legacy entry point since Spark 2.0: SQL operations now go through the SparkSession, which differs in a few subtle but important ways. While one can create a SparkSession manually, it doesn't appear to work with Dataiku's dataframe API.

Please see here for specifics: https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.SQLContext

1 Answer

0 votes

It is indeed a deliberate choice, as a lot of people still use Spark 1.6 (still the default version in some widely used Hadoop distributions). This way, the recipes you write for one Spark version still work if they move (via project export, an automation node, a plugin, or a code sample) to another Dataiku instance that has a different Spark version.

Is there something specific that you can't do with the SQL Context / Spark Context and for which you'd need the Spark Session?
That's a valid answer that I had not considered (those poor folks still on 1.6!). It's not a major problem, but IIRC some ways of working with SQL/Hive tables through the sqlContext will not be forward-compatible, since it will eventually be deprecated.
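
For anyone wanting both behaviours, here is a minimal sketch of a recipe that keeps the sqlContext call to get_dataframe but runs SQL through the SparkSession when one is available. It assumes that on Spark 2.x the SQLContext object exposes its session as a `sparkSession` attribute, and that on Spark 1.6 it does not. The dataset name "my_input" and the helper `sql_entry_point` are hypothetical, and I have not verified this against every Dataiku version.

```python
def sql_entry_point(sql_context):
    """Return the SparkSession behind sql_context if it has one (assumed for
    Spark 2.x), otherwise return sql_context itself (Spark 1.6)."""
    return getattr(sql_context, "sparkSession", sql_context)


def main():
    # Imported here because they only resolve inside a Dataiku PySpark recipe.
    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Dataiku's dataframe API still receives the sqlContext, as in the default recipe.
    dataset = dataiku.Dataset("my_input")  # hypothetical dataset name
    df = dkuspark.get_dataframe(sqlContext, dataset)

    # Both SQLContext and SparkSession provide .sql(), so this works either way.
    spark = sql_entry_point(sqlContext)
    df.registerTempTable("my_input")  # deprecated in 2.x, but also present in 1.6
    spark.sql("SELECT COUNT(*) AS n FROM my_input").show()
```

Calling main() from the recipe body should then behave the same on either Spark version, with only the SQL entry point differing.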

©Dataiku 2012-2018