Legacy API calls in Spark

jmccartin · 03-06-2019

In Dataiku's default PySpark recipes, dataiku.spark's get_dataframe takes a sqlContext and returns a Spark DataFrame. SQLContext has been a legacy entry point since Spark 2.0: SQL operations now go through the SparkSession, which has a few subtle but important differences. While one can create a SparkSession manually, it doesn't appear to work with Dataiku's dataframe API.

Please see here for specifics: https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.SQLContext

AdrienL · 03-06-2019

Hi,

It is indeed a deliberate choice, as a lot of people still use Spark 1.6 (still the default version in some widely-used Hadoop distributions). This way, the recipes you write for one Spark version still work if you switch (via project export, automation node or if your recipe is in a plugin or code sample) to another Dataiku instance that has a different Spark version.
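
To illustrate the portability concern, here is a minimal sketch of how recipe code could pick its SQL entry point based on the Spark version it runs on. The helper names `has_spark_session` and `get_sql_entry_point` are hypothetical and are not part of Dataiku's API. Whether the returned object can be passed to dataiku.spark's get_dataframe is not confirmed in this thread.

```python
def has_spark_session(spark_version):
    """Return True if this Spark version ships SparkSession (added in 2.0).

    Hypothetical helper: only the major version number is inspected.
    """
    major = int(spark_version.split(".")[0])
    return major >= 2


def get_sql_entry_point(sc):
    """Return a SparkSession on Spark 2.x+, else a SQLContext on Spark 1.x.

    `sc` is assumed to be an existing SparkContext. The pyspark imports are
    deferred so that the version check above works without pyspark installed.
    """
    if has_spark_session(sc.version):
        from pyspark.sql import SparkSession
        # getOrCreate() reuses the already-running SparkContext.
        return SparkSession.builder.getOrCreate()
    from pyspark.sql import SQLContext
    return SQLContext(sc)
```

Passing a sqlContext everywhere avoids this kind of branching, which is why a recipe written that way keeps working on both 1.6 and 2.x instances.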

Is there something specific that you can't do with the SQL Context / Spark Context and for which you'd need the Spark Session?

jmccartin · 03-06-2019

That's a valid answer that I had not considered (those poor folks still on 1.6!). It's not a major problem, but IIRC some ways of working with SQL/Hive tables through the sqlContext will not be forward compatible, since it will eventually be deprecated.

Labels

API

code

Python

Spark