Coming soon: We’re working on a brand new, revamped Community experience. Want to receive updates? Sign up now!
There does not appear to be a way to write spark jobs to disk using a set partition scheme. This is normally done via dataframe.write.parquet(<path>, partitionBy=['year']), if one is to partition the data by year, for example. I am looking at the API page here: https://doc.dataiku.com/dss/latest/python-api/pyspark.html, specifically the function: write_with_schema .
What are my options here? Since this is an important requirement for us, what's to stop me from simply using the sqlContext to write to a fixed path in HDFS, using the command I gave above? Can this be hacked somehow, or by using a plugin?