
There does not appear to be a way to write Spark dataframes to disk with a chosen partition scheme. In plain PySpark this is normally done via dataframe.write.parquet(<path>, partitionBy=['year']), if one wants to partition the data by year, for example. I am looking at the API page here: https://doc.dataiku.com/dss/latest/python-api/pyspark.html, specifically the write_with_schema function, which does not seem to take any partitioning argument.
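
For reference, this is the plain PySpark pattern I mean; a minimal sketch with made-up path and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy dataframe standing in for the real data (hypothetical columns)
    df = spark.createDataFrame([(2017, "a"), (2018, "b")], ["year", "value"])

    # Writes Parquet files into year=2017/, year=2018/, ... subdirectories
    df.write.parquet("/data/events_by_year", partitionBy=["year"])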

What are my options here? This is an important requirement for us, so what's to stop me from simply using the sqlContext to write to a fixed path in HDFS with the command above? Can this be hacked somehow, perhaps with a plugin?
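
Concretely, I mean something like the sketch below from inside a PySpark recipe; the dataset name and HDFS path are made up:

    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    # Read the input dataset through Dataiku as usual
    input_dataset = dataiku.Dataset("events")  # hypothetical dataset name
    df = dkuspark.get_dataframe(sqlContext, input_dataset)

    # Bypass write_with_schema and write partitioned Parquet straight to HDFS.
    # DSS would not track these files as a managed dataset.
    df.write.mode("overwrite").parquet("hdfs:///user/me/events_by_year",
                                       partitionBy=["year"])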

I can't find how to override the write_with_schema call. Following the instructions here: https://doc.dataiku.com/dss/latest/python-api/outside-usage.html, 'spark' does not appear to be a module in the tarball (dataiku-internal-client-5.1.0). Any reason why you are trying to hide that part of the API?

1 Answer

Hi,

In order to use partitioning in Dataiku, you need to specify it on the output (and possibly input) dataset. You can find more details on this page: https://doc.dataiku.com/dss/latest/partitions/fs_datasets.html.

If you set it up this way, the file-system partitioning will be applied to all recipes writing to that dataset, including those running on Spark.
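
For illustration, a PySpark recipe writing to a dataset partitioned by a "year" dimension could look like the sketch below; the dataset and column names are examples, and it assumes the destination partition value is exposed to the recipe as the DKU_DST_year flow variable:

    import dataiku
    import dataiku.spark as dkuspark
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)

    input_dataset = dataiku.Dataset("events")            # example input
    output_dataset = dataiku.Dataset("events_by_year")   # example output, partitioned by "year"

    df = dkuspark.get_dataframe(sqlContext, input_dataset)

    # Keep only the rows belonging to the partition currently being built
    # (assumes the partition dimension is named "year" and maps to a "year" column)
    target_year = dataiku.dku_flow_variables["DKU_DST_year"]
    df_partition = df.filter(df["year"] == int(target_year))

    # DSS writes the files under the matching partition directory on HDFS,
    # following the partitioning pattern defined on the output dataset
    dkuspark.write_with_schema(output_dataset, df_partition)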

Hope it helps,

Alex