

There does not appear to be a way to write Spark DataFrames to disk using a set partition scheme. This is normally done via dataframe.write.parquet(<path>, partitionBy=['year']) if one wants to partition the data by year, for example. I am looking at the API page here: , specifically at the write_with_schema function.
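For reference, here is a minimal sketch (plain Python, not Spark itself) of the Hive-style directory layout that a partitioned write like the one above produces on disk. The helper name `partition_path` is hypothetical, used only to illustrate the layout:

```python
# Illustrative sketch: the Hive-style directory layout that
# dataframe.write.parquet(path, partitionBy=['year']) produces.
# The helper name `partition_path` is hypothetical.

def partition_path(base, partition_cols, row):
    """Build the output directory Spark would use for one row."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base] + parts)

# Each distinct value of 'year' gets its own subdirectory:
print(partition_path("/data/sales", ["year"], {"year": 2017, "amount": 9.5}))
# -> /data/sales/year=2017
```

With multiple partition columns, Spark nests one subdirectory per column, e.g. `year=2017/month=04/`.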

What are my options here? Since this is an important requirement for us, what's to stop me from simply using the sqlContext to write to a fixed path in HDFS with the command I gave above? Can this be hacked somehow, perhaps with a plugin?

I can't find how to override the write_with_schema call. Following the instructions here: , 'spark' does not appear to be a module in the tarball (dataiku-internal-client-5.1.0). Is there a reason that part of the API is hidden?

1 Answer


To use partitioning in Dataiku, you need to specify it on the output (and possibly the input) dataset. You can find more details on this page:

Once it is set up, this file-system partitioning scheme is applied to all recipes, including those running on Spark.
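To give an idea of what that file-system partitioning means in practice: each partition of the output dataset is mapped to its own path on disk via a path pattern. Below is a minimal sketch, assuming a time-based pattern using %Y/%M tokens; the function name `resolve_pattern` is hypothetical and this mimics the idea, not Dataiku's actual implementation:

```python
# Minimal sketch of how a file-system partitioning pattern such as
# "%Y/%M/.*" could map a (year, month) partition to a directory
# prefix. `resolve_pattern` is a hypothetical illustration only.

def resolve_pattern(pattern, year, month):
    """Substitute time-dimension tokens into a partition path pattern."""
    return (pattern
            .replace("%Y", f"{year:04d}")   # 4-digit year
            .replace("%M", f"{month:02d}")  # 2-digit month
            .replace("/.*", ""))            # drop the trailing file glob

print(resolve_pattern("%Y/%M/.*", 2017, 4))
# -> 2017/04
```

Each recipe run for a given partition then reads and writes only under that partition's resolved path.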

Hope it helps,

