
Hello! I'm back with another question on the API ;)

Here's the thing: I am building an entire project through the Python API. First I build the datasets:

new_dataset = new_project.create_dataset(dataset_name=dataset_name,
                                         type="HDFS",
                                         params={
                                             'connection': connection,
                                             'path': "/" + dataset_name,
                                             'hiveDatabase': connection,
                                             'hiveTableName': dataset_name,
                                             'metastoreSynchronizationEnabled': True
                                         },
                                         formatType='orcfile')

I also add a schema copied from datasets in another project. Everything works fine up to this point.
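
For reference, the schema copy is roughly this (a minimal sketch; `client` and "SOURCE_PROJECT" are placeholders for my API client and the other project's key):

source_project = client.get_project("SOURCE_PROJECT")
source_dataset = source_project.get_dataset(dataset_name)

# Copy the columns from the reference dataset onto the new HDFS dataset
schema = source_dataset.get_schema()
new_dataset.set_schema(schema)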

Then I build the recipes. Since the code is really long I won't post it all here, but here's the idea:

I choose a name and a recipe type, then use the CodeRecipeCreator. Then I select some input(s) and output(s), and finally build the recipe (recipe_builder_object.build()). As a last step, I put some code in the definition and payload object, roughly as sketched below.
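
To give an idea, the builder part looks roughly like this (a simplified sketch; dataset_input_name, dataset_output_name and recipe_code are placeholders for my actual variables):

from dataikuapi.dss.recipe import CodeRecipeCreator

# Build a code recipe (here a PySpark one) with one input and one existing output dataset
builder = CodeRecipeCreator("compute_" + dataset_output_name, "pyspark", new_project)
builder = builder.with_input(dataset_input_name)
builder = builder.with_output(dataset_output_name)   # the output dataset was created above
recipe = builder.build()

# Put the actual code into the recipe's payload
definition = recipe.get_definition_and_payload()
definition.set_payload(recipe_code)
recipe.set_definition_and_payload(definition)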

OK, everything worked pretty well! Python/R/Hive/stack... recipes are working, putting the data in the right datasets, etc.

And then I tried some Spark recipes (PySpark/SparkR). The recipe seems okay, but when I try to run it in my project, I get the following error:

[Exec-61] [INFO] [dku.utils]  - : org.apache.hadoop.mapred.FileAlreadyExistsException: 'Some_path_to_hdfs/dataset_name' already exists.

I just noticed that it's not about the recipe type, but about how you write the output. If you write the pandas dataframe with dataiku.Dataset("dataset_name").write_with_schema(pandas_df), the Spark recipe works. But it fails if you're working with the Spark version:

import dataiku
import dataiku.spark as dkuspark

ds_fac_output = dataiku.Dataset(dataset_output_name)
dkuspark.write_with_schema(ds_fac_output, spark_df)

 

So I looked into the parameters of the dataset/recipe, and I can't figure out why this isn't working. As far as I know, I am working in 'erase' mode, not 'append'. If I manually delete the directory containing the ORC dataset on HDFS, the recipe works, but only ONCE. If I run it again, I get the same error again.
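
For what it's worth, here is how I check that the output is not in append mode (a rough sketch; I'm assuming the output items of the raw recipe definition carry an 'appendMode' flag):

definition = recipe.get_definition_and_payload()
raw = definition.get_recipe_raw_definition()

# Assumption: each output item has an 'appendMode' flag; False should mean 'erase' mode
for item in raw['outputs']['main']['items']:
    print(item['ref'], item.get('appendMode', False))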

Hoping I didn't omit anything! :)

Thanks

Hi Steven, what Spark configuration are you using? https://doc.dataiku.com/dss/latest/spark/configuration.html
Thanks,
Alex

1 Answer

Hi,

It seems to be yarn-large. But I tried to change it, and it doesn't change the error :/