I have an EC2 connection to a hierarchical file system that contains many sublevels and 100+ bottom-level files. I would like to import all of these into a DSS project, preferably while keeping the hierarchy. If that's not possible, I'd like each file named with its complete path (e.g. root/folder/file1 becomes root_folder_file1), and failing that, a mass import of all the bottom-level files (file1, file2, file3...). Functionally, I want to click on the root folder and hit "import all".

I do not see a "mass dataset creation" option in the Connections page of the Administrator panel. When I try to import a folder, it appears to stack all files that match the schema of the first file (e.g. if file1 and file2 have the same column headers, they are combined into a single stacked file1+2 dataset) and to ignore the files in the folder whose schema does not match (e.g. if file1 and file2 have different schemas, only file1 is imported and file2 is ignored). Whatever the actual behavior is, it definitely doesn't import everything and keep the hierarchy.

Is it possible to do this?

Thanks so much!
Would you like each file to become its own dataset (thus getting hundreds of datasets)? Or would you just like to be able to access all of the files?
We want each file to become its own dataset separate from the connection's file (thus getting hundreds of datasets). However, we want to keep some semblance of the file system - we don't just want 100 flat datasets, if that makes sense.
We do have a notion of managed folders: https://doc.dataiku.com/dss/latest/advanced/managed_folders.html
Is that what you want?

We don't yet have a notion of hierarchical datasets.
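
If you end up scripting the import yourself, the fallback naming from the question (root/folder/file1 becomes root_folder_file1) could be sketched roughly as below. This is plain Python that only computes names; it does not call any DSS API. `flatten_path` and `plan_dataset_names` are hypothetical helper names, and the rule that only letters, digits and underscores are kept is an assumption, not a documented DSS constraint.

```python
import re
from pathlib import PurePosixPath


def flatten_path(path):
    """Turn 'root/folder/file1' into 'root_folder_file1'."""
    # Drop a leading "/" so absolute paths do not start with an underscore.
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    # Assumption: replace anything other than letters, digits, underscores.
    return "_".join(re.sub(r"[^A-Za-z0-9_]", "_", p) for p in parts)


def plan_dataset_names(paths):
    """Map each flattened name to its source path, refusing collisions."""
    plan = {}
    for path in paths:
        name = flatten_path(path)
        if name in plan:
            raise ValueError(
                f"{path!r} and {plan[name]!r} both map to {name!r}"
            )
        plan[name] = path
    return plan


# Hypothetical file listing, standing in for the real connection contents.
files = ["root/folder/file1", "root/folder/file2", "root/other/sub/file3"]
for name, path in plan_dataset_names(files).items():
    print(name, "<-", path)
```

The collision check matters because flattening loses information: a/b_c and a_b/c would both become a_b_c, so it seems safer to fail loudly than to overwrite one dataset with another.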


© Dataiku 2012-2018