Q & A
Importing a hierarchical file system from an EC2 connection
I have an EC2 connection to a hierarchical file system that contains many sublevels and 100+ bottom-level files. I would like to import all of these into a DSS project, preferably keeping the hierarchy. If that's not possible, I would like each file named with its complete path (e.g. root/folder/file1 becomes root_folder_file1). Failing that, I would settle for mass importing all of the bottom-level files (file1, file2, file3...). Functionally, I want to click on the root folder and hit "import all".
I do not see a "mass dataset creation" option in the Connections page of the Administrator panel. When I try to import a folder, it appears to stack all files that match the schema of the first file and to ignore the rest. For example, if file1 and file2 contain the same column headers, they are combined into a single stacked dataset (file1+2); if they have different schemas, only file1 is imported and file2 is ignored. Whatever the actual behavior is, it definitely doesn't import everything and keep the hierarchy.
Is it possible to do this?
Thanks so much!
Jul 1, 2016
Would you like each file to become its own dataset (thus getting hundreds of datasets)? Or would you just like to be able to access all of the files?
Aug 2, 2016
We want each file to become its own dataset separate from the connection's file (thus getting hundreds of datasets). However, we want to keep some semblance of the file system - we don't just want 100 flat datasets, if that makes sense.
Aug 3, 2016
We do have a notion of dataset:
Is that what you want?
We don't yet have a notion of hierarchical datasets.
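In the meantime, the fallback naming scheme from the question (root/folder/file1 becomes root_folder_file1) can be computed outside DSS. The following is only a minimal sketch using the Python standard library, not a DSS feature: the function name `flattened_names` is hypothetical, it assumes the file system is reachable as a local path, and stripping the file extension is an assumption made here for readability. It does not create any datasets; it only produces the names you could use for them.

```python
import os


def flattened_names(root):
    """Map each file under `root` to a flat name built from its path.

    Hypothetical helper: the path is taken relative to the parent of
    `root`, so the root folder's own name is part of the result.
    """
    root = os.path.abspath(root)
    parent = os.path.dirname(root)
    names = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            # e.g. "root/folder/file1.csv" relative to the parent of root
            relative = os.path.relpath(full_path, parent)
            # Assumption: drop the extension, then join the parts with "_"
            stem, _extension = os.path.splitext(relative)
            names[full_path] = stem.replace(os.sep, "_")
    return names
```

Note that this scheme is not collision-free: a/b_c and a_b/c would both flatten to a_b_c, so it is worth checking for duplicate names before using them.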
Aug 18, 2016
©Dataiku 2012-2018