How to access data within S3 folder using directory paths

gagaoreo
Level 1

Hello all,

I am working on a project where I need to access images and files stored in an S3 folder. The folder is in my flow and is paired with a Python recipe that performs the computation.

Ideally, I would like to reference these files by directory path, the same way I would in a project on my local machine.

Any help is much appreciated!

3 Replies
Turribeach

I am not sure what you are asking here. You can create a Dataiku managed folder backed by an S3 bucket and then access that managed folder from Python. Is that what you want?

gagaoreo
Level 1
Author

Apologies for the vagueness. I already have a Dataiku managed folder set up in the S3 bucket, and a Python recipe in the flow that takes that folder as input. My current roadblock is a package that requires, as a parameter, the path of a file within the folder.

I printed the current working directory, which is:

/data/dataiku/dss_data/jupyter-run/dku-workdirs/[PROJ_NAME]/notebook_editor_for_[FORMULA_NAME]/ipythondir/profile_default/db

The folder's location in the S3 bucket on AWS is:

AmazonS3/Buckets/[dept.]/dataiku/[PROJ_NAME]/[*folder*]

I'm confused about Dataiku's file structure and how to access this folder from the recipe.

Hope that clears things up, thanks!

To interact with a Dataiku managed folder, you need to use the Dataiku API. Because this code may run outside the DSS server, you should use the external API (the dataikuapi package). Here is some sample code:

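from datetime import datetime

import dataikuapi

# Replace with your DSS URL and a valid API key
host = "http://localhost:11200"
apiKey = "some_key"
client = dataikuapi.DSSClient(host, apiKey)

# Look up the managed folder by project key and folder id
project = client.get_project('MY_PROJECT')
folder = project.get_managed_folder("my_folder_id")

# Print the size, modification time, and path of each file in the folder
for content in folder.list_contents()['items']:
    # lastModified is in milliseconds, so convert it to seconds
    last_modified_seconds = content["lastModified"] / 1000
    # %M is minutes (%m would print the month)
    last_modified_str = datetime.fromtimestamp(last_modified_seconds).strftime("%Y-%m-%d %H:%M:%S")
    print("size=%s mtime=%s %s" % (content["size"], last_modified_str, content["path"]))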

 

Full API method list here: https://developer.dataiku.com/latest/api-reference/python/managed-folders.html#dataikuapi.dss.manage...
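
For the package that needs a file path: as far as I know, a managed folder stored on S3 has no local directory on the DSS server, so a common workaround is to stream the file into a temporary local file and pass that path to the package. Below is a minimal sketch of that idea, not a definitive implementation. The helper name as_local_path is made up for illustration. The commented usage assumes the internal dataiku package's Folder.get_download_stream, and the folder id, file name, and package call in it are placeholders.

```python
import io
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def as_local_path(stream, suffix=""):
    """Copy a binary file-like object to a temp file and yield its path.

    The temp file is deleted when the with-block exits.
    """
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        # Write the stream to disk and close the file before handing out the path
        with tmp:
            shutil.copyfileobj(stream, tmp)
        yield tmp.name
    finally:
        os.remove(tmp.name)


# Hypothetical usage inside a DSS Python recipe (all names are placeholders):
#
#   import dataiku
#   folder = dataiku.Folder("my_folder_id")
#   with folder.get_download_stream("/images/example.png") as stream:
#       with as_local_path(stream, suffix=".png") as local_path:
#           some_package.load(local_path)  # any function that only accepts a path

if __name__ == "__main__":
    # Stand-in for a managed-folder download stream
    fake_stream = io.BytesIO(b"example bytes")
    with as_local_path(fake_stream, suffix=".bin") as local_path:
        with open(local_path, "rb") as f:
            print(f.read())
```

Keeping the suffix lets packages that infer the file type from the extension still work, and the context manager removes the temporary copy once the package is done with it.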

 

 
