Hi there,
I am looking for a way to get the database keys and names that are shared into my project using the Dataiku API.
I tried the following:
import dataiku
client = dataiku.api_client()
project = client.get_project('PROJECT_NAME')
datasets = project.list_datasets()
When using datasets[index_of_database]['params']['table'], then I get the name of a database.
However, the API call does not include databases which are shared into my project.
The background is that I want to find dependencies between projects (e.g. if a database from project A is shared into project B, then project A needs to be built first).
I am looking forward to your help.
Best,
Oliver
Hi, this code snippet can help you get the list of shared datasets + their connections.
import dataiku

client = dataiku.api_client()
for project_key in client.list_project_keys():
    print("*** EXPOSED FROM PROJECT %s ***" % project_key)
    p = client.get_project(project_key)
    for exposed_object in p.get_settings().get_raw()["exposedObjects"]["objects"]:
        # Only datasets have a connection; other exposed object types
        # (assumed here to carry a type other than "DATASET") are reported with db=None
        connection = None
        if exposed_object["type"] == "DATASET":
            connection = p.get_dataset(exposed_object["localName"]).get_definition().get('params').get('connection')
        print("  Object id=%s type=%s db=%s is exposed to projects:" % (exposed_object["localName"], exposed_object["type"], connection))
        for rule in exposed_object["rules"]:
            print("    %s" % rule["targetProject"])
Cheers,
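For the dependency use case in the question, it may be handier to have the same information grouped by target project rather than printed. Here is a minimal sketch that only works on the raw settings dict, assuming it has the layout the snippet above reads from p.get_settings().get_raw() (exposedObjects, objects, localName, type, rules, targetProject). The function name exposed_by_target and the sample dict are made up for illustration and are not part of the Dataiku API.

```python
from collections import defaultdict


def exposed_by_target(raw_settings):
    """Map each target project key to the (type, localName) pairs exposed to it.

    raw_settings is assumed to look like the dict returned by
    p.get_settings().get_raw() in the snippet above.
    """
    by_target = defaultdict(list)
    objects = raw_settings.get("exposedObjects", {}).get("objects", [])
    for obj in objects:
        for rule in obj.get("rules", []):
            by_target[rule["targetProject"]].append((obj["type"], obj["localName"]))
    return dict(by_target)


# Hand-written sample, not real output from a DSS instance
sample = {
    "exposedObjects": {
        "objects": [
            {"type": "DATASET", "localName": "orders",
             "rules": [{"targetProject": "PRJB"}, {"targetProject": "PRJC"}]},
            {"type": "DATASET", "localName": "customers",
             "rules": [{"targetProject": "PRJB"}]},
        ]
    }
}
result = exposed_by_target(sample)
print(result)
```

On a real instance you would call it once per project, passing client.get_project(project_key).get_settings().get_raw().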
If you want to check whether a shared (exported) dataset is actually used downstream (i.e. it is an input of a recipe in another project), you can use something like this:
import re

def get_shared_datasets(client, project_key=None, direction='from'):
    # Returns all the shared datasets
    # 1. from a given project (direction = from)
    #    i.e. it returns all the datasets that are exported (shared) from this project
    #    and are used. So for example if DS1 is exported from PRJA to PRJB
    #    it is reported only if in PRJB there is a recipe reading PRJA.DS1.
    # 2. or to a given project (direction = to)
    #    i.e. it returns all the datasets that are imported to this project
    #    and are used. So for example if DS1 is imported from PRJB to PRJA
    #    it is reported only if in PRJA there is a recipe reading PRJB.DS1.
    # project_key can be <str> or <list> of <str>
    # If project_key is None, then returns exported datasets from every project
    # Result is a dict with structure:
    # {'PROJECT_KEY_A':
    #     {'dataset_A': ['CHILD_PROJECT_A'],
    #      'dataset_B': ['CHILD_PROJECT_A', 'CHILD_PROJECT_B'],
    #      ... },
    #  'PROJECT_KEY_B':
    #     { .. }
    # }
    # client = dataiku.api_client()
    projects = []
    if isinstance(project_key, str):
        projects = [project_key]
    if isinstance(project_key, list):
        projects = project_key
    # A reference to a dataset from another project has the form PROJECT_KEY.dataset_name
    patt = re.compile(r'\w+\.\w+')
    shared_datasets = {}
    for project in client.list_projects():
        current_key = project['projectKey']
        prj = client.get_project(current_key)
        for r in prj.list_recipes():
            # Only the 'main' input role of each recipe is inspected
            items = r.get('inputs', {}).get('main', {}).get('items', [])
            for inp in items:
                if not patt.match(inp['ref']):
                    continue
                # Split on the first dot only: [source project key, dataset name]
                proj_ds = inp['ref'].split('.', 1)
                if project_key is None or \
                        (proj_ds[0] in projects and direction == 'from') or \
                        (current_key in projects and direction == 'to'):
                    children = shared_datasets.setdefault(proj_ds[0], {}).setdefault(proj_ds[1], [])
                    if current_key not in children:
                        children.append(current_key)
    return shared_datasets
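To get from that result to the build order asked about in the original question, one option is a topological sort over the project keys. The sketch below is only an illustration: build_order is a hypothetical helper name, the sample dict is hand-written in the structure documented in the comments of get_shared_datasets, and graphlib requires Python 3.9 or later, which may not match the Python version of your DSS code environment.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_order(shared_datasets):
    """Order project keys so that every source project comes before its consumers.

    shared_datasets is assumed to have the structure documented above:
    {source_project: {dataset_name: [consumer_project, ...]}}
    """
    sorter = TopologicalSorter()
    for source_project, datasets in shared_datasets.items():
        sorter.add(source_project)
        for consumers in datasets.values():
            for consumer in consumers:
                if consumer != source_project:
                    # The consumer depends on the source project
                    sorter.add(consumer, source_project)
    # static_order() raises graphlib.CycleError if projects share datasets in a cycle
    return list(sorter.static_order())


# Hand-written sample, not real output from a DSS instance
sample = {
    "PRJA": {"ds1": ["PRJB"], "ds2": ["PRJB", "PRJC"]},
    "PRJB": {"ds3": ["PRJC"]},
}
order = build_order(sample)
print(order)
```

With real data you would pass in get_shared_datasets(client) and build the projects in the returned order.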