Huge time to update project permissions with Python API

Charly

Hi everyone!

I have a lot of users on my Dataiku instance. To organize the work, I set up an "admin" group for each root folder, with access to all the projects in that folder. These groups act as intermediate support for my instance.

To give these admin groups access to new projects, I wrote some code that I want to run daily, but it is strangely slow. The slow step is checking each project's permissions and adding the admin group if it is missing:

for project_key in liste_projets:
    project = client.get_project(project_key)
    project_permissions = project.get_permissions()
    # Look for an existing permission entry for the admin group
    groupe_admin_existe_pas = True
    for g in project_permissions['permissions']:
        if 'group' in g and g['group'] == groupe_admin:
            groupe_admin_existe_pas = False
            break
    # Only write the permissions back when the admin group is missing
    if groupe_admin_existe_pas:
        project_permissions['permissions'].append({'group': groupe_admin, 'admin': True})
        project.set_permissions(project_permissions)

 

Is there something wrong with this code? Some projects are processed instantly, but others take more than 10 to 15 seconds. I have a huge number of projects on this instance, so this adds up quickly!

Thanks in advance!


Operating system used: Redhat

Turribeach

I always thought it was odd that Dataiku doesn't inherit folder permissions into project permissions, although inheritance does work for subfolders. Perhaps it's to give users more flexibility to decide who can see their projects, but it's certainly a problem for support staff and admins who need access to all projects without using the full Administrator group, which gives too much power over the rest of the platform.

I don't see anything wrong with your code, other than that it won't scale well, given that you keep checking permissions for projects that have already been permissioned. Which brings me to the next point: once you have run your code for all projects, you only need to take care of new ones. So use this API call:

project.get_summary()['creationTag']['lastModifiedOn']

to get the project creation date time (a Unix epoch timestamp) and then only process projects created since the previous run of your script. This will be much faster, but it is still not perfect, as it still needs to iterate through all the projects to fetch each creation date time. There don't seem to be any APIs that allow filtering projects by creation date time or by project tags, which can be used to filter projects in the GUI. So perhaps you should raise a Product Idea for that. Being able to obtain a list of project keys that have a specific tag would be great for admin tasks like this one, although users can modify tags, so it's not foolproof.
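A minimal sketch of that filter, assuming `client` is a Dataiku API client and that `lastModifiedOn` is expressed in epoch milliseconds (worth checking against one project on your instance). The function name `projects_created_after` and the `last_run_ms` parameter are made up for illustration:

```python
def projects_created_after(client, project_keys, last_run_ms):
    """Return the keys of projects created after last_run_ms.

    Assumption: creationTag.lastModifiedOn is in epoch milliseconds.
    """
    recent = []
    for key in project_keys:
        summary = client.get_project(key).get_summary()
        created_ms = summary.get("creationTag", {}).get("lastModifiedOn")
        # Projects without a creation tag are kept, so they are never silently skipped.
        if created_ms is None or created_ms > last_run_ms:
            recent.append(key)
    return recent
```

It still makes one `get_summary()` call per project, so it only saves the permission reads and writes for older projects.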

As a workaround, you could keep a dataset of all the project keys you have already permissioned. You would then avoid fetching the creation date time for each project and could simply compute a list delta between the current and stored keys to see what you need to process. But using the creation date time should be good enough for now, I think.
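The list delta could be sketched like this. A JSON file stands in for the dataset to keep the example self-contained; the state file and all three function names are hypothetical:

```python
import json
from pathlib import Path


def load_processed_keys(state_file):
    """Read the set of already-permissioned project keys (empty on first run)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def save_processed_keys(state_file, keys):
    """Persist the processed project keys as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(keys)))


def keys_to_process(current_keys, state_file):
    """Return the keys present on the instance but not yet recorded as processed."""
    return sorted(set(current_keys) - load_processed_keys(state_file))
```

After a successful run, the newly processed keys would be merged with the stored ones and saved again, so the next run only sees projects created in between.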

 

 

 

Charly

I thought about your solution, but what if a project owner decides to remove the admin group from the project's permissions? We don't want the process of creating and sharing a project to take too long for end users, so they are made project owners. This check seems unavoidable.

What I was hoping for is a way to fetch only part of a project, namely its permissions. The difference in processing time between projects is probably due to the amount of information the API collects to access a project (though I have to admit I'm not completely sure).

The idea of filtering projects is good though, even if it doesn't address that part of my use case (previously, I wrote a recursive function to get all projects from a root folder and all its subfolders; it would have been easier with a tag).

Alternatively, do you know of a way for platform administrators to enforce a tag or a permission that project owners can't remove?

Turribeach

Well, it sounds wrong to implement a slow solution just because some random user may decide to remove an admin group they shouldn't remove. In that case you can warn users not to do that. Users being users, they may still do it, but you can find out they have done it and "retrain" them.

In any case, there are other ways to go about it. The reason the API is slow is that the project permissions are stored on the local file system as a JSON file. Add to that the overhead of the API, Python and Java, and you end up with something that's not very good at doing lots of IO on small files. Guess what's good at doing lots of IO on small files? The Linux command line. Below is a command that searches through all your instance's projects for a group and returns the list of project keys that have it:

find {YOUR_DSS_DATA_DIR}/config/projects -name params.json -exec grep "YOUR_ADMIN_GROUP_NAME" {} + | awk -v FS="/" '{print $6}'

You may need to adjust the awk field position depending on how many directory levels your data dir has. Mine has 2, so use print $5 for 1 level or print $7 for 3 levels. On an instance with 1500 projects it takes a couple of seconds to run. The only other caveat is to make sure your group name is unique and doesn't clash with similar names, like "Admin" and "Admin1", which would produce a false match. You can run the above in a Shell recipe in a Flow or with subprocess in Python. Then compute the delta against client.list_project_keys() to see which project keys still need the admin group added.
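For the subprocess route, here is a rough sketch rather than a drop-in script. It assumes `find` and `grep` are available on the DSS host and that the data dir has the `config/projects/<PROJECT_KEY>/params.json` layout used above. It swaps `awk` for `grep -l` plus `pathlib`, so the project key is read from each matching file's parent folder and no field position has to be adjusted. The function names are made up for illustration, and the substring caveat ("Admin" vs "Admin1") still applies:

```python
import subprocess
from pathlib import Path


def project_keys_with_group(data_dir, group_name):
    """Return the keys of projects whose params.json mentions group_name."""
    projects_dir = Path(data_dir) / "config" / "projects"
    if not projects_dir.is_dir():
        raise FileNotFoundError(f"No projects folder at {projects_dir}")
    # grep -l prints each matching file once; -F treats the name as a literal string.
    result = subprocess.run(
        ["find", str(projects_dir), "-name", "params.json",
         "-exec", "grep", "-l", "-F", "--", group_name, "{}", "+"],
        capture_output=True, text=True,
    )
    # find may exit non-zero when grep matches nothing, so the return code is
    # not checked; empty output simply means no project matched.
    return {Path(line).parent.name for line in result.stdout.splitlines() if line}


def keys_missing_group(all_keys, data_dir, group_name):
    """Delta between every project key and the keys that already have the group."""
    return sorted(set(all_keys) - project_keys_with_group(data_dir, group_name))
```

With a real client, the first argument would be `client.list_project_keys()`, and only the returned keys would then go through `get_permissions()` and `set_permissions()`.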

 

 

Charly

Excellent idea! I'm not used to running bash from Python, I should build better habits. Thanks a lot 😃
