
Train the model and automatically deploy the winning algorithm based on the metrics defined.

Tsurapaneni
Level 3

Hi Team,

I have a use case where, for each iteration, I have to train the same model on a different dataset and automatically deploy the winning model based on a defined metric. The algorithm used here is K-means clustering in AutoML.

To better illustrate the scenario, I am giving an example below.

Dataset 1 ---> Clustering model A (winning model is at K = 10) ---> deploy the model for K = 10.

Dataset 2 ---> Clustering model A (winning model is at K = 3) ---> deploy the model for K = 3.

Dataset 3 ---> Clustering model A (winning model is at K = 5) ---> deploy the model for K = 5.

At the end, I want to merge the outputs of all the deployed models.

 

Thank you in advance!

1 Reply
pmasiphelps
Dataiker

Hi,

 

The code below uses the Python API to train K-means models for different values of K, then deploys the best-performing model to the Flow. It then creates a scoring recipe to generate cluster labels for all records in the input dataset.

 

You can wrap it in a for loop to apply it to multiple input datasets (a sketch of the loop, and of merging the scored outputs, follows the code below). You'll have to change the dataset names, feature names, values of K to try, and the desired performance metric.

import dataiku
from dataiku import pandasutils as pdu
from dataikuapi.dss.recipe import ClusteringScoringRecipeCreator
import pandas as pd
import numpy as np

client = dataiku.api_client()
project = client.get_project('PROJECT_KEY')

#Replace with your input dataset - or wrap the code below in a for loop over multiple datasets
ml_dataset = "INPUT_DATASET_NAME"

#Create a new ML clustering task on input dataset
mltask = project.create_clustering_ml_task(
    input_dataset=ml_dataset,
    ml_backend_type='PY_MEMORY', # ML backend to use
    guess_policy='KMEANS', # Template to use for setting default parameters
    wait_guess_complete=True
)

settings = mltask.get_settings()

#Set K and other clustering settings
algorithm_settings = settings.get_algorithm_settings('KMEANS')
algorithm_settings['k'] = [3,5,6]

settings.mltask_settings['preprocessing']['outliers']['method'] = 'DROP'
settings.mltask_settings['preprocessing']['outliers']['min_cum_ratio'] = 0.05
settings.mltask_settings['preprocessing']['outliers']['min_n'] = 2

settings.save()

#Turn features on/off: reject all detected input features, then re-enable the ones to cluster on
features = settings.mltask_settings['preprocessing']['per_feature'].keys()
for feature in list(features):
    settings.reject_feature(feature)
settings.use_feature('FEATURE_1')
settings.use_feature('FEATURE_2')
settings.save()


#Train ML task
mltask.start_train()
mltask.wait_train_complete()

ids = mltask.get_trained_models_ids()

#Find the best-scoring model across the trained values of K
scores = []
for trained_id in ids:
    details = mltask.get_trained_model_details(trained_id)

    #Replace with your chosen metric
    performance = details.get_performance_metrics()['silhouette']

    scores.append({"model_id": trained_id,
                   "performance": performance})
    print('-----------Performance-----------')
    print(performance)

#Sort descending so the best-performing model comes first
scores = sorted(scores, key=lambda s: s['performance'], reverse=True)

best_model_id = scores[0]['model_id']

#Deploy the best model to the flow as a saved model
ret = mltask.deploy_to_flow(best_model_id, "{}_model".format(ml_dataset), ml_dataset)
saved_model_id = ret["savedModelId"]

#Use a scoring recipe with the training dataset and the deployed saved model to generate cluster labels
builder = ClusteringScoringRecipeCreator("{}_scoring_recipe".format(ml_dataset), project)
builder.with_input_model(saved_model_id)
builder.with_input(ml_dataset)

builder.with_new_output("{}_scored".format(ml_dataset), "filesystem_managed", format_option_id="CSV_EXCEL_GZIP")

cluster_recipe = builder.build()
cluster_recipe.compute_schema_updates().apply()
cluster_recipe.run()
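
For multiple datasets, a minimal sketch (dataset names here are placeholders) is to wrap the logic above in a function and loop over the inputs, collecting the names of the scored datasets:

#Hypothetical wrapper: the train/deploy/score logic above, refactored into a
#function so it can be applied to several input datasets in turn
def train_and_score(ml_dataset):
    # ... all of the code above, from create_clustering_ml_task through cluster_recipe.run() ...
    return "{}_scored".format(ml_dataset)

#Placeholder dataset names - replace with your own
scored_datasets = []
for ml_dataset in ["INPUT_DATASET_1", "INPUT_DATASET_2", "INPUT_DATASET_3"]:
    scored_datasets.append(train_and_score(ml_dataset))

To merge the scored outputs at the end, you could use a Stack recipe in the Flow, or do it in a Python recipe. A minimal sketch with pandas, assuming the scored datasets share a compatible schema and that the output dataset (all_clusters_merged is a placeholder name) already exists as the recipe's output:

import dataiku
import pandas as pd

#Read each scored dataset and stack the rows into one dataframe
dfs = [dataiku.Dataset(name).get_dataframe() for name in scored_datasets]
merged_df = pd.concat(dfs, ignore_index=True)

#Write the combined result - assumes all_clusters_merged exists as this recipe's output
dataiku.Dataset("all_clusters_merged").write_with_schema(merged_df)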

 

Best,

Pat