Using catboost as custom python model

tjh · ‎02-17-2019

Hi, I would like to use catboost (https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/). The minimum required configuration is to tell the constructor which features are categorical.

How can I specify these correctly, when I have no access to the X matrix?

Also, how can I prevent Dataiku from transforming these categorical features?

Thanks for your help,

Thomas.

tjh · ‎02-17-2019

If the above does not work, can I use a custom encoding for categorical features, such as a target encoder?

tjh · ‎02-17-2019

Here is my target encoder ...

Unfortunately it seems that fit is only called with X ... so this will not work.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TargetEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):
        self.dict_averages = {}
        self.dict_priors = {}

        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
        self.noise_level = noise_level  # stored but not used below

    def fit(self, X, y=None):
        # X and y are expected to be named pandas Series
        assert y is not None
        target = y
        self.y_col = y.name

        trn_series = X
        col = X.name

        temp = pd.concat([trn_series, target], axis=1)
        # Compute target mean and count per category
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Compute smoothing weight (sigmoid of the category count)
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Prior is the mean over all target data
        prior = target.mean()
        # The bigger the count, the less the prior is taken into account
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(["mean", "count"], axis=1, inplace=True)
        self.dict_averages.update({col: averages})
        self.dict_priors.update({col: prior})
        return self

    def transform(self, X):
        trn_series = X
        col = X.name
        # Look up the encoded value per category; unseen categories get the prior
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),
            self.dict_averages[col].reset_index().rename(columns={self.y_col: 'average'}),
            on=trn_series.name, how='left')['average'].rename(trn_series.name).fillna(self.dict_priors[col])
        # pd.merge does not keep the index, so restore it
        ft_trn_series.index = trn_series.index
        return ft_trn_series


processor = TargetEncoder()
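
For readers following the encoder above, here is a minimal standalone sketch of the smoothing it computes: each category's target mean is blended with the global prior, weighted by a sigmoid of the category count. The function name `smoothed_target_means` and the toy data are hypothetical and not part of the original post; it also does not solve the problem that fit is only called with X.

```python
import numpy as np
import pandas as pd


def smoothed_target_means(categories, target, min_samples_leaf=1, smoothing=1):
    """Blend per-category target means with the global mean (hypothetical helper)."""
    # Mean and count of the target for each category value
    stats = target.groupby(categories).agg(["mean", "count"])
    # Sigmoid weight: close to 1 for frequent categories, smaller for rare ones
    weight = 1 / (1 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    prior = target.mean()
    return prior * (1 - weight) + stats["mean"] * weight


# Toy data: "a" appears three times, "b" only once
categories = pd.Series(["a", "a", "a", "b"], name="city")
target = pd.Series([1, 1, 0, 0], name="label")

encoded = smoothed_target_means(categories, target)
# "b" has count == min_samples_leaf, so its weight is 0.5 and it lands
# halfway between its own mean (0.0) and the prior (0.5), i.e. 0.25
print(encoded)
```

With the default parameters, the rare category "b" is pulled strongly toward the prior, while "a" stays close to its own mean of 2/3.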

Alex_Combessie · ‎02-18-2019

Hi,

It is not currently possible to change how the visual ML interface of Dataiku processes categorical variables. This request has already been logged.

I would advise using Dataiku's categorical variable handling and then catboost as a custom Python model, without any catboost-specific code for categorical variables. Alternatively, if you want something fully custom, you can code your own processing and ML pipeline in a Python recipe or notebook.

Hope it helps,

Alexandre
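
As a rough illustration of the fully custom route (a Python recipe or notebook, outside the visual ML interface), the sketch below hands catboost the raw categorical columns by position at fit time instead of in the constructor. The helper names `categorical_column_indices` and `fit_catboost` are hypothetical, and treating every non-numeric column as categorical is an assumption. Nothing here is specific to Dataiku.

```python
import pandas as pd


def categorical_column_indices(df):
    """Positions of the non-numeric columns, assumed here to be the categorical ones."""
    return [
        i for i, col in enumerate(df.columns)
        if not pd.api.types.is_numeric_dtype(df[col])
    ]


def fit_catboost(df, target_col, iterations=50):
    """Train a CatBoostClassifier on unprocessed categorical columns (sketch)."""
    # Imported lazily so the helper above can be used without catboost installed
    from catboost import CatBoostClassifier

    X = df.drop(columns=[target_col])
    y = df[target_col]
    cat_idx = categorical_column_indices(X)

    model = CatBoostClassifier(iterations=iterations, verbose=False)
    # cat_features can be given at fit time, so the constructor does not
    # need to know the layout of X in advance; categorical values must be
    # strings or integers (no NaN floats)
    model.fit(X, y, cat_features=cat_idx)
    return model
```

Usage would be along the lines of `model = fit_catboost(df, "label")` on a DataFrame read in the recipe, followed by `model.predict(...)` on data with the same column layout.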

tjh · ‎02-19-2019

Hi Alex,

Unfortunately, catboost needs unprocessed categorical variables as input, and a do-nothing processor does not exist in the visual ML interface.

I could use my own do-nothing processor with catboost in the visual ML interface, but then the output somehow has 0 rows during prediction.

Can you support catboost natively in the visual ML interface in future versions?

Alex_Combessie · ‎02-19-2019

The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.

OrsonWelles · ‎09-15-2020

Dear Alex,

Have you managed to solve this issue since then?

Thanks, best

Labels

Advanced ML

code

Machine Learning
