0 votes
Hi, I would like to use CatBoost (https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/). The minimum required configuration is to tell the constructor which features are categorical.

How can I specify these correctly, when I have no access to the X matrix?

Also, how can I prevent Dataiku from transforming these categorical features?

Thanks for your help,


What I am trying to do is use a custom transformer that does nothing to the categorical features, hoping that they are then passed to catboost.fit unchanged. Does the order of the features in Dataiku match the order of the columns in the dataframe during fitting? Seemingly not, but the log suggests that the order in which the features are preprocessed is the order in which they are arranged in the X matrix. So my non-encoded features were the last 3 columns. So it works!
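import sklearn.base

class Nothing(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Pass-through transformer: hands the column on without encoding it."""

    def fit(self, X, y=None):
        # Nothing to learn; return self as scikit-learn expects
        return self

    def transform(self, X):
        # X is expected to be a pandas Series (one column); return it as a DataFrame
        return X.to_frame()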
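
Regarding the column order described above: if the preprocessing order in the log really is the column order of the X matrix (an observation from this thread, not documented behaviour), the positions of the untouched categorical columns can be derived from that order. A minimal sketch, with hypothetical feature names:

```python
def categorical_indices(ordered_features, categorical_names):
    """Return the positions of the categorical columns in the final X matrix."""
    wanted = set(categorical_names)
    missing = wanted - set(ordered_features)
    if missing:
        raise ValueError("not in feature list: %s" % sorted(missing))
    return [i for i, name in enumerate(ordered_features) if name in wanted]

# Hypothetical feature order, as it might be read from the training log
ordered_features = ["age", "income", "city", "job", "brand"]
cat_idx = categorical_indices(ordered_features, ["city", "job", "brand"])
print(cat_idx)  # [2, 3, 4]
```

The resulting list is what CatBoost expects in its cat_features argument.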

Failed to train : <class 'cPickle.PicklingError'> : Can't pickle <class 'Nothing'>: attribute lookup __builtin__.Nothing failed

Solved the pickling error by moving the class into python libraries.
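
The reason this helps is that pickle stores a class by reference (module name plus class name) and has to find it again under that name. A small sketch that reproduces the failure and the fix outside Dataiku; the module name my_transformers is hypothetical and only stands in for a real library module:

```python
import pickle
import sys
import types

class Nothing(object):
    """Stand-in for the pass-through transformer."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

def can_pickle(obj):
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False

# Mimic the error above: the class claims to live in a module
# where it cannot be looked up by name.
Nothing.__module__ = "builtins"
before = can_pickle(Nothing())

# Mimic the fix: the class lives in an importable module.
module = types.ModuleType("my_transformers")  # hypothetical module name
module.Nothing = Nothing
sys.modules["my_transformers"] = module
Nothing.__module__ = "my_transformers"
after = can_pickle(Nothing())

print(before, after)  # False True
```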

2 Answers

0 votes
If the above does not work, can I use a custom encoding for the categorical features, such as a target encoder?
Here is my target encoder ...

Unfortunately it seems that fit is only called with X ... so this will not work.

import numpy as np
import pandas as pd
import sklearn.base

class TargetEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    def __init__(self, min_samples_leaf=1, smoothing=1, noise_level=0):

        self.dict_averages = {}
        self.dict_priors = {}

        self.min_samples_leaf = min_samples_leaf
        self.smoothing = smoothing
        self.noise_level = noise_level

    def fit(self, X, y=None):
        assert y is not None
        target = y
        self.y_col = y.name

        trn_series = X
        col = X.name

        temp = pd.concat([trn_series, target], axis=1)
        # Compute target mean
        averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
        # Compute smoothing
        smoothing = 1 / (1 + np.exp(-(averages["count"] - self.min_samples_leaf) / self.smoothing))
        # Apply average function to all target data
        prior = target.mean()
        # The bigger the count the less full_avg is taken into account
        averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
        averages.drop(["mean", "count"], axis=1, inplace=True)
        self.dict_averages.update({col: averages})
        self.dict_priors.update({col: prior})
        return self

    def transform(self, X):
        trn_series = X
        col = X.name
        ft_trn_series = pd.merge(
            trn_series.to_frame(trn_series.name),  # left side: the column to encode
            self.dict_averages[col].reset_index().rename(columns={'index': self.y_col, self.y_col: 'average'}),
            on=trn_series.name, how='left')['average'].rename(trn_series.name).fillna(self.dict_priors[col])
        # pd.merge does not keep the index so restore it
        ft_trn_series.index = trn_series.index
        X = ft_trn_series
        return X
processor = TargetEncoder()
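
To make the smoothing step in fit easier to follow, here is the same formula applied to single numbers. This is only an illustration of the arithmetic; the function name smoothed_mean and the example values are made up:

```python
import math

def smoothed_mean(cat_mean, count, prior, min_samples_leaf=1, smoothing=1):
    """Blend a category's target mean with the global prior, weighted by its count."""
    weight = 1.0 / (1.0 + math.exp(-(count - min_samples_leaf) / smoothing))
    return prior * (1.0 - weight) + cat_mean * weight

prior = 0.30  # made-up global target mean

# A category seen once is pulled halfway towards the prior (weight = 0.5)
rare = smoothed_mean(cat_mean=1.0, count=1, prior=prior)

# A category seen 50 times keeps almost exactly its own mean
frequent = smoothed_mean(cat_mean=1.0, count=50, prior=prior)

print(rare, frequent)
```

With count equal to min_samples_leaf the weight is exactly 0.5, so rare categories get an encoding halfway between their own mean and the prior.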
0 votes
Hi, it is not currently possible to change the way the visual ML interface of Dataiku processes categorical variables. This request has already been logged. I would advise using Dataiku's categorical variable handling and then CatBoost as a custom Python model, without specific code for categorical variable handling. Otherwise, if you want something fully custom, another option is to code your own processing and ML pipeline in a Python recipe/notebook. Hope it helps, Alexandre
Hi Alex,

Unfortunately CatBoost needs unprocessed categorical variables as input, and a do-nothing processor does not exist in the visual ML interface.

As mentioned, I was able to use my do-nothing transformer with CatBoost in the visual ML interface. But somehow the output has 0 rows during prediction.

Can you support CatBoost natively in the visual ML interface in future versions?
The request for custom categorical variable handling has been logged. I will log a specific request for catboost support.