Hi!

I created a model using the built-in Dataiku models, but the results look quite suspicious, so I would like to ask a few questions.

In the attached screenshot you can see that the model I created is a decision tree evaluated with 10-fold cross-validation. The model was created only for testing purposes, so I intentionally set the maximum tree depth to 100. This makes the tree very deep (I could see that in the Interpretation section), and the model should be overfitting heavily: it should perform very well on the training set and poorly on the test set. With cross-validation it should also perform poorly, since each fold is evaluated on data the model was not trained on. However, we can see an AUC of 0.892 here. Can you explain why we get this kind of performance, which seems clearly wrong for this model? And on which data exactly is the ROC AUC shown in the centre calculated?

 

Povilas
asked by
Forgot to add the screenshot:

https://ibb.co/grer6n

1 Answer

The resulting metric is the average of the metric over the 10 folds (each time computed on the held-out, untrained fold).
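Conceptually, the computation looks roughly like this (a minimal scikit-learn sketch on synthetic data; Dataiku's internal implementation may differ in details such as stratification and averaging):

```python
# Sketch of a K-fold CV metric: train on K-1 folds, score on the held-out
# fold, then average the per-fold scores. Illustrative only, not Dataiku's code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = DecisionTreeClassifier(max_depth=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], proba))  # scored on the held-out fold only

print("AUC per fold:", np.round(fold_aucs, 3))
print("Reported AUC (mean over folds):", np.mean(fold_aucs))
```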

What kind of performance are you getting:
* Without K-fold?
* With a reasonably sized decision tree?
* With a reasonably sized random forest?
answered by
* Without K-fold (a simple train/test split) I get very similar performance. Again, it is a very deep tree, and I suspect this is not the performance on the test set but on the train set.
* A decision tree with max depth = 5 gives 0.64 AUC.
* A typical random forest gives 0.95 (!)

I tried the same dataset using code for XGBoost and GBM; neither gives more than 0.75 AUC with 10-fold CV.
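The check was along these lines (a simplified sketch with synthetic placeholder data and parameters, not the exact code used):

```python
# Rough sketch of an external 10-fold CV AUC check with XGBoost
# (synthetic placeholder data and parameters, for illustration only).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

xgb = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1)
scores = cross_val_score(xgb, X, y, cv=10, scoring="roc_auc")  # AUC on each held-out fold
print("Mean 10-fold AUC:", scores.mean())
```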