
0 votes
I am currently leading a statistical analysis of absenteeism data. In this study, I am looking at the influence of multiple factors on employees' presence at work. But whenever I use logistic regression, I can't get p-values for the factors' coefficients (except when I use a PCA to reduce the dimension, but in that case I can't interpret the results, which does not serve my purpose either).

Does anyone know how to recover that on Dataiku?



1 Answer

0 votes
Best answer

DSS only shows p-values when there are fewer than 1000 coefficients after preprocessing (each categorical value becomes a coefficient). Even with fewer than 1000 coefficients, computing p-values is not always possible because of numerical issues.
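To make the coefficient count concrete, here is a minimal sketch of how categorical values add up once each one becomes its own coefficient. The column names and rows are made up for illustration, and the exact preprocessing DSS applies may differ (for example in how it handles rare values), so treat this as an approximation of the idea rather than DSS's actual logic.

```python
# Hypothetical toy data: column names and values are invented for illustration.
rows = [
    {"department": "sales", "contract": "full_time", "age": 34},
    {"department": "it", "contract": "part_time", "age": 45},
    {"department": "hr", "contract": "full_time", "age": 29},
    {"department": "sales", "contract": "full_time", "age": 51},
]
categorical_columns = ["department", "contract"]
numeric_columns = ["age"]


def count_coefficients(rows, categorical_columns, numeric_columns):
    """Count coefficients, assuming one per distinct categorical value
    plus one per numeric column (intercept not counted)."""
    n_dummies = sum(
        len({row[col] for row in rows}) for col in categorical_columns
    )
    return n_dummies + len(numeric_columns)


n_coefficients = count_coefficients(rows, categorical_columns, numeric_columns)
print(n_coefficients)  # 3 departments + 2 contract types + 1 numeric column
```

With only a handful of columns, the total stays far below the 1000-coefficient limit unless some columns have a very large number of distinct values.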

Beware that logistic regression in DSS is always regularized, and p-values are not strictly defined for regularized regressions.
Thank you for that (really) quick answer. However, I only have 14 columns, with 52 categorical values in total, so I am guessing that I'm facing those "numerical issues".

Could you explain what they are and how to get around them?

Many thanks
If you want to use p-values for rigorous statistical tests, I would advise using a logistic regression library which does not apply regularization. The scikit-learn version we use in the visual machine learning feature is regularized, which is better for classification performance, but less so for interpretability.
There is a Python implementation of unregularized logistic regression (a.k.a. logit) in the statsmodels library. Alternatively, you could use R, for example the glm function.

©Dataiku 2012-2018 - Privacy Policy