0 votes
When using the analyze tool for a column, is it possible to force the analysis to run on the whole dataset instead of just the current sample?

How else could I get, for example, a categorical analysis of a column for all of the data?
3 Answers

+1 vote

With DSS 1.x (I will update this post later if anything changes in DSS 2.0), you work on a sample whenever you explore a dataset or build a preparation script. As jrouquie suggested, you can change the sample size.

There is something that could help you: the Visualize tab. By default, it works on the same sample as the Explore tab.
However, if your dataset is stored in a SQL database or in Impala, you can change the engine and get graphs computed on the full dataset. Read more here: http://doc.dataiku.com/dss/1.4/visualization/sampling.html#live-in-database-engine

I hope that helps.

+1 vote
This feature is now available in DSS 4.0
0 votes
There is no such control inside the Analysis dialog box. You can, of course, change the current sample and set it to the whole dataset, in which case the analysis (and everything else in the preparation script) will be previewed on the whole dataset.

Note that the interface will hang if your dataset is too big (as a rule of thumb, compare its size to the default sample size, which is 30 000).

For datasets that fit in RAM, I would rather use the value_counts method of pandas.
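
As a rough illustration of the pandas approach, here is a minimal sketch. The DataFrame and the column name "color" are made up for the example; in practice you would load your own dataset into a DataFrame first, and this only works if the whole thing fits in memory.

```python
import pandas as pd

# Hypothetical data standing in for the full dataset
df = pd.DataFrame(
    {"color": ["red", "blue", "red", None, "green", "red", "blue"]}
)

# Number of rows per distinct value, most frequent first.
# dropna=False keeps missing values as their own category.
counts = df["color"].value_counts(dropna=False)

# Same breakdown expressed as a fraction of all rows
shares = df["color"].value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
```

Because value_counts runs on every row of the DataFrame, the result covers all of the data rather than a sample.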