When I analyze a column in a dataset, I have the options "Sample" and "Whole data". On "Whole data", I only get the percentages of empty vs. non-empty values; on "Sample", I also get the number of unique values. I assume this is because running the analysis on "Whole data" uses an approximate method like HyperLogLog? If so, what are the error rate parameters, and is there a way to get the actual distinct count without using Python?

1 Answer

In "Whole data" mode, some statistics are indeed not available because they would require heavy computations, and we try to limit the statistics to those that can be computed in one or two passes over the data.

The count of distinct values is not approximated, but the median, P25, and P75 values are computed with approximate percentiles. The implementation then depends on the backend: on the database if the dataset is SQL, or on Hive or Impala if the dataset is on HDFS, and is computed with t-digests using 100 bins.
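To illustrate why an exact distinct count fits in a single pass over the data, here is a minimal Python sketch. It is not how DSS computes its statistics: the function name `column_stats` and the sample data are hypothetical, and it assumes the distinct values fit in memory, since the set grows with the number of distinct values.

```python
import csv
import io


def column_stats(rows, column):
    """Single pass: count empty vs. non-empty cells and the exact number of distinct values."""
    empty = 0
    non_empty = 0
    distinct = set()  # memory grows with the number of distinct values
    for row in rows:
        value = row.get(column)
        if value is None or value.strip() == "":
            empty += 1
        else:
            non_empty += 1
            distinct.add(value)
    return {"empty": empty, "non_empty": non_empty, "distinct": len(distinct)}


# Hypothetical sample data standing in for a dataset export.
sample = io.StringIO("id,city\n1,Paris\n2,Lyon\n3,\n4,Paris\n")
stats = column_stats(csv.DictReader(sample), "city")
print(stats)
```

Exact percentiles, by contrast, need the values sorted or held in full, which is why approximations such as t-digests are common for them.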
The problem is, in my case there is no "distinct value count" for the "whole data" mode. I only get that for the sample subset. I added a screenshot in my original post.
The distinct value count will be in the numerical tab if your column type is numeric.
No matter how I format them, I get a distinct value count only for the sample, not for the whole dataset, even though I activated it in the settings...

©Dataiku 2012-2018 - Privacy Policy