Custom check to determine if a column's data is unique (does not have duplicates)

Solved!
joshlk
Level 1
I would like to run a check that fails if the column "col1" in my dataset has duplicate values. In the Metrics tab I am computing "Distinct value count" on col1 and "Records Counts" on the table. How do I write a custom Python check that compares the two, so that col1 is treated as unique only when "Distinct value count" equals "Records Counts"?

Alex_Combessie
Dataiker Alumni

Hi,

Here is an example of such a Python check:

# Define here a function that returns the outcome of the check.
def process(last_values, dataset, partition_id):
    # last_values is a dict of the last values of the metrics,
    # with the values as a dataiku.metrics.MetricDataPoint.
    # dataset is a dataiku.Dataset object
    count_records = last_values["records:COUNT_RECORDS"].get_value()
    count_distinct = last_values["col_stats:COUNT_DISTINCT:<PUT_YOUR_COLUMN_NAME_HERE>"].get_value()
    # The column is unique when every record has its own distinct value.
    if count_records == count_distinct:
        return "OK", "no duplicate"
    else:
        return "ERROR", "duplicates"


[EDIT] I had forgotten to call the get_value() method on last_values["..."]
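
If you want to try the logic outside DSS before pasting it into the check editor, here is a minimal sketch of a dry run. FakeMetricPoint is a hypothetical stand-in that only mimics the get_value() method used above; it is not part of the dataiku package, and the column name "col1" is taken from the question.

```python
class FakeMetricPoint:
    """Hypothetical stand-in for a metric data point; only get_value() is mimicked."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value


COLUMN = "col1"  # column name from the question
RECORDS_KEY = "records:COUNT_RECORDS"
DISTINCT_KEY = "col_stats:COUNT_DISTINCT:" + COLUMN


def process(last_values, dataset, partition_id):
    # Same comparison as in the check above.
    count_records = last_values[RECORDS_KEY].get_value()
    count_distinct = last_values[DISTINCT_KEY].get_value()
    if count_records == count_distinct:
        return "OK", "no duplicate"
    return "ERROR", "duplicates"


# Made-up metric values, for illustration only.
unique = {RECORDS_KEY: FakeMetricPoint(100), DISTINCT_KEY: FakeMetricPoint(100)}
duplicated = {RECORDS_KEY: FakeMetricPoint(100), DISTINCT_KEY: FakeMetricPoint(97)}

print(process(unique, None, None))      # ('OK', 'no duplicate')
print(process(duplicated, None, None))  # ('ERROR', 'duplicates')
```

The dataset and partition_id arguments are passed as None here because the comparison does not use them.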

Droid
Level 1

Thanks Alex, worked like a charm. If I am not mistaken, it's necessary to have the distinct count metric calculated for that column for each run as well in order to make this work.
