For each row in my dataset I'd like to know which row is most complete.
Does anyone know a way to count, for each row, the number of columns that have values, ideally without enumerating each column? Right now I have a formula that looks for blank values: if a cell is non-blank, it counts as one.
if(isBlank(Constituency),0,1)
+if(isBlank(Date),0,1)
+if(isBlank(Amount),0,1)
+if(isBlank(Category),0,1)
+if(isBlank(val("\(No designation\) 2022")),0,1)
+if(isBlank(val("\(No designation\) 2023")),0,1)
+if(isBlank(val("\(No designation\) 2019")),0,1)
+if(isBlank(val("\(No designation\) 2019")),0,1)
Operating system used: Ubuntu 18.04 -> WSL2 -> Windows 11
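Outside of a Dataiku formula, the same per-row completeness count can be sketched with pandas. This is just an illustrative sketch, not part of the original thread; the column names and sample data below are made up, and it assumes the dataset has been loaded into a DataFrame `df`.

```python
import pandas as pd

# Illustrative stand-in for the real dataset.
df = pd.DataFrame({
    "Constituency": ["North", None, "South"],
    "Date": ["2023-01-01", "2023-02-01", None],
    "Amount": [100, None, None],
})

# Treat empty strings like missing values, then count non-null cells per row.
df["filled_columns"] = df.replace("", pd.NA).notna().sum(axis=1)
```

`notna().sum(axis=1)` counts every column at once, so no column has to be enumerated by hand.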
Hi Tom,
I'm not sure if this can be done in a formula,
but one option you could try is a Python function in the Prepare recipe:
https://knowledge.dataiku.com/latest/kb/data-prep/prepare-recipe/python.html
You can keep the Mode as Cell,
and source code like the following should work:
def process(row):
    # Count every cell in the row that is not null.
    return sum(1 for i in row.values() if i is not None)
Another way to achieve this is the custom Python function below, which also checks for empty strings:
def process(row):
    col = 0
    for val in row.values():
        # Skip both nulls and empty strings.
        if val is not None and val != "":
            col += 1
    return col
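To see the difference between the two suggestions, here is a quick standalone sketch. The function names and the sample row dict are illustrative stand-ins for the row object the Prepare recipe passes in; they are not from the thread.

```python
def count_non_null(row):
    # Counts every value that is not None, including empty strings.
    return sum(1 for v in row.values() if v is not None)

def count_non_empty(row):
    # Also skips empty strings, not just None.
    return sum(1 for v in row.values() if v is not None and v != "")

row = {"Constituency": "North", "Date": "", "Amount": None, "Category": "Gift"}
print(count_non_null(row))   # → 3 (counts the empty Date cell)
print(count_non_empty(row))  # → 2 (skips the empty Date cell)
```

The second variant matters when upstream steps leave empty strings rather than true nulls in the cells.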
I'll check these out when I have a moment. Thanks.
This will be great as part of a technique I'm exploring to figure out which record to keep when I'm de-duplicating records: I want to keep the record with the richest set of additional data.
Your suggested solution seems to work fairly well for me.
Thanks.
I do wonder if @NN's solution might be more performant. However, my current dataset is fairly small, at a few thousand records.
P.S. I hope others find this useful as well.