0 votes
I am merging two datasets, and in the result I have 3 columns of bigint/integer, A, B and C, from which I compute a "usage percent", i.e. p = (A*B)/C with 0% < p < 100% expected.

The script correctly deduces that the output should be double/decimal, but there are in fact no decimals; all numbers end in .0. My first reaction was that in an operation with 3 integers, the result would be (implicitly) coerced to integer and then, when put into a decimal column, all decimals would be 0. However, even if I force the columns in the schemas of the input datasets to be "decimal", this still happens.
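Just to illustrate what I suspect is happening (purely hypothetical numbers): if A = 2, B = 3 and C = 7, integer arithmetic gives 2*3 = 6 and 6/7 = 0, so the decimal column would show 0.0 instead of the expected 6/7 ≈ 0.857.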

A temporary workaround is to multiply by a factor of 100 to get the percentage, but that seems crude and is likely to cause problems further on (and I still lose the decimals in the percentages, i.e. I always get, say, 18.0, 34.0, etc.). What am I missing?
asked by
Hi,

We can't reproduce this issue: in our tests the division is promoted to double. Do you have example values for A, B, C that cause the issue?

Note that formulas don't actually look at the types/meanings of the columns, because at the time formulas are computed, meanings have not been computed yet (everything is done in-stream in the prepare engine), so they only look at the actual values.
Here are some values, computed by the formula "(Reports_of-A*Factor)/Reporting_Units". In the table below, "Correct value" was computed after exporting to Excel, from which this copy was made. As you can see, I only get anything other than 0 with small sample sizes ("Reporting_Units" << 1000), where the usage by fluke sometimes gets over 100% (the sampling factor for "Reports_of-A" is either 100% or 5%, which "Factor" corrects for), and even there, there are no decimals (e.g. 13.0 instead of 13.3333333, as in row 2).

Country    Year    Month     Reporting_Units    Reports_of-A    Factor    Usage    Correct value
Afghanistan    2017    1    359    7    20    0.0    0,3899721
Afghanistan    2017    2    3    2    20    13.0    13,3333333
Afghanistan    2017    2    447    4    20    0.0    0,1789709
Afghanistan    2017    1    30    1    20    0.0    0,6666667
Afghanistan    2017    2    24    1    20    0.0    0,8333333
Afghanistan    2017    1    10    1    20    2.0    2,0000000
Afghanistan    2017    3    13    10    1    0.0    0,7692308
Afghanistan    2017    1    702    3    20    0.0    0,0854701
Afghanistan    2017    3    106    1    20    0.0    0,1886792
Afghanistan    2017    3    720    10    20    0.0    0,2777778
Afghanistan    2017    1    177    2    20    0.0    0,2259887
Afghanistan    2017    1    99    1    20    0.0    0,2020202
Afghanistan    2017    2    57    1    20    0.0    0,3508772
Afghanistan    2017    3    38    14    1    0.0    0,3684211
Afghanistan    2017    2    304    2    20    0.0    0,1315789
Afghanistan    2017    1    76    4    20    1.0    1,0526316
Afghanistan    2017    3    33    2    20    1.0    1,2121212
Afghanistan    2017    1    12    1    20    1.0    1,6666667
Afghanistan    2017    3    358    8    20    0.0    0,4469274
Afghanistan    2017    1    34    1    20    0.0    0,5882353
Afghanistan    2017    2    163    2    20    0.0    0,2453988
Afghanistan    2017    1    114    2    20    0.0    0,3508772
Afghanistan    2017    2    35    1    20    0.0    0,5714286
Afghanistan    2017    3    1    1    20    20.0    20,0000000
OK, I hadn't understood that you were talking about the computed columns of the join recipe. The problem here comes from the fact that the formula is actually translated to SQL, so the SQL type coercion rules apply.

The easiest fix here is to write 1.0 * A * B / C, which should coerce the whole computation to double, not only the output.
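For instance, as a rough sketch (assuming a PostgreSQL-like SQL dialect; the exact coercion rules depend on your database), using the values from the first row of your table:

    -- Both operands are integers, so the division truncates: 140 / 359 -> 0
    SELECT (7 * 20) / 359;

    -- The 1.0 literal promotes the whole expression to a numeric type: ~0.3899721
    SELECT (1.0 * 7 * 20) / 359;

The same promotion happens when the 1.0 * is written directly at the start of the formula.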
Thanks, that resolved the issue!

1 Answer

0 votes
Best answer
Dataiku, thanks to your support, I now understand that formulas in a merge recipe are converted to SQL queries, which is where the integer coercion happens. Writing the formula as p = (1.0*A*B)/C resolved the issue. Thanks!
answered by
selected by