
Basically, I want DataIku to stop changing storage types just because it thinks it knows better than me. These people seem to have the same problem.

I have large tables of roughly 30 million rows. Some columns are stored as string because the database documentation defines them as string, even though almost all of the rows (including every row in the sample) are numeric. In rare cases there is actually a letter in there, which crashes my recipe. I know that these columns contain strings, and I don't want DataIku to convert them to bigint.

How do I stop DataIku from doing this without manually changing the column type? I am looking for a per-project or global setting, since this happens with basically every visual recipe I am using.

IMO, optimally, DataIku should never do this by itself - it can't know what the table will hold in the future, and any users who have no idea about data storage types will be confused as to what is wrong with their recipe. Instead, it should suggest the change to the user with a nice explanation and let them approve it manually. It's better to waste a bit of storage space and compute power than to create potentially hard-to-detect problems by silently converting data types.


1 Answer

In the visual preparation recipe, you can apply arbitrary manipulations to the data (search & replace, formulas, Python code...), which is why DSS has to do type inference. Other visual recipes can compute the actual output schema from the resulting types of what is configured in the recipe, but there is no simple solution for visual preparation.

In the column view, you also have mass actions on columns, including setting the column type. The dataset's schema screen offers the same kind of tool. That is admittedly manual, but faster than doing it column by column.

For an automated solution, you can use the public API or the internal Python API to write a simple script that sets the string type for all columns of a given dataset, and package it in a macro, for example. Then, when you edit your visual preparation recipe and it warns you that the output schema is not the same as the inferred schema, you can either click Ignore so that it doesn't override the output dataset's schema, or re-run your macro afterwards before running the recipe.
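As a rough illustration of such a script, here is a minimal sketch using the public API client (`dataikuapi`). It assumes the dataset handle exposes `get_schema()` and `set_schema()`, and that the schema is a dict with a `columns` list of `{"name": ..., "type": ...}` entries. The helper names, the `keep` parameter, the host URL, the API key, the project key and the dataset name are all hypothetical placeholders, not something from your setup.

```python
import copy


def force_string_schema(schema, keep=()):
    """Return a copy of a schema dict with every column typed as string.

    Columns whose names are listed in `keep` (hypothetical parameter)
    are left untouched.
    """
    new_schema = copy.deepcopy(schema)
    for column in new_schema.get("columns", []):
        if column["name"] not in keep:
            column["type"] = "string"
    return new_schema


def set_all_columns_to_string(dataset, keep=()):
    """Apply force_string_schema to a dataset handle.

    `dataset` is assumed to behave like a dataikuapi dataset handle,
    i.e. to provide get_schema() and set_schema(schema).
    """
    schema = dataset.get_schema()
    new_schema = force_string_schema(schema, keep)
    if new_schema != schema:
        # Only write back when something actually changed.
        dataset.set_schema(new_schema)
    return new_schema


def main():
    # Imported here so the helpers above can be reused without the client.
    import dataikuapi

    # Placeholder connection details: replace with your own.
    client = dataikuapi.DSSClient("https://dss.example.com:11200", "YOUR_API_KEY")
    dataset = client.get_project("MYPROJECT").get_dataset("my_output_dataset")
    set_all_columns_to_string(dataset)
```

Calling `main()` (or the same logic from a macro) after editing the preparation recipe and before running it would reset the output dataset's columns to string. The sketch only changes storage types and leaves the rest of the schema as it is.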