I have an input dataset that contains one column of JSON data, which needs to be:
This works fine when the output dataset is "filesystem_managed", but when it is "HDFS_managed" I get a long list of errors and no output. From the log I see:
[11:56:34] [WARN] [com.dataiku.dip.input.formats.parquet.ParquetOutputWriter] - OUTPUT_DATA_BAD_TYPE: Unable to write row 3 to Parquet: Failed to write in column usage (content:{"duration":4,"start_time":1495161529377,"package_name":"com.sonyericsson.home","count":1,"origin_google_play":false}): A JSONArray text must start with '[' at 1 [character 2 line 1]
java.io.IOException: Failed to write in column usage (content:{"duration":4,"start_time":1495161529377,"package_name":"com.sonyericsson.home","count":1,"origin_google_play":false}): A JSONArray text must start with '[' at 1 [character 2 line 1]
Since the JSON data is read correctly from the source file set, it seems strange that it cannot be written back in the same way. I also wonder about the error text "A JSONArray text must start with '[' at 1 [character 2 line 1]". I don't know how the string is stored internally, but it appears that index "1" corresponds to character "2". In the string I have, the FIRST character is "[", so could there be a mismatch between how the string is stored internally and how the write function is implemented? Since the original data can be unpacked with the "unnest" processor, at least that function has no problem interpreting the JSON correctly, so the issue seems to lie with the "write to hdfs_managed" functionality.
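For what it's worth, the cell content quoted in the error message ({"duration":4,...}) is a JSON object, not a JSON array, and the Parquet writer apparently insists on an array for that column. A minimal Python sketch of the mismatch (the cell value is copied from the log above; everything else is illustrative):

```python
import json

# Cell content copied from the error message above
cell = ('{"duration":4,"start_time":1495161529377,'
        '"package_name":"com.sonyericsson.home","count":1,'
        '"origin_google_play":false}')

parsed = json.loads(cell)          # parses fine as plain JSON
print(type(parsed).__name__)       # dict -> a JSON object, not an array
print(isinstance(parsed, list))    # False -> a writer that requires a
                                   #          JSON array would reject this cell
```

So the value itself is valid JSON; the failure seems to be about its top-level type (object vs array), not about malformed text.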