0 votes

Hi, this is something I would normally submit as a pull request, but you don't have a public API, so I thought it best to file a bug report here instead.


If an object that isn’t a Spark DataFrame is passed to the `write_with_schema` function in dataiku.spark, whether by accident or through a bug in the calling code, the underlying code tries to access the Spark context of the assumed DataFrame and crashes with an internal Dataiku error:


[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils]  -   File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 139, in write_with_schema

[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils]  -     write_schema_from_dataframe(dataset, dataframe)

[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils]  -   File "/opt/dataiku-dss-5.1.2/python/dataiku/spark/__init__.py", line 122, in write_schema_from_dataframe

[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils]  -     dsc = __dataikuSparkContext(dataframe._sc._jvm)

[2019/04/25-08:07:24.881] [null-out-100] [INFO] [dku.utils]  - AttributeError: 'Dataset' object has no attribute '_sc'



This can easily happen if a function returns `None` and that value gets passed to the writer instead of a DataFrame, resulting in the same kind of AttributeError.
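To illustrate the `None` case, here is a minimal sketch that needs no Spark or Dataiku installation. The function `build_dataframe` is a hypothetical helper of mine, not part of any library; it simply forgets its `return` statement, so the caller receives `None`, and the same attribute access that appears in the traceback then fails.

```python
def build_dataframe(rows):
    # Hypothetical helper with a bug: the result is built but never returned,
    # so Python implicitly returns None.
    result = list(rows)


df = build_dataframe([1, 2, 3])

try:
    # Same attribute access that write_schema_from_dataframe performs
    # according to the traceback (dataframe._sc).
    df._sc
except AttributeError as exc:
    message = str(exc)
    print(message)
```

The message names `NoneType` and `_sc` but says nothing about which argument was wrong, which is what makes the failure hard to read in a job log.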



A single line asserting that the `dataframe` object is a Spark DataFrame could be added just before line 122 of dataiku/spark/__init__.py, where the code first accesses the underlying Spark context. A TypeError would be more helpful to the user than the current stack trace.
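A rough sketch of the kind of check I have in mind is below. It is not the actual Dataiku code: `require_instance`, `FakeDataFrame` and `write_schema_sketch` are names I made up for illustration, and `FakeDataFrame` is only a stand-in so the example runs without Spark. In the real module the expected type would presumably be `pyspark.sql.DataFrame`.

```python
def require_instance(value, expected_type, arg_name):
    """Raise a TypeError with a readable message if value has the wrong type."""
    if not isinstance(value, expected_type):
        raise TypeError(
            "%s must be a %s, got %s"
            % (arg_name, expected_type.__name__, type(value).__name__)
        )
    return value


class FakeDataFrame(object):
    """Stand-in for pyspark.sql.DataFrame (assumption, for illustration only)."""


def write_schema_sketch(dataset, dataframe, dataframe_type=FakeDataFrame):
    # Hypothetical wrapper: validate the argument before anything
    # touches dataframe._sc, so the user sees a clear TypeError.
    require_instance(dataframe, dataframe_type, "dataframe")
    return "ok"


try:
    write_schema_sketch("my_dataset", None)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

With a check like this, passing `None` or a Dataiku `Dataset` fails immediately with a message that names the offending argument and the type that was actually received.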


1 Answer

0 votes
Thank you very much for this report and for investigating a solution. I'll pass this information on to the development team.