
Hi,

I get this error message when training a classification model with MLlib:

[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  - java.lang.IllegalArgumentException: No rows in train dataframe after target remap. Target empty? Type mismatch?
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionJob$$anonfun$prepare$1.apply$mcV$sp(MLLibPredictionJob.scala:216)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionJob$$anonfun$prepare$1.apply(MLLibPredictionJob.scala:212)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionJob$$anonfun$prepare$1.apply(MLLibPredictionJob.scala:212)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.ProgressListener.push(ProgressListener.scala:46)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionJob$class.prepare(MLLibPredictionJob.scala:212)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionDoctorJob$.prepare(MLLibPredictionDoctorJob.scala:20)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionDoctorJob$delayedInit$body.apply(MLLibPredictionDoctorJob.scala:72)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.SuicidalApp$$anonfun$delayedInit$1.apply$mcV$sp(package.scala:402)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.App$$anonfun$main$1.apply(App.scala:71)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.App$$anonfun$main$1.apply(App.scala:71)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.collection.immutable.List.foreach(List.scala:318)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at scala.App$class.main(App.scala:71)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionDoctorJob$.main(MLLibPredictionDoctorJob.scala:20)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at com.dataiku.dip.spark.MLLibPredictionDoctorJob.main(MLLibPredictionDoctorJob.scala)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at java.lang.reflect.Method.invoke(Method.java:497)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:710)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
[2017/03/21-09:19:30.848] [Exec-282] [INFO] [dku.utils]  -      at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What is the problem?

asked by anonymous

1 Answer

Hi,

Assuming that your target column is indeed properly filled, the most probable cause is a "boolean normalization mismatch".

If your target column has the "boolean" storage type (note: this is the storage type, not the meaning; see https://doc.dataiku.com/dss/4.0/schemas/), then for MLlib to work properly it MUST contain "true" and "false" as values.

In other words, for an MLlib target with the boolean storage type, values like "0" or "1" are not supported.
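
To see whether this is what is happening, it can help to count the target values that are not exactly "true" or "false". Below is a minimal sketch in plain Python, not DSS or Spark code: the function name and the sample data are hypothetical, and it assumes you have already pulled the target column out as a list of strings.

```python
from collections import Counter

# Per the explanation above, these are the only values a boolean target may hold.
CANONICAL = {"true", "false"}


def non_canonical_values(values):
    """Count the values that are not exactly "true" or "false"."""
    return Counter(str(v) for v in values if str(v) not in CANONICAL)


# Hypothetical sample of a target column, as read from the dataset.
sample_target = ["true", "1", "false", "0", "1", "yes"]
print(non_canonical_values(sample_target))
```

An empty counter means every value is already "true" or "false"; anything else lists the values that would need to be normalized.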

When reading CSV files, DSS accepts more than just "true" and "false" as booleans: it also accepts values like 0, 1, yes, no, ... MLlib does not. You can force DSS to convert all "non-real-boolean" values to "real-boolean" values by checking the "Normalize booleans" checkbox in the dataset format settings.
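
To illustrate what this kind of normalization means, here is a rough sketch in plain Python. It is not how DSS implements the checkbox, and the checkbox remains the simplest fix. The function name is hypothetical, the token sets contain only the values named above (the full list DSS accepts is not reproduced here), and ignoring case and surrounding whitespace is an assumption.

```python
# Only the tokens mentioned above; DSS may accept others.
TRUE_TOKENS = {"true", "1", "yes"}
FALSE_TOKENS = {"false", "0", "no"}


def normalize_boolean(value):
    """Map a boolean-like value to the string "true" or "false"."""
    # Assumption: matching ignores case and surrounding whitespace.
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return "true"
    if token in FALSE_TOKENS:
        return "false"
    # Fail loudly rather than guess at an unknown value.
    raise ValueError(f"not a recognised boolean value: {value!r}")


print([normalize_boolean(v) for v in ["1", "0", "yes", "No", "true"]])
# prints ['true', 'false', 'true', 'false', 'true']
```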