
When I try to process my CSV dataset on HDFS with Spark, I get the error message "java.io.IOException: Unterminated quoted field at the end of the file".

What is the reason?

1 Answer


Your dataset probably contains multi-line records (quoted fields with embedded newlines), which Spark cannot reliably process when reading CSV directly.

Spark and Hadoop parallelize work by splitting input files into segments and processing the segments in parallel. For CSV files, the split falls at an arbitrary byte offset: each worker scans forward to the next end-of-line character and starts parsing from there.
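A minimal sketch of why this breaks, using Python's standard csv module and hypothetical toy data: a newline inside a quoted field is indistinguishable, at the byte level, from a record boundary, so a scanner that treats every newline as the start of a new record cuts the quoted field in half.

```python
import csv
import io

# Toy CSV whose quoted field contains an embedded newline (hypothetical data)
data = 'id,comment\n1,"line one\nline two"\n2,"ok"\n'

# A real CSV parser handles the quoted newline: 1 header + 2 records
records = list(csv.reader(io.StringIO(data)))
print(len(records))   # 3

# A split that treats every newline as a record boundary (as a block
# scanner does) produces one "record" too many, with an unterminated quote:
lines = data.splitlines()
print(len(lines))     # 4
print(lines[1])       # 1,"line one
```

A worker whose segment starts in the middle of that quoted field would begin parsing at `line two"` and misinterpret the rest of the file, which is exactly the "Unterminated quoted field" failure mode.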

Thus, it is not really possible to process multi-line CSV records in Spark (or Hadoop), since the split may land in the wrong place. We strongly recommend that you start by syncing your CSV dataset to a Parquet or ORC one (using the local DSS engine instead of Hadoop or Spark). Once you are on a non-textual format, the issue disappears.

Alternatively, the error can also be caused by an invalid quoting style: see http://answers.dataiku.com/561/unterminated-quoted-field-at-the-end-of-the-file

©Dataiku 2012-2018 - Privacy Policy