0 votes
asked by

1 Answer

0 votes
This means that somewhere in the file, there is a starting quote but no ending quote, like this:

42,"a properly quoted field",3.14159
43,"an unterminated quote,2.71828

A single quotation problem can trigger the skipping of a large chunk of your dataset, considered to be just one cell (because DSS can't decide, when encountering a newline character, whether this is the start of a new row of data, or if it is part of a quoted field with a quotation problem). So the best solution is to correct the quotation problem.

You should check that DSS detected the "Separator" and the "Quoting character" right. You can also try changing the "Quoting style". See https://doc.dataiku.com/dss/latest/formats/csv.html .

You can try splitting the file into parts, and see which part causes problems while loading. Iterating this dichotomy, you might find one problematic line.

If you're sure that no field in your data can contain a newline, (for instance, there is a newline in the address field here:
  name,address,age
  Smith,"1 rue de la Paix
     , 75002 Paris",42
  Doe,"2 avenue des Champs Elysées
      75008 Paris",73
) then you might ssh to the server where the datafile resides, and copy-paste this:
   awk '{line=$0; gsub(\"[^\\\"]\",\"\"); if(length % 2 == 1){print NR; print line}}'  datafile.csv
This little awk program:
 - saves the current line($0) in variable \"line\"
 - ignores all chars that aren't a double quote (with gsub)
 - count the number of double quotes remaining on the line
 - if this number is odd, print the line number (NR) and the whole line. This should help pinpoint the problematic line(s).

As a last resort, to help to pinpoint where in the input file there is an unterminated quote, you could set the quoting style to “No escaping nor quoting”, which should allow to load the dataset, even if some lines might be wrong. While exploring this dataset, changing the sampling (potentially loading the whole dataset, if the server has enough memory) and looking at the last columns should show it fully empty or not, depending on the sample.
Filter to see only the empty lines or the non empty lines, which should direct you to the problematic line far from the beginning.

 

Once you have found the problematic line, correct it directly in the input file.
answered by
923 questions
956 answers
953 comments
1,782 users