Too many lines in my dataset in Hive

UserBird · 10-18-2015

I have a dataset with free text stored in HDFS, in CSV format.

When I open the Explore view, everything looks OK. However, when I query the table in the Hive notebook, I get several rows for each record. It looks like the \n characters in my original file are not properly escaped and are treated as new lines.

Clément_Stenac · 10-18-2015

Hi,

Unfortunately, this is inherent to the way Hadoop (and therefore Hive) handles "text files", which include CSV. To distribute the chunks of a file across workers, Hadoop splits it at arbitrary byte offsets and relies on \n to find where each record starts. A \n inside a quoted field is therefore treated as a record boundary, so multi-line CSV fields cannot be handled.
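To illustrate the difference, here is a minimal Python sketch. It is not what Hive runs internally; it only contrasts cutting records on \n with a quote-aware CSV parser. The sample data and the variable names are made up for the example.

```python
import csv
import io

# Hypothetical sample: a header and two records. The second record has a
# newline inside a quoted field, as free-text columns often do.
raw = 'id,comment\n1,"all good"\n2,"first line\nsecond line"\n'

# Line-oriented view: every "\n" ends a record, whether or not it sits
# inside quotes. This approximates how a plain text reader cuts the file.
naive_rows = raw.splitlines()

# CSV-aware view: the parser tracks quotes, so the embedded newline stays
# part of the field instead of starting a new record.
parsed_rows = list(csv.reader(io.StringIO(raw, newline="")))

print(len(naive_rows))   # 4: header plus three "lines"
print(len(parsed_rows))  # 3: header plus two records
```

The extra line in the first count is the same effect as the extra rows seen in the Hive notebook.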

When processing data on HDFS, we strongly advise using dedicated file formats like ORC or Parquet, which provide both far better performance and better compatibility.

Labels

Datasets

File formats

Hadoop
