I have a Hive dataset, RAW DATA, with a timestamp column stored as a string.
The data in the table is as below (screenshot).
I want to partition this table on the timestamp column, down to the day level, into Parquet files.
How can we do it?
1) Go to the dataset's connection settings and click Activate partitioning.
2) In Configure dimensions, select Time Range as the dimension type.
3) In Range, choose Day.
In the SQL query, change the date format, e.g.:
select *
from example
where DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '${day}'
In the query listing partitions:
SELECT day FROM (
  SELECT DATE_FORMAT(effective_dt, 'yyyy-MM-dd') AS day FROM example
) subq GROUP BY day;
In Partition for preview, give one date, e.g. 2017-12-31.
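For reference, outside of DSS the same day-level split into Parquet can be sketched in plain HiveQL using dynamic partitioning. This is only a sketch: the target table name `example_parquet` and the column list are assumptions, so substitute your actual columns from the source table.

```sql
-- Sketch only: a day-partitioned Parquet table in plain HiveQL.
-- `example_parquet` and the column list are assumptions.
CREATE TABLE example_parquet (
  effective_dt STRING
  -- ... plus the other columns of `example`
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

-- Let Hive create all day partitions in a single INSERT
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE example_parquet PARTITION (day)
SELECT
  effective_dt,
  -- ... the other columns, in the same order as the table definition
  DATE_FORMAT(effective_dt, 'yyyy-MM-dd') AS day
FROM example;
```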
The query is not returning any data.
The time_stamp is in string format; what should we do to change it to a date?
So, just to be sure: besides partitioning the data at the day level, you want to move it into Parquet files? Or just do the DSS partitioning?
In any case, the solution given by @GMR is perfect once you deal with the date-parsing issue: you first have to transform the time_stamp string into a Date.
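If you prefer to do that conversion on the Hive side rather than in a Prepare recipe, here is a sketch. The input pattern below is an assumption; adjust it to match how your strings actually look.

```sql
-- If the strings already follow Hive's default 'yyyy-MM-dd HH:mm:ss'
-- layout, a plain CAST is enough:
SELECT CAST(effective_dt AS TIMESTAMP) AS effective_ts
FROM example;

-- Otherwise, parse with an explicit pattern (the pattern below is an
-- assumption; change it to match your actual strings):
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(effective_dt, 'dd/MM/yyyy HH:mm:ss')) AS effective_ts
FROM example;
```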
I want to partition the table and move it into Parquet files.
If I use a Prepare recipe to parse the date, can I set the output as a partitioned dataset stored in an HDFS managed folder?
Hi @sasidharp ,
Sorry for the delayed answer. Here is a recording of the process I followed to do what you are asking for (a video is worth 1000 pictures, and a picture is worth...):
There are a couple of things to notice:
1) The dataset is a copy of your screenshot, so there is very little data!
2) When I create the preparation recipe to parse the data, I choose an HDFS dataset as the output (it will later be synchronized automatically to the Hive metastore), and then I select CSV (Hive). This is only because our Hadoop ecosystem doesn't have the Parquet libraries installed; you should be able to select Parquet instead.
3) When I run the recipe, you can see that I select one specific day. For some reason I can't explain (maybe a Dataikuer around could help?), that is enough to generate the partitions for all days in your source data; you can check that at the end.
If something fails, I'm not sure I would be able to help, and I would recommend asking Dataiku support directly. With HDFS and Hive there are always specificities for each installation.
Hope this helps!