I have a Hive dataset, RAW DATA, with a timestamp column stored as a string.
The data in the table is as below (screenshot).
I want to partition this table on the timestamp column, down to the day level, into Parquet files.
How can we do it?
1) Go to the dataset's connection settings and click Activate partitioning.
2) In Configure dimensions, select Time Range as the dimension type.
3) In Range, choose Day.
In the SQL query, change the date format, e.g.:
select *
from example
where DATE_FORMAT(effective_dt, 'yyyy-MM-dd') = '${day}'
In the query listing partitions:
SELECT day FROM (
  SELECT DATE_FORMAT(effective_dt, 'yyyy-MM-dd') AS day FROM example
) subq GROUP BY day;
In Partition for preview, give one date, e.g. 2017-12-31.
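For reference, outside of DSS the same day-level split into Parquet can be sketched in plain HiveQL using dynamic partitioning. This is only a sketch: the target table name `example_parquet` and the column list are assumptions, so substitute your actual columns from the source table.

```sql
-- Sketch only: a day-partitioned Parquet table in plain HiveQL.
-- `example_parquet` and the column list are assumptions.
CREATE TABLE example_parquet (
  effective_dt STRING
  -- ... plus the other columns of `example`
)
PARTITIONED BY (day STRING)
STORED AS PARQUET;

-- Let Hive create all day partitions in a single INSERT
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE example_parquet PARTITION (day)
SELECT
  effective_dt,
  -- ... the other columns, in the same order as the table definition
  DATE_FORMAT(effective_dt, 'yyyy-MM-dd') AS day
FROM example;
```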
The query is not returning any data.
The time_stamp is in string format; what should we do to change it to a date?
So, just to be sure: besides partitioning the data at the day level, you want to move it into Parquet files? Or just do the DSS partitioning?
In any case, the solution given by @GMR is perfect once you deal with the date-parsing issue: you first have to transform the time_stamp string into a Date.
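If you prefer to do that conversion on the Hive side rather than in a Prepare recipe, here is a sketch. The input pattern below is an assumption; adjust it to match how your strings actually look.

```sql
-- If the strings already follow Hive's default 'yyyy-MM-dd HH:mm:ss'
-- layout, a plain CAST is enough:
SELECT CAST(effective_dt AS TIMESTAMP) AS effective_ts
FROM example;

-- Otherwise, parse with an explicit pattern (the pattern below is an
-- assumption; change it to match your actual strings):
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(effective_dt, 'dd/MM/yyyy HH:mm:ss')) AS effective_ts
FROM example;
```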
I want to partition the table and move it into Parquet files.
If I use a Prepare recipe to parse the date, can I set the output as a partitioned dataset stored in an HDFS managed folder?
Hi @sasidharp ,
Sorry for the delayed answer. Here is a recording of the process I followed to do what you are asking for (a video is worth 1000 pictures, and a picture is worth...):
There are a couple of things to notice:
1) The dataset is a copy of your screenshot, so there is very little data!
2) When I create the preparation recipe to parse the data, I choose an HDFS dataset as the output (it will later be synchronized automatically to the Hive metastore), and then I select CSV (Hive). This is only because our Hadoop ecosystem doesn't have the Parquet libraries installed; you should be able to select Parquet instead.
3) When I run the recipe, you can see that I select one specific day. For some reason I can't explain (maybe a Dataikuer around could help?), that is enough to generate the partitions for all days in your source data; you can check that at the end.
If something fails, I'm not sure I would be able to help, and I would recommend asking Dataiku support directly. With HDFS and Hive there are always specificities for each installation.
Hope this helps!