0 votes
Hi everybody,

I'm trying to filter a table of 2.7M rows in order to get a sample.

Here is what I did:

- I created a filter

- I chose: Filter ON

- Keep only rows that satisfy: All the following conditions

- I entered the condition

- For the sampling, I chose: Whole data

When I run it, my filter doesn't return any rows, although it is supposed to.

What is the problem?

Thanks in advance

3 Answers

+1 vote
Best answer

Hello,

Thanks for the diagnosis. After investigation, it seems the issue was caused by a case mismatch between the column names in your original Parquet file and those in the Hive table. Your input dataset was generated manually as a Parquet file with the column name "MANDT" (uppercase). It was then imported from Hive into DSS. However, Hive always converts column names to lowercase, so DSS saw the column as "mandt", which does not match the name stored in the original Parquet file. As of today, we cannot detect this kind of case automatically.
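Since this cannot be detected automatically, you could run a check like the one below yourself. It is only a minimal sketch in plain Python: the function name `find_case_mismatches` and the extra column `amount` are made up for illustration, and the two lists of names are assumed to have been copied by hand from the Parquet file schema and from the Hive table definition.

```python
def find_case_mismatches(parquet_columns, hive_columns):
    """Return (parquet_name, hive_name) pairs that differ only by letter case."""
    # Index the Hive names by their lowercase form for a case-insensitive lookup.
    hive_by_lower = {name.lower(): name for name in hive_columns}
    mismatches = []
    for name in parquet_columns:
        hive_name = hive_by_lower.get(name.lower())
        # Same column once case is ignored, but spelled differently: report it.
        if hive_name is not None and hive_name != name:
            mismatches.append((name, hive_name))
    return mismatches


# Hypothetical schemas: "MANDT" comes from this thread, "amount" is invented.
parquet_columns = ["MANDT", "amount"]
hive_columns = ["mandt", "amount"]
print(find_case_mismatches(parquet_columns, hive_columns))
```

Any pair it reports is a column that a case-sensitive reader may fail to find under the name Hive exposes.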

The preferred solution would be to only generate Parquet files with lowercase column names, so that they are compatible with Hive (and Impala as well). 
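As a rough illustration of that first option, here is a small sketch that builds a rename mapping to lowercase names before the Parquet file is written. The helper `lowercase_column_names` is hypothetical and not part of DSS or of any Parquet library; applying the mapping depends on whatever tool you use to write the file.

```python
def lowercase_column_names(columns):
    """Map each column name to its lowercase form, refusing ambiguous renames."""
    renamed = {}
    seen = {}  # lowercase name -> original name that produced it
    for name in columns:
        lower = name.lower()
        if lower in seen:
            # Two columns that differ only by case would collide once lowercased.
            raise ValueError(
                "columns %r and %r collide once lowercased" % (seen[lower], name)
            )
        seen[lower] = name
        renamed[name] = lower
    return renamed


# "MANDT" comes from this thread; "Amount" is an invented example column.
print(lowercase_column_names(["MANDT", "Amount"]))
```

The collision check matters because two columns that differ only by case cannot both survive the rename.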

If that option is not possible, you may try changing the recipe engine from DSS to Hive. In fact, for large datasets, it is recommended to switch the recipe engine to a Hadoop-related one (Spark, Hive or Impala). You should gain performance by pushing the computation down to your Hadoop cluster instead of having the data streamed to DSS.

Cheers,

Alex

Hello Alex,
Thank you so much for your help, I appreciate it!
0 votes
Hello,

Could you please give us more details as to the nature and content of your filter? Have you checked that the value you are filtering on is indeed in the whole dataset?

Cheers,

Alex
Thank you Alexandre for your reply,
I tried several examples and none of them works.
For example: I selected a column named mandt, then I tried "mandt equals 100" (all the rows have the value 100) -> result: 0 rows
I tried "mandt is different from 100" -> result: 0 rows
"mandt is defined" -> result: 0 rows
etc.
:-(
Could you tell us how you are creating your filter? Is it a filter in the sampling definition? A step in a recipe? A filter in the view of the sample?
You can attach some screenshots to your comments to show us what you are trying to achieve.
0 votes

 

@Alex

Here are the screenshots

Could you please share the job diagnosis after you run the recipe? You can download it in the page of the job, under Actions > Download job diagnosis.
Sorry, I downloaded the job diagnosis but I don't know how to share it.
You can use any file transfer service you want, for instance WeTransfer.
Unfortunately I can't use these websites; they are all blocked at the company where I'm doing my internship.
Can you send it as an email attachment to my address ([email protected])? Side-note: if your company has subscribed to Dataiku, you can also contact our official support https://support.dataiku.com. The website answers.dataiku.com is meant for community support.