
Hi everyone,

I was plotting some data from a webshop sales order dataset. When I plotted the totals by week, one week stood out sharply from all the others over a 4-year period: week 1 of 2016. Its total was a little more than double that of the second-highest week, making it a huge peak compared to the rest of the graph.

I immediately classified this point as an outlier, so I started investigating by narrowing the graph down to this particular week. Surprisingly, when I explored the week broken down by day, there was no outlier, and the daily totals did not add up to the amount shown in the weekly graph. So there had to be some other error. When I included a few more dates before and after this week, I noticed that the week 1 total kept growing as I added days before the start of week 1 according to the calendar (January 4th until January 10th).

It was then that I noticed that week 52 of 2015 only appeared once I included December 27th. When I went back to the range of December 28th until January 10th, all data was classified as week 1. That is when I realised that 2015 had a week 53.

So there we have it. Week 53 of 2015, which runs from December 28th until January 3rd, is grouped with week 1, skewing my graph. But I can't throw away the sales of this week because they are relevant and valid data. The high peak is no longer that surprising, because the weeks between December and February are the busiest for this organisation.

How am I supposed to deal with this 53rd week so that my graph is correct without throwing away its data? Am I doing something wrong, or is deleting a week's worth of sales rows really the only solution?

Edit, chart grouped by week:

Thanks in advance,

Casper

Hello, what do you use to extract the week number? Is it the visual "extract date component" processor in a Prepare recipe?
Hi Alex,

No, it's the default aggregation by the graph engine "By" column. I've updated my original post with a screenshot.
Hello,

Do you have data for every day of the week in the original dataset? Would you be able to share with us part of the dataset so we can reproduce it on our side?

Best regards,

Alex
That would be great. Do you have an e-mail address, or some other way I could send you the dataset?
Thanks for the quick reply. I've just sent you an email with the relevant dataset.

1 Answer

Best answer

Hello,

Thank you for sending a sample of the data. I was able to reproduce the issue. It is a bug, which I reported to our R&D team.

In the meantime let me suggest the following workaround: add a Python processing step in a Prepare recipe with the following code:

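from datetime import datetime

def process(row):
    # Parse the timestamp stored in the "created_at" column
    python_date = datetime.strptime(row["created_at"], "%Y-%m-%dT%H:%M:%S.000Z")
    # isocalendar() returns (ISO year, ISO week number, ISO weekday)
    year, weeknumber, _ = python_date.isocalendar()
    # Zero-pad the week number so the labels sort correctly (e.g. "2016-W01")
    return "%d-W%02d" % (year, weeknumber)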

This will create a new column with the expected year and week number, which you can use for aggregation in the charts.
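If you want to check the week boundaries outside of the recipe, here is a minimal standalone sketch. It only uses the standard library; the helper names `iso_week_label` and `week_labels` are made up for this illustration and are not part of DSS. It labels each day around the 2015/2016 year boundary with its ISO year and week:

```python
from datetime import date, timedelta

def iso_week_label(d):
    # isocalendar() returns (ISO year, ISO week number, ISO weekday)
    year, week, _ = d.isocalendar()
    return "%d-W%02d" % (year, week)

def week_labels(start, end):
    # Map every day in [start, end] to its ISO year-week label
    labels = {}
    current = start
    while current <= end:
        labels[current.isoformat()] = iso_week_label(current)
        current += timedelta(days=1)
    return labels

if __name__ == "__main__":
    # Date range taken from the question: December 27th to January 10th
    for day, label in week_labels(date(2015, 12, 27), date(2016, 1, 10)).items():
        print(day, label)
```

With these dates, December 28th to January 3rd should come out as 2015-W53 and January 4th to January 10th as 2016-W01, so the two weeks stay separate when you aggregate on the label.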

Here is a screenshot if that helps:

Best regards,

Alex

Great stuff, thanks. :) Hope to see this fixed in an upcoming release.