0 votes

I am trying to use Dataïku to evaluate an implementation of a recommender system, but it is hard to represent this RS in a standard ML problem.

To give you the scale of the data, I have about 6M transactions of 4M users on 200K products. Products have features, some of them come from text analysis so I could have a few thousand columns. I am trying to use a content-based recommender, therefore in terms of machine learning my problem can be expressed as follows:

product features x user -> purchase or not

Here is the issue : it is not efficient to represent it that way. Features require thousands of columns. The combination of products and users requires 4Mx200K rows, which is 800 billions, so that's not really possible.

I created a program that takes advantage of the sparsity of this matrix, and outputs recommendations from a trained model, similar to a cosine similarity RS. However, as far as I know, analyzing an algorithm of machine learning in DSS using a plugin requires it to be formatted in the really unpractical way that I exposed earlier.

The Dataiku interface is very relevant to evaluate my solution. I intend to use various precision metrics and avoiding to implement again all of that would save me a lot of work.

Is there a way for me to use the Dataiku interface anyway?



Jean Creusefond
asked by Jean Creusefond

1 Answer

+3 votes
Hello Jean,

Indeed your data seems to be too large to recommend products using a "brute force" scoring of all users vs all products.


1) a meta model can still be used to combine different rankings or recommender engines. Let's say, you have a content based filtering based on similarities of purchases based on descriptions, a collaborative filtering one and one based on product images similarities. Each can be efficiently calculated with Spark or Hadoop technologies.

Now, since you have all purchases of all clients you can generate positives examples of purchases. You would still need some 0s. Taking everything is way to large so wha you can do is sample 5 or 10 negative examples per client that purchased something. This strategy can vary depending on your context, for example you can select 5 products randomly if you have few information but you could also select 5 products the client was exposed to but did not purchase (providing you can construct a proxy of exposition).

This would allow you to  train a model (for ex a logistic regression) to combine your different models. In the same way you could create any features and put that in the model (still with sampling done on the 0 cases).

Now the issue you would face is that scoring time may be too big (all customers against all products => big table). So you restrict only to the first K products / clients coming out out of the content based recommender system to be able to scale. This approach is called re-ranking.

I would start with a content based recommender system and compare it to other versions. This would give a minimal viable product. Then iterate to create more complex models such as the reranking one I briefly described.

2) from what I understood, you would still want to have a look at the performances metrics and graphs available in the ML tool in DSS not to have to code them yourselves. You can still keep the 1 (purchases) and randomly pick 0s to create a fake training dataset. You should have some score (similarities from content based) given to every tuple (client,product). You can then either calculate performances metrics through python or sql... or you can do a dirty hack and create a logistic regression model in DSS taking only the similarity score as features and get all the graphs and metrics for a model that give the same priority (in terms of AUC and lift, NOT log loss or f1 score due to change in probabilities scale) to similarities.

Hope this helps. Contact me for more details / information.

answered by
By the way. I did a presentation on the topic you can find here: https://www.slideshare.net/PierreGutierrez2/from-labelling-open-data-images-to-building-a-private-recommender-system

Notice that in this particular case we don't have a problem at scoring time because we recommend "sales"  on the website that are opened the next day = around 700 sales.

Have a good day,

710 questions
729 answers
461 users