0 votes
Hi, I have a dataset where some combination of columns results in a unique key:

1    A

1    B

2    B

2    C

3    A

The first column with numerical values represents users and the second column with characters are their transactions. Each user has a unique identifier, but it is repeated in the file for every transaction.

I am trying to create a dataset where all users that purchased a certain item, for instance A, are completely removed. If I just delete the rows containing A (e.g. using a Prepare recipe), User 1 is still in the dataset via purchase B. I had a thought that this could be accomplished by identifying the list of unique users who had purchased A, and then using that list of users as a filter on the original dataset. I am not sure of the best way to accomplish this -- it seemed like the simplest way might be to do something like a left join, based on the filtered list, but set the condition that the output dataset REMOVES the rows that match.

Is this possible, or is there a better method? Or can this only be accomplished using SQL?
asked by

1 Answer

0 votes
Best answer


What you describe is often described as the "ANTI JOIN" concept in SQL.  I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.

Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:

FROM actor a
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id

This is documented online in this article: https://blog.jooq.org/2017/01/12/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql/. For other SQL flavours, similar articles certainly exist to help you write the right script.

Hope it helps,


answered by
selected by
Thank you, Alex.
994 questions
1,023 answers
3,027 users

┬ęDataiku 2012-2018 - Privacy Policy