Join that REMOVES matching rows

acrow6 · ‎12-07-2018

Hi, I have a dataset where some combination of columns results in a unique key:

1 A

1 B

2 B

2 C

3 A

The first column with numerical values represents users and the second column with characters are their transactions. Each user has a unique identifier, but it is repeated in the file for every transaction.

I am trying to create a dataset where all users that purchased a certain item, for instance A, are completely removed. If I just delete the rows containing A (e.g. using a Prepare recipe), User 1 is still in the dataset via purchase B. I had a thought that this could be accomplished by identifying the list of unique users who had purchased A, and then using that list of users as a filter on the original dataset. I am not sure of the best way to accomplish this -- it seemed like the simplest way might be to do something like a left join, based on the filtered list, but set the condition that the output dataset REMOVES the rows that match.

Is this possible, or is there a better method? Or can this only be accomplished using SQL?

Alex_Combessie · ‎12-10-2018

Hi,

What you describe is often described as the "ANTI JOIN" concept in SQL. I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.

Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:


SELECT *
FROM actor a
WHERE NOT EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

This is documented online in this article: https://blog.jooq.org/2017/01/12/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql/. For other SQL flavours, similar articles certainly exist to help you write the right script.

Hope it helps,

Alex

View solution in original post

Alex_Combessie · ‎12-10-2018

Hi,

What you describe is often described as the "ANTI JOIN" concept in SQL. I would recommend indeed to accomplish this using a SQL script recipe in Dataiku.

Unfortunately, there is not a common syntax for ANTI JOINs in SQL, as for regular LEFT/RIGHT/INNER/... JOINS. Depending on the flavour of SQL database you are using, you can look up the specific syntax you need. For instance, in PostgreSQL you can use a syntax like this:


SELECT *
FROM actor a
WHERE NOT EXISTS (
  SELECT * FROM film_actor fa
  WHERE a.actor_id = fa.actor_id
)

This is documented online in this article: https://blog.jooq.org/2017/01/12/a-probably-incomplete-comprehensive-guide-to-the-many-different-ways-to-join-tables-in-sql/. For other SQL flavours, similar articles certainly exist to help you write the right script.

Hope it helps,

Alex

acrow6 · ‎12-11-2018

Thank you, Alex.

AshleyW · ‎01-30-2023

Hi @acrow6 ,

We've recently released a new version of Dataiku, 11.3, and you can now add additional output datasets to a Join recipe that contain all the unmatched rows from either a left, right, or inner join. This will let you easily create datasets with rows that don't match the join!

Anti-joins/exclusion joins are on our roadmap.

Cheers,

Ashley

Join that REMOVES matching rows

Join that REMOVES matching rows

Labels

Join recipe

Visual recipes

Sign up to take part

Join that REMOVES matching rows

Join that REMOVES matching rows

Labels

Join recipe

Visual recipes