Hi,

Is it possible to make calls to the OpenRefine server used by DSS? We want to connect to the OpenRefine server that DSS is already using from a "Python Recipe" instead of a "Prepare Recipe".

Please let us know how we can do it.

Waiting for your kind response.

Regards,

Samriddhi
Hi Sam, Open Refine Server is not a built-in capability of Dataiku DSS. Has your organization developed a proprietary plugin to add it? Could you please detail this specific integration with documentation and screenshots?
Hi Alex,

When we spoke with Dataiku's team previously, we were told that OpenRefine is leveraged for some of the prepare operations. Given the striking similarity between DSS's ability to cluster and merge rows of a text column and OpenRefine's text clustering capability (http://www.padjo.org/tutorials/open-refine/clustering/), we assumed that this was one of those operations.

DSS's clustering works great, but we need to be able to do this step programmatically, both for automated updates when the data changes and for handling columns containing 1M+ total rows and 10K+ rows to merge. That many find/replace operations in a prepare recipe cause the browser to become unresponsive.

1 Answer

Hi,

The clustering feature in a "Prepare" recipe will not dynamically update when the dataset changes, and is only designed for small datasets fitting into memory.

For clustering large datasets on text with automated updates, we advise using a clustering recipe: https://doc.dataiku.com/dss/latest/machine_learning/unsupervised.html

Cheers,

Alex

Alex,

Traditional clustering, e.g. unsupervised learning that is provided in DSS clustering recipes, is very different from the type of fuzzy matching that is being done here. The problems you describe above, though, are exactly why we need to be able to do this programmatically. The link that Sam posted: http://www.padjo.org/tutorials/open-refine/clustering/ shows the OpenRefine text facet clustering that appears to be the capability that DSS is leveraging within prepare recipes.

So the question is simply whether it is possible to make calls from Python to the OpenRefine server that we believe to be running with DSS (as this shows: https://doomicile.de/story/simple-text-analysis-using-python-identifying-named-entities-tagging-fuzzy-string-matching-and ), or whether we need to install our own OpenRefine server or seek a different programmatic solution.

Thank you for your time and help.

Best,
John
Hi John, Sam,
Thanks for the explanation. Access to the OpenRefine server included in DSS is not currently supported. I have relayed your request to our R&D team.
There are several ways to implement this in DSS.
1. Without code, with visual DSS features: use a clustering algorithm on vectorized text with a high number of clusters. I have used it successfully myself on several occasions; it works well for a moderate number of clusters (<300).
2. With code: many Python libraries offer fuzzy matching functionality. The closest one to your need would be https://github.com/OpenRefine/refine-client-py/blob/master/README.rst. That requires installing an OpenRefine server alongside Dataiku DSS. Otherwise, you can use the fuzzywuzzy Python library, which does not require installing OpenRefine.
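To illustrate the code route in option 2, here is a minimal sketch of matching incoming terms against a list of known-correct terms. It deliberately uses Python's standard-library difflib rather than fuzzywuzzy, so that it runs without extra packages; the helper name `match_to_dictionary`, the sample terms, and the 0.8 cutoff are all assumptions made for the example, not anything DSS or OpenRefine provides.

```python
import difflib


def match_to_dictionary(term, known_terms, cutoff=0.8):
    """Return the closest known term, or None if nothing is similar enough.

    Hypothetical helper for illustration; comparison is case-insensitive.
    """
    # Map lowercased spellings back to the original known terms.
    lowered = {known.lower(): known for known in known_terms}
    hits = difflib.get_close_matches(
        term.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


# Sample data, made up for the example.
known = ["Dataiku", "OpenRefine", "Python"]
results = {
    term: match_to_dictionary(term, known)
    for term in ["dataikuu", "Open Refine", "Excel"]
}
print(results)
```

This kind of matching assumes you already have a dictionary of correct values; it does not discover clusters on its own.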
Hope it helps,
Alex
Alex,

Thank you very much. Yes, I have successfully used the visual prepare recipe to merge ~3K clusters found from ~700K rows, but the browser becomes very unresponsive. Packages like fuzzywuzzy and fuzzyset are great for matching misspelled terms to a dictionary of known correct terms, but what we have here is a bit different. We have a big list of terms and have no idea which, if any, are spelled correctly, and just need to cluster together the ones that likely refer to the same entity.
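For reference, the kind of dictionary-free clustering described above can be sketched in plain Python with a key-collision approach modelled on OpenRefine's documented "fingerprint" method: normalize each value to a key, then group values that share a key. This is only a sketch under assumptions; it is not DSS's internal implementation, and the function names `fingerprint` and `cluster_terms` and the sample values are hypothetical.

```python
import re
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value):
    """Build a normalization key: ASCII-fold, lowercase, strip punctuation,
    then sort and de-duplicate the whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text.strip().lower())
    return " ".join(sorted(set(text.split())))


def cluster_terms(values):
    """Map each variant to the most frequent spelling among the values
    sharing its fingerprint. Values with no other variant are left out."""
    groups = defaultdict(Counter)
    for value in values:
        groups[fingerprint(value)][value] += 1

    mapping = {}
    for counts in groups.values():
        if len(counts) < 2:
            continue  # nothing to merge for this key
        canonical = counts.most_common(1)[0][0]
        for variant in counts:
            mapping[variant] = canonical
    return mapping


# Sample data, made up for the example.
values = ["Acme Corp.", "acme corp", "Corp Acme", "Acme Corp.", "Globex"]
mapping = cluster_terms(values)
print(mapping)
```

Each value is processed once, so the cost grows linearly with the number of rows. The resulting mapping could then be applied in a Python recipe, for example with pandas via `df[col].map(lambda v: mapping.get(v, v))`. Note that this key only catches differences in case, punctuation, accents, and word order; actual typos would need a nearest-neighbour style comparison on top.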

Thanks for the help and the github link. We'll check it out and get something working!

Best,
John