
Hi There,

 

I am trying to use the TextBlob package within a Dataiku recipe.

More specifically, I'm trying to create a Python recipe that translates the column "Description" from Russian to English using this package.

I'm basing this on a script I found in the context of a Kaggle competition:

https://www.kaggle.com/gunnvant/russian-to-english-translate-with-progress-bar

I wanted to see how I could incorporate this into a Dataiku recipe (I took out the references to the progress bar, which I don't need here).

 

--------------------------------------------------------------

My input is "translate_2", which consists of two columns:

-"ID": Integers

-"Description": Russian text with a few missing values

My output is "output"

----------------------------------------------------------------------

 

 

I have reworked the code into the result below to integrate it into Dataiku:

 

# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import sys
import textblob


# Read recipe inputs
train_Raw_filtered = dataiku.Dataset("translate_2")
x = train_Raw_filtered.get_dataframe()

 

    
#Takes data frame as input, then searches and fills missing description with недостающий (russian for "missing")
   
def desc_missing(x):
   
    if x['Description'].isnull().sum()>0:
        x['Description'].fillna("недостающий",inplace=True)
        return x
    else:
        return x

x=desc_missing(x)
  

#Translate

def translate(x):
    try:
        return textblob.TextBlob(x).translate(to="en")
    except:
        return x
    
x=translate(x)
   
    
#Map to new column
def map_translate(x):
    x['en_desc']=x['Description']
    return x

x=map_translate(x)


# Write recipe outputs to dataiku
train_Raw_Translated = dataiku.Dataset("output")
train_Raw_Translated.write_with_schema(x)

 

 

The code runs without error. It does impute the "missing" value, but the actual translation is not written to the recipe output. The en_desc column just inherits the original values:

 

When I take a look at the logs I find this line which I don't know how to interpret at this point:

 

Bottom line:

  • I would expect en_desc to contain the translation, but it does not.
  • Do you have any idea what I'm doing wrong here? I can't figure it out myself.

Any help would be appreciated.

Thanks a million.

 

Kind Regards,

Tim

 


1 Answer

Hi Tim,

This is a Python question rather than a Dataiku DSS one. The log is fine, and the way you read and write through the dataiku package is correct.

Then it is a matter of debugging your code.

We advise prototyping in a Jupyter notebook first, so you can execute the code block by block interactively. Some advice: prototype on a smaller sample, add print statements, and never use an except clause that swallows the error without logging or re-raising it. Otherwise your code could be wrong and you would have no way of seeing it.
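
To illustrate the point about except clauses, here is a minimal sketch. The names fragile_upper, silent and loud are made up for this example and have nothing to do with TextBlob or Dataiku; fragile_upper just stands in for any call that only accepts strings. Both wrappers return the input unchanged on failure, but only the second one leaves a traceback in the log.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("translate_debug")


def fragile_upper(value):
    # Hypothetical stand-in for a call that only works on strings.
    return value.upper()


def silent(value):
    # Same pattern as in the question: any failure is hidden.
    try:
        return fragile_upper(value)
    except:  # noqa: E722 (bare except kept on purpose for the demo)
        return value


def loud(value):
    # Same fallback, but the traceback ends up in the log.
    try:
        return fragile_upper(value)
    except Exception:
        logger.exception(
            "fragile_upper failed for input of type %s", type(value).__name__
        )
        return value


not_a_string = ["a", "b"]  # wrong input type, like passing a whole table
silent_result = silent(not_a_string)  # returns the input, no trace of a problem
loud_result = loud(not_a_string)      # returns the input AND logs the error
```

With the second variant, the recipe log would show the exception type and the offending input type, which is usually enough to spot a wrong argument.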

In particular I would inspect the behaviour of your translate function.
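
One likely cause, judging only from the code in the question: translate(x) is called with the whole DataFrame rather than with each string in "Description", so the call fails, the bare except returns x unchanged, and map_translate then copies "Description" into en_desc. Below is a rough sketch of a per-row version. The helper names translate_text and add_translation are hypothetical, and fake_translator is a stand-in dictionary lookup so the example runs without network access; it is not the TextBlob API.

```python
import logging

import pandas as pd

logger = logging.getLogger("translate_debug")


def translate_text(text, translator):
    """Translate a single string and log any failure instead of hiding it."""
    try:
        return str(translator(text))
    except Exception:
        logger.exception("Translation failed for %r", text)
        return text


def add_translation(df, translator, source_col="Description", target_col="en_desc"):
    """Return a copy of df with a translated version of source_col in target_col."""
    out = df.copy()
    # Fill missing descriptions with the Russian word for "missing", as in the question.
    out[source_col] = out[source_col].fillna("недостающий")
    # Apply the translator to each value, not to the DataFrame as a whole.
    out[target_col] = out[source_col].apply(
        lambda text: translate_text(text, translator)
    )
    return out


def fake_translator(text):
    # Hypothetical stand-in for a real translation call.
    return {"привет": "hello", "недостающий": "missing"}[text]


sample = pd.DataFrame({"ID": [1, 2], "Description": ["привет", None]})
result = add_translation(sample, fake_translator)
print(result)
```

In the recipe itself, the translator argument could be something like lambda text: textblob.TextBlob(text).translate(to="en"), assuming the installed TextBlob version still provides translate as used in the question. If I remember correctly, newer TextBlob releases no longer ship that method, so it is worth checking the installed version first.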

Cheers,

Alex