
Hi,

 

When I run a Python script in the notebook editor, using the same Python environment as the default one (Python 2), and save the results to a new dataset (CSV) using:

import dataiku

# new_dataset is the pandas DataFrame produced by the script
dataiku_dataset = dataiku.Dataset("Demo_Api_REST_V1_Import")
dataiku_dataset.write_with_schema(new_dataset, dropAndCreate=True)

 

Then the new dataset is correctly encoded as UTF-8: it contains no u'data' strings and, more importantly in my case, no JSON arrays like {u'key': u'value'}, which completely break my JSON prepare recipes afterwards.

 

But when I run the same script from the flow or the Python recipe editor (not the notebook), all my data gets saved as unicode u'strings'.

 

It's the exact same script with the same environment and output dataset (there is no input dataset, since fetching the data is what this script does). Needless to say, I am enforcing UTF-8 encoding every way I can possibly think of:

import json

import pandas as pd
import requests

akeneoDf = None  # accumulates the pages returned by the API

# next_url starts at the first page of the API; headers hold the auth tokens
while next_url is not None:
    response = requests.get(next_url, headers=headers)
    json_response = json.loads(response.content.decode('utf-8'))

    # Follow the HAL-style pagination link, if any
    next_link = (json_response.get('_links') or {}).get('next') or {}
    next_url = next_link.get('href')

    items = (json_response.get('_embedded') or {}).get('items')
    if items:
        page_df = pd.read_json(json.dumps(items), encoding="utf-8")
        akeneoDf = page_df if akeneoDf is None else akeneoDf.append(page_df)

 

How come?

Hello, are you comparing the pandas data frame created in the notebook with the output dataset created in the flow?
Yes. When I refresh the same dataset after running either the recipe or the notebook version, the difference is immediately noticeable. It's the same dataset in the flow.
What happens if you copy paste exactly the same recipe code and run it in a notebook? Then you can go back to the dataset view, refresh the sample and check.
Not quite sure what exactly you mean here.

The code in the Python recipe is identical to the code in the notebook; I wrote the recipe in the notebook. The problem is that running it in the notebook gives me the correctly UTF-8-encoded output dataset, but running it as a Python recipe or as part of the flow does not.
Have you refreshed the sample of the datasets after each comparison? https://imgur.com/a/uWgvV8y
Hi Alex,

apparently this problem occurred only in Python 2.

After I switched to Python 3, everything worked as it should.

1 Answer

Best answer

Hi,

Encoding is handled differently in Python 2 and Python 3. 
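To make this concrete, here is a minimal illustration of where the u prefixes come from. When pandas writes a column of dicts to CSV, each cell is stringified the same way str() does below, which is why the Python 2 output carries the u'...' artifacts:

import json

item = json.loads('{"key": "value"}')

# Python 2: json.loads returns unicode strings, so the repr carries u prefixes:
#   str(item) == "{u'key': u'value'}"
# Python 3: all strings are unicode by default, so the repr is clean:
#   str(item) == "{'key': 'value'}"
print(str(item))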

However, when you execute the exact same Python code with the exact same version of Python and the same libraries, and then write the pandas dataframe output to a Dataiku dataset using the same method (write_with_schema, for instance), there cannot be any difference in the resulting dataset, whether it is run from a recipe or from a notebook. This assumes that you compare the two in the Dataiku dataset view, after refreshing the sample.

However, if you compare the pandas dataframe view you get in the notebook (before writing to a dataset) with the dataset sample view, there can be small differences, in particular around encoding. This is expected by design: a pandas dataframe is an in-memory Python object, while a Dataiku dataset is physically stored as a text file or a database table. Pandas dataframes and Dataiku datasets are two different kinds of objects.
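If staying on Python 2 is necessary, one workaround is to serialize the nested objects to JSON text yourself before writing the dataset, so the stored CSV contains real JSON rather than Python reprs. A minimal sketch, assuming the API items end up in a column named 'values' (a hypothetical name; adapt it to your dataframe):

import json
import dataiku

# 'values' is a hypothetical column holding dicts/lists from the API.
# json.dumps with ensure_ascii=False yields clean JSON text in both Python 2
# and Python 3, so the stored dataset never contains reprs like {u'key': ...}
akeneoDf['values'] = akeneoDf['values'].apply(
    lambda cell: json.dumps(cell, ensure_ascii=False))

dataiku_dataset = dataiku.Dataset("Demo_Api_REST_V1_Import")
dataiku_dataset.write_with_schema(akeneoDf, dropAndCreate=True)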

I hope it helps to clarify the matter.

Cheers,

Alex
