0 votes
While loading my data, pandas converts some types aggressively: integers become doubles, strings containing only digits become decimals, leading zeros are stripped from strings such as "00042", and so on.

I'd like to set the column types while loading the data, so that these conversions don't corrupt my data. How can I do that?

import dataiku

my_df = dataiku.Dataset("my_data")
# my_df = my_df.get_dataframe(infer_with_pandas=False)
my_df = my_df.get_dataframe()

Using infer_with_pandas=False didn't work. The job failed with: Error in Python process: <type 'exceptions.ValueError'>: Integer column has NA values in column 13. I understand what it's telling me, but I do have NAs in some of my columns, and I need to keep them.

1 Answer

+1 vote

Hi, 

You need to set the storage type of the column in DSS to string.

Then, in a Python recipe, tell pandas not to infer types. Load the dataframe with:

test = dataiku.Dataset("test")

test_df = test.get_dataframe(infer_with_pandas=False)

However, the NA values will disappear, because pandas converts them to NaN. You can fill the empty values with 'NA':

test_df["id"] = test_df["id"].fillna("NA")

If you write test_df to an output dataset, you'll get your original id column back.
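
To see the effect outside DSS, here is a minimal sketch in plain pandas. It is only a stand-in: it uses pd.read_csv on made-up sample text (csv_text is hypothetical), not the Dataiku API, and it assumes a reasonably recent pandas. It shows default inference dropping the leading zeros, and reading the column as a string plus fillna keeping them.

```python
import io

import pandas as pd

# Hypothetical sample data: an id column with leading zeros and one empty cell.
csv_text = "id,count\n00042,1\n,2\n00007,\n"

# Default inference: the ids are parsed as numbers, so "00042" loses its zeros,
# and the empty cell forces the whole column to float.
inferred = pd.read_csv(io.StringIO(csv_text))

# Reading every column as a string keeps the raw text; empty cells still
# become NaN, so they are filled with the literal string "NA" afterwards.
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)
as_text["id"] = as_text["id"].fillna("NA")

print(inferred["id"].tolist())
print(as_text["id"].tolist())
```

The same idea applies to the recipe above: keep the column as text on the way in, and only then decide how missing values should be represented.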

Matt

I only have empty cells, and I get the same error on integer columns:
ValueError: Integer column has NA values in column 14
I also get this on a boolean column with empty cells:
ValueError: cannot safely convert passed user dtype of |b1 for object dtyped data in column 15

I get the same error with or without fillna. Did I miss something?
I now understand the int limitation, but the boolean error is still there:
ValueError: cannot safely convert passed user dtype of |b1 for object dtyped data in column 15
OK, figured it out myself: get_dataframe(infer_with_pandas=False, bool_as_str=True)
From docstring:
        * bool_as_str -- Leave boolean values as strings (default False)
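
The integer error above looks like the one pandas' own CSV parser raises when a plain int64 column contains an empty cell. The sketch below reproduces it with plain pandas, not the Dataiku API. The sample text and the try_read helper are hypothetical, and it assumes a recent pandas that supports the nullable "Int64" dtype. Reading the boolean column as a string mirrors what bool_as_str=True is described as doing.

```python
import io

import pandas as pd

# Hypothetical sample: the second row has empty count and flag cells.
csv_text = "id,count,flag\n00042,1,true\n00007,,\n"


def try_read(dtypes):
    """Return the parsed DataFrame, or the ValueError raised by the parser."""
    try:
        return pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
    except ValueError as exc:
        return exc


# A plain NumPy int64 column cannot hold a missing value, so parsing fails.
strict = try_read({"id": str, "count": "int64", "flag": str})

# The nullable "Int64" extension dtype keeps the integers and the missing cell.
nullable = try_read({"id": str, "count": "Int64", "flag": str})

print(type(strict).__name__)
print(nullable.dtypes)
```

If the columns must stay integers in pandas, the nullable dtype is one option. Otherwise, keeping them as strings avoids the conversion entirely.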