Hi,
We've been trying to upload a parquet file to a Dataiku managed folder, but the upload fails with a UnicodeDecodeError. Uploading CSV files works as expected; only parquet files fail. My use case requires parquet files. Is there any way to upload them?
Below is the screenshot of the error.
Best,
Sagar
Hi,
Could you confirm how you are trying to copy the parquet file to the managed folder? Perhaps by sharing a code snippet.
Using get_download_stream and upload_stream should work if the input file is already in parquet format.
Here is a simple example of copying a parquet file.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
input_folder = dataiku.Folder("uWgyw8kG")
output_folder = dataiku.Folder("Ni8VoNvi")
filename = "userdata1.parquet"
parquet_file = input_folder.get_download_stream(filename)
output_folder.upload_stream("userdata1_copied.parquet", parquet_file)
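The failing code is not shown in this thread, so the following is only a guess at the cause: a UnicodeDecodeError usually means binary bytes were decoded as text somewhere along the way (for example, a file opened in text mode). The sketch below uses a made-up byte payload as a stand-in for parquet content and two hypothetical helpers, try_text_decode and copy_stream, to show why decoding fails while a byte-for-byte copy does not.

```python
import io

# Hypothetical stand-in for parquet content. Real parquet files start and end
# with the magic bytes b"PAR1"; the middle bytes here are invented for the demo.
fake_parquet = b"PAR1" + bytes([0xFF, 0xFE, 0x00, 0x80]) + b"PAR1"


def try_text_decode(payload):
    """Return the exception name if UTF-8 decoding fails, otherwise None."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError as exc:
        return type(exc).__name__
    return None


def copy_stream(source, destination, chunk_size=8192):
    """Copy bytes between binary file-like objects without decoding them."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        destination.write(chunk)


# Treating the binary payload as text fails.
text_result = try_text_decode(fake_parquet)

# Copying it as raw bytes keeps the content intact.
src = io.BytesIO(fake_parquet)
dst = io.BytesIO()
copy_stream(src, dst)
```

Passing the stream straight from get_download_stream to upload_stream, as in the example above, keeps the content as bytes from end to end, which is presumably why it avoids the error.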
If you need to convert a dataset to parquet, you can use something like the following. Note that you will need to add pyarrow to your code env for this.
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import io
# Read recipe inputs
orders = dataiku.Dataset("orders")
orders_df = orders.get_dataframe()
# define managed folder output
output_folder = dataiku.Folder("uWgyw8kG")
output_filename = "orders.parquet"
#convert to parquet
f = io.BytesIO()
orders_df.to_parquet(f)
#write output
output_folder.upload_data(output_filename, f.getvalue())
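Before uploading, a cheap sanity check on the bytes can catch cases where something other than parquet ended up in the buffer. A plain (unencrypted) parquet file begins and ends with the 4-byte magic PAR1. The helper name looks_like_parquet is hypothetical and not part of the dataiku API; this is a minimal sketch and does not validate the file's contents.

```python
PARQUET_MAGIC = b"PAR1"

# Smallest plausible layout: leading magic (4 bytes), footer length field
# (4 bytes), trailing magic (4 bytes). Real files are larger.
MIN_PARQUET_SIZE = 12


def looks_like_parquet(data):
    """Heuristic check: True if the bytes start and end with the parquet magic.

    Hypothetical helper; it only inspects the magic bytes and the length,
    it does not parse the footer or the row groups.
    """
    return (
        len(data) >= MIN_PARQUET_SIZE
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )
```

In the snippet above this could be called as looks_like_parquet(f.getvalue()) just before upload_data.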
Let me know if this helps or please share the code that is generating the error.
Hi,
Can you attach the Python code used to do the upload? In particular, how the source parquet file is opened or fetched, and which upload method is used.
Thanks @AlexT and @fchataigner2 for the response.
@AlexT your approach worked for me. Since my file was already in parquet format, I used upload_stream to upload it to the managed folder.
Thanks.