0 votes
Hi,

I am trying to upload multiple excel files with multiple sheets at the same time into one dataset. Is it possible to add a special column indicating the original dataset (excel file + sheet)?

Many thanks
by
reopened by
Hi, Are the columns the same across all sheets?
Hi, no, the columns are not necessarily the same across all sheets. But those columns that are of our interest (and are reflected in the schema) are the same.

1 Answer

0 votes

Hi,

If columns are different across sheets, you will need to upload the dataset multiple times for each set of sheets with the same columns. Another option would be to put the file in a managed folder and use several files-in-folder datasets. Then you can use a stack recipes to remap the columns as necessary. 

Alternatively, you could read the file with python from a DSS managed folder as such:

import dataiku
import pandas as pd, numpy as np
import os

folder_path = dataiku.Folder("MYFOLDER").get_path()
excel_path = os.path.join(folder_path, "MYFILE.xlsx")
sheet_list = ["SHEET1", "SHEET2"]


df_dict = {
    k: pd.read_excel(excel_path, sheet_name = k, engine = "xlrd")
    for k in sheet_list
}


for k, df in df_dict.items():
    df["origin_sheet_name"] = k


df_stacked = pd.concat([v for k,v in df_dict.items()], axis = 0, ignore_index = True, sort = False)

Hope it helps,

Alex

by
Hi Alex,
many thanks for your answer, however the code does not seem to work.
Even though I specified the list of sheets for option "sheet_name=", it always reads only the first excel sheet.  Even when I specifically select one particular sheet (other than first) it gives me the same result - first sheet.
Do you know what could be the issue here?

Romana
Hi, I would advise to test with various options of pandas read_excel() for instance the engine.
1,258 questions
1,287 answers
1,462 comments
11,809 users

┬ęDataiku 2012-2018 - Privacy Policy