I have a large Dataiku dataset in the Flow that I read into a pandas dataframe using the get_dataframe() method. The problem arises when I have to read four such datasets into memory, which slows down processing. I don't need the entire dataset; only a subset has to be read into the pandas dataframe. Is there a way to do this?
Currently, I read the entire dataset into pandas and then slice the dataframe, which I would like to avoid.
Thanks!
Hi @SuhailS7,
Yes, you can read the dataset in chunks of a fixed maximum number of rows with the code below.
from dataiku import Dataset

mydataset = Dataset("myname")
for df in mydataset.iter_dataframes(chunksize=10000):
    # df is a dataframe of at most 10K rows.
    pass  # process the chunk here
You can refer to the following doc for more info - https://doc.dataiku.com/dss/latest/python-api/datasets-data.html
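As a rough sketch of how the chunks could be used to keep only a subset in memory: each chunk is filtered as it arrives and only the matching rows are retained. The helper name filter_chunks, the column name "status" and the value "error" are made up for illustration, and the two small dataframes stand in for what iter_dataframes() would yield.

```python
import pandas as pd


def filter_chunks(chunks, column, value):
    """Keep only rows where `column` equals `value`, one chunk at a time."""
    matches = []
    for chunk in chunks:
        # Only the filtered slice of each chunk is retained.
        matches.append(chunk[chunk[column] == value])
    if not matches:
        # No chunks at all: return an empty dataframe.
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)


# Stand-in for mydataset.iter_dataframes(chunksize=...): two small chunks.
demo_chunks = [
    pd.DataFrame({"status": ["ok", "error"], "id": [1, 2]}),
    pd.DataFrame({"status": ["error", "ok"], "id": [3, 4]}),
]
errors = filter_chunks(demo_chunks, "status", "error")
```

In a recipe, mydataset.iter_dataframes(chunksize=10000) would presumably be passed in place of demo_chunks. Every row is still read once, but only one chunk plus the matching rows sit in memory at any time.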
Best,
Madhuleena
Thanks @nmadhu20, but what if I need only specific rows for my processing? For example, I need only the rows that contain a specific value. If the Dataiku dataset has 1,000,000 records, there may be only 500 rows that satisfy the condition I am looking for, and I don't want to read all 1,000,000 rows into a pandas dataframe for this purpose.
Currently I perform two operations: loading the dataset into memory and then slicing the dataframe. I was hoping it would be possible to do this in one operation.
To my knowledge, you can use the iter_rows() method of the Dataset object to iterate over the rows without loading the entire dataset into memory, so you would not need get_dataframe(). You can then apply the required filter, but the processing happens one row at a time.
iter_rows(sampling='head', sampling_column=None, limit=None, ratio=None, log_every=-1, timeout=30, columns=None)
from dataiku import Dataset

mydataset = Dataset("myname")
for row in mydataset.iter_rows():
    # required filter on a row basis
    pass
This yields each row as a dict-like object. More info at this link - https://doc.dataiku.com/dss/latest/python-api/datasets-reference.html
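A minimal sketch of the row-level filter, assuming each row supports row[column] lookups and can be converted with dict(). The helper name matching_rows, the column "status" and the value "error" are hypothetical, and plain dicts stand in for the rows that iter_rows() would yield.

```python
from itertools import islice


def matching_rows(rows, column, value):
    """Yield a plain dict for each row whose `column` equals `value`."""
    for row in rows:
        if row[column] == value:
            yield dict(row)


# Stand-in for mydataset.iter_rows(): plain dicts instead of dataset rows.
demo_rows = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "error"},
    {"id": 3, "status": "error"},
]
# Stop after at most 500 matches (here there are only 2).
selected = list(islice(matching_rows(demo_rows, "status", "error"), 500))
```

The resulting list of dicts can then be turned into a small dataframe with pandas.DataFrame(selected), so only the matching rows ever become a dataframe. The columns argument shown in the iter_rows() signature above may also help by restricting which columns are read.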
Hope this helps!
Madhuleena