Hi, I need to load a large number of CSV files from the server filesystem. The second row in each file contains a date field (one CSV = one date). I'd like to copy that date into a new column in the dataset. Appreciate any advice on how to do this. Thank you
@SimonM Are the files you are referencing on the DSS Design Node? Are they stored on some other server? Are you trying to load them into a particular DSS project as managed datasets?
Is this a one-time process? Or are you needing this to occur as part of a flow?
I don't have first-hand experience, but I'm happy to try to recreate your use case locally to figure it out (if I can).
Thanks Tim. Where the files are stored is not really important. The CSVs are coming in daily but I also have a backlog of them. Each CSV has a 3 row header. The middle row contains some text including the date. I'd like to extract this date and place it in a new column as each CSV is loaded. I then merge all the CSVs into a single table for analysis. Initially I loaded the data using files in folder.
I did solve this pretty easily using MS PowerQuery but I can't figure out how to do it in DSS. I'm trying to use Visual Recipes.
Hi @SimonM. Can you share a sample of your CSV files? That could be useful to help you.
Do you have something like this?
# First line of header bla bla bla
# Second line 2020-10-02 more info blablabal
# Third header line
column1 column2 column3 ... columnN
Could you confirm? (and if possible attach a small CSV example!)
Cheers
I.
I don't have a sample to hand at the moment, but the format is very simple, as described below.
Title,
Interval Detail For: 01/06/2020,
"Customer Details",
Col1,Col2, Col3
Time, Decimal, Alpha
Time, Decimal, Alpha
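For the date line alone, a quick regex sketch (using the `Interval Detail For: 01/06/2020` row from the sample above; note that `01/06/2020` is ambiguous, so whether it is day-first or month-first is something only you can confirm):

```python
import re
from datetime import datetime

header_line = 'Interval Detail For: 01/06/2020,'  # second row of the sample above

match = re.search(r'(\d{2}/\d{2}/\d{4})', header_line)
if match:
    # '%d/%m/%Y' assumes day-first; swap to '%m/%d/%Y' if your files are month-first
    file_date = datetime.strptime(match.group(1), '%d/%m/%Y').date()
    print(file_date)  # 2020-06-01
```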
Cheers,
@SimonM, I am honestly not sure how to do this in a non-hacky way using visual recipes - though I am not saying it can't be done (I just don't know).
In the meantime I have written some logic that can combine a bunch of files (of the format you specified) from a single folder data source and concatenate those in a single step.
Flow:
Data inside sample_csvs:
Recipe output (processing + stacking the two CSVs):
Here is the code inside the recipe that I used to actually perform the logic. If this looks like it may work for you, I'm happy to help you tweak the python code for your specific need.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
#------------------Two helper functions ----------------------#
#-------------------------------------------------------------#
import re
import os
from datetime import datetime
def get_date(string, date_pattern):
    """Extract the first date matching `date_pattern` from a string."""
    match = re.search(date_pattern, string)
    if match is None:
        raise ValueError("No date found in line: %r" % string)
    return datetime.strptime(match.group(1), '%m/%d/%Y').date()

def process_file(file, header_rows, date_row, date_pattern):
    """
    Process a single CSV. Skips `header_rows` rows and creates a dataframe
    from the remaining data. `date_pattern` is used to extract the first date
    from row `date_row` and insert it as a new column. Returns the modified
    dataframe.
    """
    with open(file, 'r') as f:
        for i, line in enumerate(f):
            if i == date_row - 1:
                date = get_date(line, date_pattern)
                break
    df = pd.read_csv(file, skiprows=header_rows)
    df['date'] = date
    return df
#-------------------------------------------------------------#
#-------------------------------------------------------------#

# Read recipe inputs
sample_csvs = dataiku.Folder("b5vBISdg")
folder_path = sample_csvs.get_path()  # path to this data folder
files = sample_csvs.list_paths_in_partition()  # file names in this folder
file_paths = [os.path.join(folder_path, file[1:]) for file in files]  # paths to the actual CSVs (a list of 2 here)

dfs = []  # list to store the result of each processed file
for path in file_paths:
    # Process each CSV and append it to the list
    dfs.append(process_file(path, header_rows=3, date_row=2, date_pattern=r'(\d{2}/\d{2}/\d{4})'))

stacked_dataset_df = pd.concat(dfs)  # stack the processed CSVs

# Write recipe outputs
stacked_dataset = dataiku.Dataset("stacked_dataset")
stacked_dataset.write_with_schema(stacked_dataset_df)
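If it helps to sanity-check the recipe logic outside DSS, here is a minimal standalone run under the same assumptions (a throwaway CSV in the sample's format; the file contents, column names, and month-first `%m/%d/%Y` date format are illustrative, so adjust to your real files):

```python
import os
import re
import pandas as pd
from datetime import datetime
from tempfile import NamedTemporaryFile

# A throwaway CSV matching the 3-row-header layout described above
sample = (
    'Title,\n'
    'Interval Detail For: 01/06/2020,\n'
    '"Customer Details",\n'
    'Col1,Col2,Col3\n'
    '09:00,1.5,A\n'
    '09:30,2.0,B\n'
)

with NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write(sample)
    path = f.name

# Same steps as the recipe: grab the date from row 2, then skip 3 header rows
with open(path) as f:
    second_line = f.readlines()[1]
date = datetime.strptime(
    re.search(r'(\d{2}/\d{2}/\d{4})', second_line).group(1), '%m/%d/%Y'
).date()

df = pd.read_csv(path, skiprows=3)
df['date'] = date
os.remove(path)  # clean up the throwaway file
print(df)
```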
Nice solution @tim-wright! I'm not sure either whether this would be possible with a visual recipe, but in this case I don't think we will find anything that matches the efficiency of the Python solution.
Unless there is some plugin that provides this functionality as a visual step!
Thanks @tim-wright. It's been useful to understand the limitations of the Visual Recipes. I'm specifically trying to do this without code but appreciate your efforts.
I believe one way to do this with Visual Recipes might be to:
I'm pretty sure this will work but it's a painful way of solving what should be a simple requirement.
Cheers
@SimonM Ah, I missed that part of your post. If you absolutely need to use the visual recipes, then yes, I think the approach you laid out would probably work.
If this is something you plan to do frequently or with many datasets, I'd argue the time investment to develop a plugin (which would be minimal) is worth it, so that users can simply use a UI to accomplish the task without knowing code or getting wrapped up in the logic you mention above.