0 votes

I want to understand and evaluate the cost of building a data science platform with the capabilities listed below:

Data Ingestion - File uploads from filesystem (FTP, SFTP)
    Cloud (S3)
    HDFS
    Oracle
    Plugin support
Data Versioning - Ability to manage versions of data
File Format Support - CSV
    Text
    JSON
    Excel
Automatic Schema Detection  
Data Wrangling - Visual, interactive, and collaborative data cleaning and data imputation
Data Preparation - Apply data transformations (visually)
    Variable type detection
    Encoding
    Data grouping and aggregation
Data Pipeline - Ability to visually create and manage data pipelines
    Automating and scheduling data pipelines
Machine Learning - Comparing models
    Feature engineering
    Model versioning
Distributed Processing  
Data Mining - Interactive and collaborative notebooks for data exploration
Data Visualization - Many built-in charts
    Ability to integrate JavaScript libraries (d3, Leaflet, etc.)
    Dashboards for executives
Design to Production - Expose your model as a REST API
    Run multiple versions of the same model for testing

Can someone advise which tools/frameworks we would need to add on top of Apache Spark and Zeppelin to get these capabilities?

asked by dsdev

1 Answer

+1 vote

Two possible solutions here:

  1. Buy Dataiku DSS. If you are interested, contact our sales team. The main advantage is that the cost is limited to the price of our license; no additional tools, frameworks, or development costs are required.
  2. Do not buy Dataiku DSS. You can refer to this year's Gartner Magic Quadrant for inspiration on which software to copy. After that, you will simply need a few years of work, a team of engineers, and a little bit of funding.
answered by