

I have a service that continuously emits data. You start receiving data as soon as you open a TCP connection, and it does not stop until you terminate the connection.

I'd like to develop a custom plugin to process that data in Dataiku. How can I do that, given that the data never ends?

Will the "build" log overload the server?




We are loading data from a flight metasearch service. They expose a data stream that we consume by polling a TCP connection (https://github.com/gbrian/Flightmate-Stream). We plan to use Dataiku to parse, sanitize, ... the data and then drop it into Hadoop, in addition to applying the corresponding analysis and lab ;)

@alexander Hope this helps

Hi Gustavo, this is an interesting topic. To best advise you, I would like to better understand the context. What type of data do you receive? Do you have an estimate of the volume? What technologies do you have in mind for the processing and the storage? Cheers, Alex
Any news on this Gustavo?

1 Answer

Hi Gustavo,

For this type of use case, we would advise performing the data ingestion outside of Dataiku DSS, with a streaming engine such as Flume or Kafka.

Once the data is ingested, you can perform data transformation and machine learning modelling in DSS in micro-batches, using partitions to avoid recomputing over the whole dataset: https://doc.dataiku.com/dss/latest/partitions/index.html
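
To make the idea concrete, here is a minimal sketch of ingestion outside DSS that lands the stream in hourly files, which could then be processed one partition at a time. It is a simplified stand-in written with plain Python sockets, not Flume or Kafka. It assumes the service sends newline-delimited text records; the function names and the one-file-per-hour naming are hypothetical, and mapping those files to a DSS partitioning scheme is not shown.

```python
import socket
from datetime import datetime, timezone
from pathlib import Path


def partition_key(ts: datetime) -> str:
    """Return an hourly partition identifier such as '2024-01-31-13'."""
    return ts.strftime("%Y-%m-%d-%H")


def stream_lines(host: str, port: int):
    """Yield newline-delimited records from a TCP connection until it closes.

    Assumption: the service emits one text record per line.
    """
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("r", encoding="utf-8", errors="replace") as reader:
            for line in reader:
                yield line.rstrip("\n")


def write_partitioned(records, out_dir, now=lambda: datetime.now(timezone.utc)):
    """Append each record to the file for the hour in which it arrived.

    The file is reopened for every record to keep the sketch short;
    a real ingester would buffer writes.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for record in records:
        path = out_dir / f"{partition_key(now())}.txt"
        with path.open("a", encoding="utf-8") as f:
            f.write(record + "\n")
```

With a hypothetical endpoint, `write_partitioned(stream_lines("stream.example.com", 9000), "incoming")` would run until the connection closes. A downstream job would then only need to read the files for hours that have already ended, so each build works on a bounded slice of the stream.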

