I'm a Dataiku and ML beginner, so please excuse my (maybe) simple question.

I have a dataset with data on internet companies. Originally, the target-market information (column: markets) came as values separated by ">" and ",". To the right there are some extra columns, e.g. number of employees, financing, etc.

My goal is to create a model with "activity" as the target variable (it has 3 values: operating, acquired and non-operating), e.g. to identify the markets where companies are most likely to "survive", or the most dangerous ones (those leading to "non-operating").

My original file had 1 record per company (approx. 1,000 companies), with a single "markets" column. I started by splitting it, first with ">" and then with "," as separators. Finally (after some cleaning and merging) I got a dataset with many records per company, as displayed below, with distinct "market__" features.
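For readers who want to reproduce the splitting step outside Dataiku, here is a minimal sketch in plain Python. The sample company names and market values are made up for illustration, and the column names `name`, `activity` and `markets` are assumptions rather than the real schema.

```python
import re

def split_markets(raw):
    """Split a raw markets string on '>' and ',' and trim whitespace."""
    parts = re.split(r"[>,]", raw)
    return [p.strip() for p in parts if p.strip()]

def explode(records):
    """Turn one record per company into one record per (company, market)."""
    rows = []
    for rec in records:
        for market in split_markets(rec["markets"]):
            rows.append({
                "name": rec["name"],
                "activity": rec["activity"],
                "market": market,
            })
    return rows

# Hypothetical sample data, not taken from the real dataset.
companies = [
    {"name": "Acme", "activity": "operating", "markets": "Software > Mobile, Games"},
    {"name": "Beta", "activity": "acquired", "markets": "E-commerce"},
]
long_rows = explode(companies)
```

This gives the "many records per company" layout described above: each company appears once per market it is tagged with.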

My questions:

1. Is it OK for an ML model to keep the data on a single company in the form of many records (see picture below)?

2. Is there any other data preparation procedure (folding, splitting, transformation, etc.) you would recommend?

I would greatly appreciate your help,

Many thanks in advance, 


asked by

1 Answer

Applying machine learning requires understanding the business problem behind your prediction task, so you need to adapt your methodology to the problem at hand. In your specific case, I would recommend clarifying what your goal is:

1. Predicting whether companies will be acquired in the future, or will still be operating? In this case, you need to define a time window and shift your target accordingly. You may also have to aggregate the data to reduce the number of observations per company.

2. Assigning a current status to companies for which you have business information, but whose status (acquired, operating or non-operating) you do not know?

If you are in the VC industry, I guess goal 1 could be of interest to you. In that case, be careful with how you handle your temporal features.
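As an illustration of the aggregation mentioned in goal 1, here is a minimal sketch that folds several rows per company back into a single row with one 0/1 column per market. It is plain Python; the sample rows and the column names `name`, `activity` and `market` are assumptions for the example, and the `market__` prefix simply mirrors the feature names in the question.

```python
def to_wide(long_rows, id_col="name", target_col="activity", market_col="market"):
    """Aggregate (company, market) rows into one row per company.

    Each distinct market becomes a 0/1 indicator column named 'market__<value>'.
    """
    markets = sorted({r[market_col] for r in long_rows})
    by_company = {}
    for r in long_rows:
        # Create the company's row on first sight, with all indicators at 0.
        row = by_company.setdefault(
            r[id_col],
            {
                id_col: r[id_col],
                target_col: r[target_col],
                **{"market__" + m: 0 for m in markets},
            },
        )
        row["market__" + r[market_col]] = 1
    return list(by_company.values())

# Hypothetical sample data in the "many records per company" layout.
long_rows = [
    {"name": "Acme", "activity": "operating", "market": "Software"},
    {"name": "Acme", "activity": "operating", "market": "Games"},
    {"name": "Beta", "activity": "acquired", "market": "Software"},
]
wide_rows = to_wide(long_rows)
```

With this layout each company is exactly one observation, and the model sees all of its markets at once instead of one market per row.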

Good luck with this interesting project!
answered by
Dear Alexandre,

Thank you very much for the answer. In my case I have no temporal data, so my goal is 2 (assigning a current status): given some characteristics of a company, I would like to predict its most probable status.

My main concern is data preparation for that scenario. In the original file I had 1 line per company, with all its characteristics in the "markets" column (those ">" and "," separated values). To prepare the data I applied several splits and ended up with the state shown in the picture.

My question, as an ML beginner: is it OK (for modelling) to have multiple records per entity? Especially since in the model options I can only choose the target variable (in my case "activity") and have no option for an "entity" variable (for me, the Id or name)? Or would some other data preparation method be more appropriate?

I would really appreciate your advice,

Many thanks in advance :)

It is OK to have multiple lines in your training set for a given entity, but avoid using an identifier column as a feature in your model. It could also be helpful to compute derived features based on the previous history of the given entity, provided you are able to define a notion of temporal order.
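To make the point about the identifier column concrete, here is a minimal sketch in plain Python that separates features from the target while leaving the identifier out. The column names `name` and `activity` and the sample rows are assumptions for illustration only.

```python
ID_COL = "name"          # assumed identifier column
TARGET_COL = "activity"  # target variable from the question

def split_features_target(rows, id_col=ID_COL, target_col=TARGET_COL):
    """Return (features, targets), dropping the identifier from the features."""
    features = [
        {k: v for k, v in r.items() if k not in (id_col, target_col)}
        for r in rows
    ]
    targets = [r[target_col] for r in rows]
    return features, targets

# Hypothetical sample rows: several lines for the same company are fine.
rows = [
    {"name": "Acme", "activity": "operating", "market": "Software", "employees": 40},
    {"name": "Acme", "activity": "operating", "market": "Games", "employees": 40},
    {"name": "Beta", "activity": "acquired", "market": "Software", "employees": 12},
]
X, y = split_features_target(rows)
```

The identifier stays available in `rows` for joining or inspecting results later; it just never reaches the model as an input.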
Thank you :)