If the data does not adequately represent the context of the problem to be solved, the resulting models and systems can introduce many biases and complications.
Imagine that in your business you want to know if a customer is about to leave in the short term, but you don’t have the information you need to find out. Many organizations share this problem, and it is critical to provide them with an adequate solution.
A common data science strategy to address this need is to build and train predictive models using the available data. The main challenge in any predictive model is not only having the necessary data, but also having data of sufficient quality.
Using data that poorly represents the context or problem can lead to illogical or absurd predictions. A clear example of potential misrepresentation is ImageNet, a widely used image database for training image recognition systems. Approximately 45% of its photographs were taken in the US, and the images predominantly represent North Americans and Europeans, while China accounts for about 1%.
Keep in mind that a model or algorithm always adapts to the training data it is given, so it is essential for the data science team to evaluate whether the quality and representativeness of the data are adequate and sufficient.
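One way to make such an evaluation concrete is to compare each group's share of the training data with the share you would expect in the population the model will serve. The sketch below is only an illustration: the function names, the `min_ratio` threshold, and the toy numbers are assumptions made for this example, not figures from any real dataset.

```python
from collections import Counter


def representation_gaps(samples, reference_shares):
    """Compare each group's share of the dataset with a reference share.

    samples: iterable of group labels, one per record (e.g. country of origin).
    reference_shares: dict mapping group -> expected share (as a fraction).
    Returns a dict mapping group -> (observed_share, reference_share, ratio).
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("samples is empty")
    report = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        ratio = observed / expected if expected > 0 else float("inf")
        report[group] = (observed, expected, ratio)
    return report


def underrepresented(report, min_ratio=0.5):
    """Return groups whose observed share is below min_ratio times the reference."""
    return sorted(g for g, (_, _, ratio) in report.items() if ratio < min_ratio)


# Toy, made-up numbers purely for illustration (hypothetical groups A, B, C).
labels = ["A"] * 45 + ["B"] * 54 + ["C"] * 1
reference = {"A": 0.30, "B": 0.50, "C": 0.20}

report = representation_gaps(labels, reference)
print(underrepresented(report))
```

In this toy case only group C is flagged, because it makes up 1% of the records against an expected 20%. The right threshold and the right reference shares depend on the problem, so they should be chosen with people who know the business context.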
Returning to the problem of identifying customers who are likely to leave a company, a common gap in the data is the lack of records on the channels the company uses to contact its customers and whether they were actually contacted. In general, having such contact information improves the models' predictions by 5%, but many companies do not collect it and believe that it is not necessary.
The biases that surface when evaluating the results of predictive models are mostly due to the rush to put an analytical solution (in this case, the trained model) into production, not to a lack of technology; in fact, the algorithms are widely available.
More precisely, the main failure is skipping the evaluation of data quality before training the model and putting it into production. This is why it is so important to have a team of data science professionals who understand that bad decisions made by poorly trained models will have a direct and negative impact on the business.
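A basic quality check before training can be very small. The following sketch, written for the customer-churn example, reports how often required fields are missing and how large the minority class is. The field names (`tenure_months`, `contact_channel`, `churned`), the thresholds, and the sample records are all hypothetical; real checks would be tailored to the company's own data.

```python
def data_quality_report(rows, required_fields, label_field):
    """Summarize basic quality signals for a list of dict records."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to evaluate")
    # Share of records where each required field is absent, None, or empty.
    missing_rate = {
        field: sum(1 for r in rows if r.get(field) in (None, "")) / n
        for field in required_fields
    }
    # Count how many records fall in each class of the target label.
    label_counts = {}
    for r in rows:
        label = r.get(label_field)
        label_counts[label] = label_counts.get(label, 0) + 1
    # With a single class there is nothing to predict, so report 0.0.
    if len(label_counts) < 2:
        minority_share = 0.0
    else:
        minority_share = min(label_counts.values()) / n
    return {"rows": n, "missing_rate": missing_rate, "minority_share": minority_share}


def quality_issues(report, max_missing=0.2, min_minority_share=0.05):
    """List human-readable problems; an empty list means the checks passed."""
    issues = []
    for field, rate in report["missing_rate"].items():
        if rate > max_missing:
            issues.append(f"{field}: {rate:.0%} missing")
    if report["minority_share"] < min_minority_share:
        issues.append(f"minority class share {report['minority_share']:.0%}")
    return issues


# Hypothetical customer records, for illustration only.
customers = [
    {"tenure_months": 12, "contact_channel": "email", "churned": 0},
    {"tenure_months": 3, "contact_channel": None, "churned": 1},
    {"tenure_months": 27, "contact_channel": None, "churned": 0},
    {"tenure_months": 8, "contact_channel": "", "churned": 0},
]

quality = data_quality_report(customers, ["tenure_months", "contact_channel"], "churned")
issues = quality_issues(quality)
print(issues)
```

Here the check flags the contact channel, which is missing in three of the four records. Treating a non-empty list of issues as a reason to pause training is one possible design; it keeps the decision about data quality explicit instead of leaving it to be discovered after the model is in production.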