A question we often ask ourselves is how many machine learning (ML) opportunities are prematurely abandoned because the data are judged inadequate for the required performance. A data science project typically starts with benchmarking existing techniques and ML pipelines, followed by fine-tuning and customization until the desired performance is reached. At some stage, a decision must be made on whether that performance is attainable at all. Effort should be balanced between fine-tuning ML pipelines and models that start with poor performance and implementing the custom solution the problem really needs. The best strategy is often guided by knowledge of how the ML models learn, considered alongside the experimental findings on the data. Besides experience, intuition also plays a role in settling on a conclusion or a new direction.
In the early years of ML adoption, a large proportion of machine learning models never reached deployment or degraded quickly once deployed. Indeed, after auditing several deployed ML models and approaches, we found that the largest problems were the lack of critical data pre-processing and of application-appropriate model evaluation metrics. This was most severe in multi-model applications where the same approach was assumed to work in all settings and no automated performance monitoring was in place. Relatively little effort was spent on customizing the ML pipeline to achieve possible performance gains. Matters have since improved, with growing amounts of shared knowledge and tooling around best practices as they evolve. However, even applying today's more advanced out-of-the-box ML pipelines and models still carries risks. If they become the default way to tick the performance box, the application may be deprived of a more reliable and higher-performing model. An even greater risk is prematurely concluding that a well-performing machine learning solution is not possible in the applications that need it most.
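To make the evaluation-metric point concrete, the toy example below uses synthetic numbers (not drawn from any audited application) to show how plain accuracy can hide a model that never detects the minority class, which is exactly the kind of failure that goes unnoticed without application-appropriate metrics and monitoring.

```python
# Toy illustration with assumed, synthetic labels: on imbalanced data, a
# "model" that always predicts the majority class still scores high accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = np.array([0] * 95 + [1] * 5)   # 5% positive class
y_pred = np.zeros_like(y_true)          # always predicts the majority class

print(accuracy_score(y_true, y_pred))            # 0.95 - looks deceptively good
print(balanced_accuracy_score(y_true, y_pred))   # 0.50 - no better than chance
print(f1_score(y_true, y_pred, zero_division=0)) # 0.00 - minority class missed
```

Metrics such as balanced accuracy or F1 expose the problem immediately, whereas accuracy alone would pass the performance check.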
As an example, a recent project of ours was to decide on the viability of ML models trained on patient treatment-response data for a disease with a high mortality rate. Previous statistical and ML analyses could not reach the necessary predictive performance, as most patterns found in the discovery cohort did not hold in the validation cohort. Indeed, this was the smallest number of samples we had ever had available for training ML models (a maximum of 18 samples per treatment type). The customization process began with over-sampling and feature subset selection phases tailored to the problem and data. These gradually increased the performance of the tested models, but severe over-fitting persisted. It was a hard decision to carry on, given all the contrary results so far and the commonly made claim that ML does not work with so little data. However, having observed partial predictive performance across modelling settings, the only way to reach a conclusion was to implement a custom predictive mechanism. Our new approach unified diverse models and settings in a multi-level ensemble modelling strategy. Its predictive ability was well above the required performance for all three treatment types. This work became our record of success: the smallest dataset for which we achieved deployable models and the largest performance gain over existing ensemble models.
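The paragraph above describes the approach only at a high level. As a rough illustration, the sketch below (not the actual pipeline from this project) shows the general pattern of nesting over-sampling and feature subset selection inside cross-validation and combining diverse base learners in a stacked, multi-level ensemble. The scikit-learn and imbalanced-learn components, the synthetic 18-sample dataset, and all parameter choices are assumptions made purely for illustration.

```python
# A minimal sketch, assuming scikit-learn and imbalanced-learn, of over-sampling
# and feature selection nested inside cross-validation around a stacked
# (multi-level) ensemble of diverse base learners. Illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # allows resamplers as pipeline steps

# Hypothetical stand-in for one treatment cohort: 18 samples, many features.
X, y = make_classification(n_samples=18, n_features=200, n_informative=10,
                           weights=[0.7, 0.3], random_state=0)

# Level-0 learners deliberately chosen to be diverse; the level-1 meta-learner
# combines their out-of-fold predictions.
base_learners = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]
ensemble = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=3)

# Over-sampling and feature selection sit inside the pipeline so they are
# re-fitted on each training split, keeping information from the held-out
# sample out of the preprocessing.
pipeline = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),
    ("select", SelectKBest(f_classif, k=10)),
    ("ensemble", ensemble),
])

# Leave-one-out cross-validation is one reasonable choice at this sample size.
scores = cross_val_score(pipeline, X, y, cv=LeaveOneOut(), scoring="accuracy")
print(f"Leave-one-out accuracy: {scores.mean():.2f}")
```

Keeping the resampling and selection steps inside the pipeline means they are refitted on every training split, which is one way to limit the leakage that otherwise inflates performance estimates on cohorts this small.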