IDEATE

Can deep learning work in the real world?: A data-centric perspective (IDEATE) - [2023 - 2026]

Key Words: Deep Learning, Learning with Noisy Labels, Data-Centric Deep Learning, Uncertainty Modelling, Self-supervised Learning

Name of the project: Can deep learning work in the real world?: A data-centric perspective (IDEATE)

Principal Investigator (PI, Co-PI,..): Petia Radeva & Ricardo Marques

Funding entity: Ministry of Science, Innovation and Universities, State Research Agency, Spain

Duration: 01/09/2023 - 30/08/2026

The effectiveness and efficiency of ML/DL systems depend on the nature of the data and the models’ capacity. Standard ML pipelines are built around specific training tasks characterised by a heuristic model specification, an available training dataset, and an independent and identically distributed (i.i.d.) evaluation procedure. These properties make it difficult to apply the models in real conditions: spurious correlations, unwanted biases, and opaque predictors are now widespread in trained models.

Hypothesis: We hypothesise that improving dataset quality often yields better performance than tuning model hyperparameters alone. Collecting data, cleaning it, and making it suitable for ML training can take up to 90% of the time. With the recent rapid development of DL, it has become clear that data-centric ML/DL is one of the next challenges. Since data will never be fully clean in real scenarios, DL models should be prepared to cope with imperfect data during training by using robust training techniques.

General objectives of the IDEATE project, based on our research perspective:

O1: Develop novel and robust data-centric DL algorithms in which the data is the key to improving ML models, addressing the challenges of imperfect real data.
O2: Demonstrate the achievements of the novel techniques in real scenarios. Real-world scenarios often involve data or label scarcity; missing, unbalanced, and noisy data; and noisy labels. We successfully apply our algorithms in domains such as food data analysis, medical image analysis, omics data analysis, and industrial applications.