Intelligent System for Automatic Selection of Machine Learning Algorithms in the Social Sciences

Status: Active

Project Duration: May 2021 - April 2026

Acronym: SIMON

Code: UIP-2020-02-6312

Project Leader: Assoc. Prof. Dr. Dijana Oreški

Institutional Affiliation: University of Zagreb, Faculty of Organization and Informatics

Team Members (FOI):

Assoc. Prof. Dr. Nikola Kadoić

Assoc. Prof. Dr. Igor Pihir

Assoc. Prof. Dr. Irena Konceki

Assoc. Prof. Dr. Goran Hajdin

Marija Pokos Lukinec, MA in Educational Informatics and Italian Philology

Dunja Višnjić, MSc in Economics

External Members:

Assoc. Prof. Dr. Maja Rožman (University of Maribor, Slovenia)

Assist. Prof. Dr. Maja Gligora Marković (University of Rijeka)

Assist. Prof. Dr. Milica Maričić (University of Belgrade)

Introduction and Motivation

The amount of data generated every day has grown enormously. We leave digital traces through social networks, education systems, economic activities, and business processes: every social media view, “like,” or store purchase is recorded and stored. These data often hide valuable insights, such as predictors of student learning outcomes, patterns in consumer behavior, or signals of economic trends. Artificial intelligence and machine learning algorithms are the tools used to uncover them.

In AI and machine learning, there are numerous algorithms, each with its strengths and weaknesses. Some perform better on one type of data, while others suit different types of problems. Even experts spend a great deal of time testing “which algorithm works best” on a specific dataset.

The SIMON project aims to simplify and automate this process. The main idea is to develop an intelligent system that, based on dataset characteristics, can recommend or select the most suitable machine learning algorithm for that dataset. This saves time, reduces the risk of poor algorithm choice, and enables a broader community of researchers to use advanced algorithms without needing deep expertise in machine learning.

The project is funded by the Croatian Science Foundation (HRZZ), which selected SIMON for funding under its “Installation Research Projects” program (UIP-2020-02).

Project Goals

The SIMON project has several interrelated goals focused on developing an automated system, understanding algorithm behavior, and contributing to the scientific community and broader society.

Goals:

Explore dataset characteristics
The first step was to analyze the features of various datasets (from education, economics, and social sciences) - the number of variables, relationships among them, missing data, distribution differences, etc. These are the “dataset attributes” the system uses for decision-making.
Analyze algorithm behavior
For numerous dataset-algorithm combinations (e.g., regression, decision trees, neural networks), the project tests how each algorithm performs on different data types, which parameters are important, and where algorithms fail. This builds “knowledge about algorithms.”
Develop meta-model(s) for recommendation
A meta-model learns from the experience of earlier experiments: whereas a base model learns from the data itself, the meta-model learns from dataset descriptions paired with the performance of the models trained on those datasets. Given the description of a new dataset, it recommends the algorithm most likely to suit it. Approaches include algorithm ranking, multi-criteria methods, and performance prediction.
Implement an intelligent system
The final output is a software prototype where a user can input data and receive algorithm recommendations. The system considers dataset characteristics and recommends the most appropriate algorithm or configuration.
Evaluate and validate the system
The recommendations will be tested on real datasets from the social sciences and business domains to verify functionality, assess accuracy, and identify areas for improvement.
Develop research capacity and infrastructure
Since the UIP program supports establishing new research groups, the goal includes building a team in the field of meta-learning and its applications in social sciences. The project funds equipment, young researchers (students, PhD candidates), and infrastructure.
Dissemination and long-term sustainability
Project results have been published in over 40 scientific papers (some in top-ranked journals) and presented at international conferences. The team plans to use results in education, expand collaborations, and ensure long-term sustainability through new national and international projects.
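
As a rough illustration of the meta-model goal listed above, the sketch below ranks algorithms for a new dataset by averaging the scores they achieved on the most similar previously seen datasets. The nearest-neighbour approach, the meta-feature names, the function rank_algorithms, and every number in META_BASE are assumptions made up for this example; they are not the project's published method or results.

```python
import math

# Hypothetical meta-knowledge base: each entry pairs a dataset's meta-features
# with the accuracy each algorithm reached on it. All values are invented.
META_BASE = [
    {"features": {"n_instances": 200, "n_attributes": 10, "missing_ratio": 0.05},
     "scores": {"decision_tree": 0.71, "logistic_regression": 0.78, "neural_network": 0.69}},
    {"features": {"n_instances": 5000, "n_attributes": 40, "missing_ratio": 0.00},
     "scores": {"decision_tree": 0.80, "logistic_regression": 0.76, "neural_network": 0.88}},
    {"features": {"n_instances": 350, "n_attributes": 8, "missing_ratio": 0.10},
     "scores": {"decision_tree": 0.74, "logistic_regression": 0.77, "neural_network": 0.66}},
]


def _scale(value, low, high):
    """Min-max scale a value to the range observed in the meta-knowledge base."""
    return 0.0 if high == low else (value - low) / (high - low)


def rank_algorithms(new_features, meta_base=META_BASE, k=2):
    """Rank algorithms by their mean score on the k most similar known datasets."""
    names = sorted(new_features)
    bounds = {
        n: (min(e["features"][n] for e in meta_base),
            max(e["features"][n] for e in meta_base))
        for n in names
    }

    def distance(entry):
        # Euclidean distance in min-max-scaled meta-feature space.
        return math.sqrt(sum(
            (_scale(new_features[n], *bounds[n])
             - _scale(entry["features"][n], *bounds[n])) ** 2
            for n in names
        ))

    neighbours = sorted(meta_base, key=distance)[:k]
    algorithms = list(neighbours[0]["scores"])
    mean_scores = {
        a: sum(e["scores"][a] for e in neighbours) / len(neighbours)
        for a in algorithms
    }
    # Best-scoring algorithm first.
    return sorted(mean_scores, key=mean_scores.get, reverse=True)


small_survey = {"n_instances": 300, "n_attributes": 9, "missing_ratio": 0.08}
print(rank_algorithms(small_survey))
# → ['logistic_regression', 'decision_tree', 'neural_network']
```

A ranking rather than a single answer lets the user fall back to the second choice when the first is impractical. A real meta-model would be trained on many more datasets and could equally be a performance predictor or a multi-criteria method, as the goal description notes.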

Duration and Resources

The SIMON project runs from May 2021 to April 2026.

Total budget: €146,742.78

The project is part of HRZZ’s broader strategy to support independent research careers and the formation of new research groups in Croatia.

Key Activities (Methodology)

To achieve its objectives, the project follows these steps:

Data collection and selection
Identify representative datasets from social sciences (education, economics, sociology) and develop a repository.
Feature extraction
For each dataset, compute metadata - e.g., number of attributes, variability, correlations, instances, proportion of non-numeric variables, distributions, etc. These describe the dataset for meta-models.
Experimental testing of algorithms
Test numerous machine learning algorithms (e.g., regressions, decision trees, ensembles, neural networks) with varying parameters on each dataset. Record performance metrics (accuracy, precision, error rates).
Result analysis and meta-knowledge modeling
Develop models linking dataset characteristics to algorithm performance - e.g., “If the dataset has many highly correlated attributes, algorithm X will likely outperform algorithm Y.”
Recommendation system development
Combine the meta-model and software infrastructure into an algorithm recommender system. Users upload datasets, and the system analyzes and returns recommendations.
Validation on independent data
Test reliability using datasets not included in training to ensure generalization to new scenarios.
Improvements and iterations
Iteratively refine models and interface based on observed errors, uncertainties, and limitations.
Dissemination and community engagement
Publish results, host workshops, and potentially open the system for other researchers to test and integrate.
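
The feature extraction step in the list above can be sketched in a few lines. This is a minimal example under stated assumptions: the function name extract_meta_features, the column-dictionary input format, the use of None for missing values, and the particular meta-features chosen are all hypothetical, picked to mirror the examples named in the text (number of attributes and instances, proportion of non-numeric variables, missing data, correlations).

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def extract_meta_features(columns):
    """Describe a dataset given as {column name: list of values}; None = missing."""
    n_attributes = len(columns)
    n_instances = len(next(iter(columns.values())))
    missing = sum(v is None for col in columns.values() for v in col)

    def is_numeric(col):
        present = [v for v in col if v is not None]
        return bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )

    numeric = [name for name, col in columns.items() if is_numeric(col)]

    # Mean absolute pairwise correlation over numeric columns,
    # using only the rows where both values are present.
    correlations = []
    for i, a in enumerate(numeric):
        for b in numeric[i + 1:]:
            pairs = [(x, y) for x, y in zip(columns[a], columns[b])
                     if x is not None and y is not None]
            if len(pairs) >= 2:
                correlations.append(abs(pearson([p[0] for p in pairs],
                                                [p[1] for p in pairs])))

    return {
        "n_instances": n_instances,
        "n_attributes": n_attributes,
        "non_numeric_ratio": (n_attributes - len(numeric)) / n_attributes,
        "missing_ratio": missing / (n_attributes * n_instances),
        "mean_abs_correlation": (sum(correlations) / len(correlations)
                                 if correlations else 0.0),
    }


# Invented toy survey data, for illustration only.
survey = {
    "age": [21, 34, 45, None],
    "income": [1200.0, 2100.0, 3000.0, 2500.0],
    "region": ["north", "south", "south", None],
}
features = extract_meta_features(survey)
print(features)
```

The resulting dictionary is the kind of dataset description a meta-model could take as input; a production system would likely add distribution statistics and handle edge cases such as empty datasets.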

Expected Contributions and Applications

The SIMON project brings multiple benefits - both scientific and practical:

Simplified research workflow:
Particularly valuable in social sciences and humanities research, where users often lack deep expertise in AI and ML.
Time and resource savings:
Instead of manually testing dozens of algorithms and parameters, the system provides quick recommendations.
Better understanding of algorithm behavior:
Meta-learning insights help explain why certain algorithms perform better on specific data types.
Promotion of interdisciplinary work:
Bridges computer science (machine and meta-learning) with social sciences, economics, and education.
Foundation for future systems:
The developed system can be extended and applied beyond the social sciences.
Development of research infrastructure in Croatia:
Builds a new research group, trains young scientists (e.g., Marija Pokos Lukinec and Dunja Višnjić), and strengthens national capacity for AI-driven research.

Limitations and Challenges

Every project faces challenges, and SIMON is no exception:

Diversity of datasets:
Social science data are often heterogeneous, with varying sizes, unbalanced distributions, missing data, and mixed variable types (e.g., text and categorical).
Overfitting risk:
Meta-models might learn recommendations that perform well on training data but generalize poorly.
Interpretability limits:
While the system can recommend algorithms, users still need to interpret and validate results.
Resources and scalability:
Testing numerous algorithms across datasets requires substantial computational resources.
Maintenance and sustainability:
Post-funding, the system’s continuity will rely on new projects, collaborations, or institutional support.