Foundational Models for Heterogeneous Biological Image Data
This use case focuses on developing a foundation AI model specifically for heterogeneous biological imaging data. The model will enable more effective categorisation, search, and reuse of vast, diverse datasets stored in bioimaging archives, thereby enhancing data accessibility and value for users and RI operators.

Challenge
Biological imaging data spanning many experimental conditions, organisms, and modalities is growing rapidly. Organising, categorising, and providing access to such diverse datasets requires computational models. Foundation models trained on this mixed data are expected to generalise well and to enable a range of downstream applications, from similarity search to image-based measurements. This use case undertakes the large-scale training of such a foundation model.
Target
Develop and train a foundation AI model capable of generating high-quality embeddings from diverse biological image data to support categorisation, searchability, and other downstream tasks.
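To illustrate how embeddings could support searchability, the following is a minimal sketch of a similarity search over precomputed image embeddings. It is not the project's implementation. The image identifiers, the three-dimensional toy vectors, and the function names `cosine_similarity` and `search` are all assumptions made for illustration; real embeddings would be produced by the trained model and would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def search(query, index, top_k=3):
    """Return the top_k (image_id, score) pairs, most similar first."""
    scored = [
        (image_id, cosine_similarity(query, embedding))
        for image_id, embedding in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings standing in for model outputs.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}

# Query with the embedding of an image that resembles img_a.
results = search([1.0, 0.0, 0.0], index, top_k=2)
for image_id, score in results:
    print(image_id, round(score, 3))
```

A brute-force scan like this is only practical for small collections. At archive scale, an approximate nearest-neighbour index would likely be needed, which is outside the scope of this sketch.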
Development Steps
- Select, curate, and standardise a large dataset covering multiple biological imaging modalities and experimental variables from the EMBL-EBI archive.
- Create task-specific evaluation datasets through a combination of selective labelling and use of existing annotated data.
- Fine-tune pre-existing natural image segmentation models on this biological data to create relevant benchmarks.
- Train a new, large-scale biological imaging foundation model optimised for data discoverability, automated organisation, and reusable outputs for other scientific analyses.
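One way a labelled evaluation dataset could be used to assess embedding quality is a leave-one-out nearest-neighbour check: for each labelled image, find its closest neighbour in embedding space and test whether the two share a label. The sketch below assumes this evaluation approach, which the use case description does not specify. The labels, the two-dimensional toy vectors, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leave_one_out_accuracy(embeddings, labels):
    """Fraction of items whose nearest other item shares their label."""
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have the same length")
    if len(embeddings) < 2:
        raise ValueError("need at least two items to find a neighbour")

    correct = 0
    for i, query in enumerate(embeddings):
        # Nearest neighbour among all items except the query itself.
        nearest = min(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: euclidean(query, embeddings[j]),
        )
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / len(embeddings)


# Hypothetical toy embeddings and labels standing in for an evaluation set.
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.5, 0.4]]
labels = ["nucleus", "nucleus", "membrane", "membrane", "membrane"]

accuracy = leave_one_out_accuracy(embeddings, labels)
print(f"leave-one-out 1-NN accuracy: {accuracy:.2f}")
```

Because it needs no additional training, a check like this can be run on both the fine-tuned benchmark models and the new foundation model to compare them on the same evaluation data.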
Relevance / Target Stakeholders
- Operators of Research Infrastructures (RIs) managing biological imaging archives
- Researchers and users of imaging data requiring enhanced data discovery and analysis tools
Impact
The resulting model will significantly improve how RIs manage, organise, and offer access to their biological image archives. This will:
- Enhance data discoverability and reuse
- Increase the scientific value of existing archives
- Reduce manual curation effort and support scalable data services in life sciences research