Compound MOA Prediction#

Stable

Predicting compound mechanisms of action from cellular imaging data

This project explores self-supervised learning approaches for predicting the mechanism of action (MOA) of pharmacological compounds from microscopy images in the BBBC021 dataset. We implement and compare three representation learning methods.

Overview#

The BBBC021 dataset contains fluorescence microscopy images of human cells treated with various compounds at different concentrations. Each image captures three cellular components: actin filaments, microtubules, and nuclei. Our goal is to learn meaningful representations that can predict a compound’s mechanism of action based solely on the induced morphological changes.

Approaches#

1. Baseline ResNet-50

  • Standard ImageNet-pretrained ResNet-50 for feature extraction

  • Provides a baseline for comparison with self-supervised methods

2. SimCLR (Self-Supervised Contrastive Learning)

  • Vanilla SimCLR: Creates positive pairs from augmented versions of the same image

  • Weakly-Supervised SimCLR: Uses compound labels to form positive pairs from different images of the same compound, while treating images of different compounds as negatives (see the loss sketch after this list)

3. DINO (Self-Distillation)

  • Weakly-supervised adaptation using compound labels

  • Student-teacher architecture with exponential moving average (EMA) teacher updates (sketched after this list)
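
To make the weak supervision concrete, here is a minimal sketch of an NT-Xent-style loss in which embeddings of images that share a compound label are treated as positives, with SupCon-style averaging over positives. The function name and details are illustrative assumptions, not the repository's exact loss:

import torch
import torch.nn.functional as F

def weakly_supervised_nt_xent(embeddings, compound_ids, temperature=0.5):
    # embeddings: (N, D) projection-head outputs; compound_ids: (N,) integer labels
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim.masked_fill_(self_mask, float("-inf"))         # never pair a sample with itself
    pos_mask = (compound_ids.unsqueeze(0) == compound_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-likelihood over each anchor's same-compound positives
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

The DINO teacher is not trained by backpropagation; its weights track an exponential moving average of the student's, roughly:

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)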

Key Features#

  • Two SSL Implementations: SimCLR and DINO variants for predicting compound MOAs

  • Typical Variation Normalization (TVN): Removes systematic noise by normalizing against DMSO control samples (sketched after this list)

  • Comprehensive Evaluation: 1-nearest neighbor classification with multiple distance metrics (sketched after this list)

  • Visualization Pipeline: t-SNE and UMAP embeddings for qualitative analysis

  • Jupyter Notebooks: For easy experimentation and exploration

  • Polished Documentation: Sphinx-based project docs
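
As a rough illustration of the TVN step, the sketch below (our reading of the general recipe, not the exact code in experiments/) centers embeddings on the DMSO control mean and whitens them along the controls' principal axes. The 1-NN evaluation sketch that follows is likewise a hedged stand-in for evaluation/evaluator.py:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_tvn(dmso_features):
    # Fit normalization parameters on DMSO control embeddings only
    mu = dmso_features.mean(axis=0)
    cov = np.cov(dmso_features - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    whiten = eigvecs / np.sqrt(eigvals + 1e-6)   # scale each control axis to unit variance
    return mu, whiten

def apply_tvn(features, mu, whiten):
    # Express every embedding relative to typical control variation
    return (features - mu) @ whiten

def nn_moa_accuracy(features, moa_labels, metric="cosine"):
    # Leave-one-out 1-NN: predict each sample's MOA from its nearest
    # neighbor other than itself, under the chosen distance metric
    nn = NearestNeighbors(n_neighbors=2, metric=metric).fit(features)
    idx = nn.kneighbors(features, return_distance=False)[:, 1]
    moa_labels = np.asarray(moa_labels)
    return float((moa_labels[idx] == moa_labels).mean())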

Dataset#

The BBBC021 dataset consists of:

  • 103 distinct compounds with known mechanisms of action

  • Multiple concentrations per compound (typically 8 concentrations)

  • 4 replicate images per treatment condition

  • 3-channel fluorescence images (1024×1280 pixels)

  • 12 distinct mechanism-of-action (MOA) categories

Installation#

To get started:

  1. Clone the repository:

git clone https://github.com/lukagerlach/CompoundProfiling.git
cd CompoundProfiling
  2. Set up Python environment:

python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows
  3. Install dependencies:

pip install -r requirements.txt
  4. Download and preprocess the dataset:

If you are on RAMSES, you can use our default data location, /scratch/cv-course2025/group8/bbbc021. Otherwise, run:

python data/pybbbc_loader.py

Usage#

Notebooks#

For a convenient way to explore our data and results, and to train your own models, we created several notebooks; you can find them in the notebooks folder. Alternatively, you can use the scripts described below:

Training Models#

Train vanilla SimCLR:

python training/simclr_vanilla_train.py

Train weakly-supervised SimCLR:

python training/simclr_ws_train.py

Train DINO model:

python training/wsdino_resnet_train.py

Feature Extraction#

Extract features using a trained model:

python evaluation/extractor.py
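
Conceptually, extraction amounts to running preprocessed images through the trained backbone with its classification head removed. A self-contained sketch (checkpoint path and input sizes are hypothetical):

import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
# backbone.load_state_dict(torch.load("checkpoints/simclr_ws.pth"))  # hypothetical path
backbone.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)    # stand-in for preprocessed BBBC021 crops
    features = backbone(batch)             # shape (4, 2048)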

Evaluation#

Evaluate model performance:

python evaluation/evaluator.py

Visualization#

Generate t-SNE/UMAP plots:

python evaluation/visualize_embeddings.py
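
For instance, a minimal t-SNE plot of extracted embeddings colored by MOA could look like this (random arrays stand in for real features and labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(200, 2048)       # placeholder for extracted embeddings
moa = np.random.randint(0, 12, size=200)   # placeholder for the 12 MOA labels

coords = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=moa, cmap="tab20", s=8)
plt.title("t-SNE of compound embeddings (illustrative)")
plt.show()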

Project Structure#

CompoundProfiling/
├── data/                    # Data loading and preprocessing
├── docs/                    # All things Sphinx
├── models/                  # Model architectures
├── training/                # Training scripts
├── evaluation/              # Evaluation and visualization tools
├── experiments/             # Experimental utilities (TVN, etc.)
└── notebooks/               # Jupyter notebooks for exploration

Results#

Results are available in our Project Documentation.

Dependencies#

Core requirements:

  • PyTorch

  • torchvision

  • scikit-learn

  • numpy

  • pandas

  • matplotlib

  • tqdm

  • pybbbc (for dataset access)

See requirements.txt for complete dependency list.

License#

This project is licensed under the MIT License - see the LICENSE file for details.
