Compound MOA Prediction#

Stable

Predicting compound mechanisms of action from cellular imaging data

This project explores self-supervised learning approaches for predicting the mechanism of action (MOA) of pharmacological compounds from microscopy images in the BBBC021 dataset. We implement and compare three representation learning methods.

Overview#

The BBBC021 dataset contains fluorescence microscopy images of human cells treated with various compounds at different concentrations. Each image captures three cellular components: actin filaments, microtubules, and nuclei. Our goal is to learn meaningful representations that can predict a compound’s mechanism of action based solely on the induced morphological changes.

Approaches#

1. Baseline ResNet-50

  • Standard ImageNet-pretrained ResNet-50 for feature extraction

  • Provides a baseline for comparison with self-supervised methods

2. SimCLR (Self-Supervised Contrastive Learning)

  • Vanilla SimCLR: Creates positive pairs from augmented versions of the same image

  • Weakly-Supervised SimCLR: Uses compound labels to form positive pairs from different images of the same compound, while treating images of different compounds as negatives (see the loss sketch after this list)

3. DINO (Self-Distillation)

  • Weakly-supervised adaptation using compound labels

  • Student-teacher architecture with exponential moving average (EMA) teacher updates (sketched after this list)
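
To make the weak supervision concrete, here is a minimal sketch of an NT-Xent-style loss in which embeddings of images that share a compound label are treated as positives, with SupCon-style averaging over positives. The function name and details are illustrative assumptions, not the repository's exact loss:

import torch
import torch.nn.functional as F

def weakly_supervised_nt_xent(embeddings, compound_ids, temperature=0.5):
    # embeddings: (N, D) projection-head outputs; compound_ids: (N,) integer labels
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim.masked_fill_(self_mask, float("-inf"))         # never pair a sample with itself
    pos_mask = (compound_ids.unsqueeze(0) == compound_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-likelihood over each anchor's same-compound positives
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

The DINO teacher is not trained by backpropagation; its weights track an exponential moving average of the student's, roughly:

@torch.no_grad()
def ema_update(student, teacher, momentum=0.996):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)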

Key Features#

  • Two SSL Implementations: SimCLR and DINO variants for predicting compound MOAs

  • Typical Variation Normalization (TVN): Removes systematic noise by normalizing against DMSO control samples (sketched after this list)

  • Comprehensive Evaluation: 1-nearest neighbor classification with multiple distance metrics (sketched after this list)

  • Visualization Pipeline: t-SNE and UMAP embeddings for qualitative analysis

  • Jupyter Notebooks: For easy experimentation and exploration

  • Polished Documentation: Sphinx-based project docs
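
As a rough illustration of the TVN step, the sketch below (our reading of the general recipe, not the exact code in experiments/) centers embeddings on the DMSO control mean and whitens them along the controls' principal axes. The 1-NN evaluation sketch that follows is likewise a hedged stand-in for evaluation/evaluator.py:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_tvn(dmso_features):
    # Fit normalization parameters on DMSO control embeddings only
    mu = dmso_features.mean(axis=0)
    cov = np.cov(dmso_features - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    whiten = eigvecs / np.sqrt(eigvals + 1e-6)   # scale each control axis to unit variance
    return mu, whiten

def apply_tvn(features, mu, whiten):
    # Express every embedding relative to typical control variation
    return (features - mu) @ whiten

def nn_moa_accuracy(features, moa_labels, metric="cosine"):
    # Leave-one-out 1-NN: predict each sample's MOA from its nearest
    # neighbor other than itself, under the chosen distance metric
    nn = NearestNeighbors(n_neighbors=2, metric=metric).fit(features)
    idx = nn.kneighbors(features, return_distance=False)[:, 1]
    moa_labels = np.asarray(moa_labels)
    return float((moa_labels[idx] == moa_labels).mean())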

Dataset#

The BBBC021 dataset consists of:

  • 103 distinct compounds with known mechanisms of action

  • Multiple concentrations per compound (typically 8 concentrations)

  • 4 replicate images per treatment condition

  • 3-channel fluorescence images (1024×1280 pixels)

  • 12 distinct mechanism-of-action (MOA) categories

Installation#

To get started:

  1. Clone the repository:

git clone https://github.com/lukagerlach/CompoundProfiling.git
cd CompoundProfiling
  2. Set up Python environment:

python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows
  3. Install dependencies:

pip install -r requirements.txt
  4. Download and preprocess the dataset:

If you are on RAMSES, you can use our default data location, /scratch/cv-course2025/group8/bbbc021. Otherwise, run:

python data/pybbbc_loader.py

Usage#

Notebooks#

For a convenient way to explore our data and results, and to train your own models, we created several notebooks; you can find them in the notebooks folder. Alternatively, you can use the scripts described below:

Training Models#

Train vanilla SimCLR:

python training/simclr_vanilla_train.py

Train weakly-supervised SimCLR:

python training/simclr_ws_train.py

Train DINO model:

python training/wsdino_resnet_train.py

Feature Extraction#

Extract features using a trained model:

python evaluation/extractor.py
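
Conceptually, extraction amounts to running preprocessed images through the trained backbone with its classification head removed. A self-contained sketch (checkpoint path and input sizes are hypothetical):

import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled features
# backbone.load_state_dict(torch.load("checkpoints/simclr_ws.pth"))  # hypothetical path
backbone.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)    # stand-in for preprocessed BBBC021 crops
    features = backbone(batch)             # shape (4, 2048)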

Evaluation#

Evaluate model performance:

python evaluation/evaluator.py

Visualization#

Generate t-SNE/UMAP plots:

python evaluation/visualize_embeddings.py
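
For instance, a minimal t-SNE plot of extracted embeddings colored by MOA could look like this (random arrays stand in for real features and labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(200, 2048)       # placeholder for extracted embeddings
moa = np.random.randint(0, 12, size=200)   # placeholder for the 12 MOA labels

coords = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=moa, cmap="tab20", s=8)
plt.title("t-SNE of compound embeddings (illustrative)")
plt.show()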

Project Structure#

CompoundProfiling/
├── data/                    # Data loading and preprocessing
├── docs/                    # All things Sphinx
├── models/                  # Model architectures
├── training/                # Training scripts
├── evaluation/              # Evaluation and visualization tools
├── experiments/             # Experimental utilities (TVN, etc.)
└── notebooks/               # Jupyter notebooks for exploration

Results#

Results are available in our Project Documentation.

Dependencies#

Core requirements:

  • PyTorch

  • torchvision

  • scikit-learn

  • numpy

  • pandas

  • matplotlib

  • tqdm

  • pybbbc (for dataset access)

See requirements.txt for complete dependency list.

License#

This project is licensed under the MIT License - see the LICENSE file for details.
