Research | Andrew G. Duncan

Phylogenetic augmentation for supervised learning

Homologous DNA sequences contain conserved regulatory information and can be used to augment training data for deep learning models. I developed a data augmentation technique that uses homologous sequences from multi-species genome alignments to improve the performance of models trained on genomic sequences.
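
The idea can be illustrated with a minimal sketch, which is not the implementation in the linked repository. It assumes each training example already carries a list of homologous sequences extracted from a multi-species alignment, and that a homolog inherits the label of the sequence it replaces. The function names `phylo_augment` and `augmented_batch`, the `rate` parameter, and the toy sequences are all hypothetical.

```python
import random


def phylo_augment(sequence, homologs, rate=0.5, rng=random):
    """Return the original sequence, or with probability `rate` a random homolog.

    `homologs` is a list of sequences from other species aligned to `sequence`.
    If the list is empty, the original sequence is always returned.
    """
    if homologs and rng.random() < rate:
        return rng.choice(homologs)
    return sequence


def augmented_batch(examples, rate=0.5, seed=0):
    """Build (sequence, label) pairs, swapping in homologs at random.

    Assumption: the label of the original sequence is kept for its homolog.
    """
    rng = random.Random(seed)
    return [
        (phylo_augment(e["sequence"], e["homologs"], rate, rng), e["label"])
        for e in examples
    ]


# Toy data: one labeled sequence with two made-up homologs.
examples = [
    {"sequence": "ACGTACGT", "label": 1, "homologs": ["ACGTTCGT", "ACGAACGT"]},
]

batch = augmented_batch(examples, rate=0.5, seed=0)
```

Calling `augmented_batch` once per epoch with a different seed would expose the model to a different mix of species each time, while evaluation would normally use only the original sequences.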

GitHub Paper Data

Generative enhancer models (EnhancAR)

We developed a model that generates enhancer sequences given a set of homologous enhancer sequences. By prompting the model with enhancer sequences from different species, we can generate a diverse set of novel enhancer sequences that are predicted to have similar regulatory functions. We use this model to design synthetic enhancers with desired regulatory functions, and also to shorten enhancers while preserving their predicted regulatory activity.

GitHub Paper Data EnhancAR Model EnhancAR-Sorted Model

Deciphering the cis-regulatory code in mouse embryonic stem cells

Mouse embryonic stem cells (ESCs) are a model system for studying the cis-regulatory code. By training supervised and unsupervised deep learning models on data from mouse ESCs, we can understand how regulatory information is encoded in the genome.