Many scientific advances require generating structured objects rather than classifying existing ones, especially in the biological sciences. Molecules, proteins, brain networks, and other biological systems are naturally represented as graphs, where nodes denote atoms or neurons and edges denote mechanical, neuronal, or chemical interactions. At this scale, graphs are not arbitrary: they are composed of recurring structural patterns that constrain the validity and function of the graph. These recurring patterns, which we refer to as motifs, are small subgraphs with well-defined structure. In molecular chemistry, one such motif is the benzene ring, a six-node cycle with strict connectivity constraints that appears frequently within larger molecules. In proteins, motifs take the form of secondary-structure elements such as α-helices and β-sheets, which act as higher-level building blocks. In diffusion tensor imaging of the brain, they appear as multi-region connectivity patterns in which sub-networks of cortical regions jointly participate in a structural pathway. Preserving these motifs is essential for generating valid, interpretable, and robust graphs.
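To make the motif notion concrete, the benzene example can be sketched as a subgraph-counting check: a six-node ring is any six-node subset whose induced subgraph forms a single cycle. The helper names and the naphthalene example below are illustrative assumptions (not code from this project), using only the Python standard library.

```python
from itertools import combinations

def is_induced_ring(adj, nodes):
    """Check whether `nodes` induce a single simple cycle in adjacency dict `adj`."""
    node_set = set(nodes)
    # In a ring, every node has exactly two neighbours inside the subset...
    if any(len(adj[v] & node_set) != 2 for v in nodes):
        return False
    # ...and the induced subgraph is connected (one cycle, not two disjoint ones).
    seen, stack = set(), [nodes[0]]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] & node_set)
    return seen == node_set

def count_ring_motifs(edges, k=6):
    """Count k-node subsets whose induced subgraph is a k-cycle (k=6 ~ benzene)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for nodes in combinations(adj, k)
               if is_induced_ring(adj, list(nodes)))

# Naphthalene abstracted as a graph: two fused six-node rings sharing edge (4, 5).
naphthalene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
               (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
print(count_ring_motifs(naphthalene))  # → 2
```

A generative model that is unaware of this constraint can easily emit five- or seven-node "almost rings"; counting induced cycles exposes exactly this failure mode.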
This project explores a simple idea: instead of asking generative models to recover motif-level structure implicitly, we either encode motifs directly into the representation used for graph generation or treat them as a regularization term in the training loss, both independent of the internal processing of the sequence transformer. Cover page image courtesy of the Protein Data Bank and this paper.
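The two routes can be illustrated with a toy sketch that uses the triangle as a stand-in motif: the same motif count either augments the graph's feature vector (the representation route) or drives an auxiliary penalty added to the training loss (the regularization route). The motif choice, feature layout, and L1 penalty here are illustrative assumptions, not the project's final design.

```python
from itertools import combinations

def triangle_count(edges):
    """Count triangle motifs (3-cycles) in an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def graph_features(edges):
    """Representation route: append the motif count to plain graph statistics."""
    nodes = {v for e in edges for v in e}
    return [len(nodes), len(edges), triangle_count(edges)]

def motif_penalty(generated_edges, target_edges):
    """Loss route: auxiliary term penalizing motif-count mismatch."""
    return abs(triangle_count(generated_edges) - triangle_count(target_edges))

triangle = [(0, 1), (1, 2), (2, 0)]
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(graph_features(triangle))         # → [3, 3, 1]
print(motif_penalty(square, triangle))  # → 1
```

Either way, the motif signal stays outside the transformer itself: it enters only through the input features or the loss, matching the design constraint stated above.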