Beyond Flat Walks: Compositional Abstraction for Autoregressive Molecular Generation

Halicioglu Data Science Institute, UC San Diego

Abstract

Autoregressive models for molecular graph generation typically operate on flattened sequences of atoms and bonds, discarding the rich multi-scale structure inherent to molecules. We introduce MOSAIC (Multi-scale Organization via Structural Abstraction In Composition), a framework that lifts autoregressive generation from flat token walks to compositional, hierarchy-aware sequences. MOSAIC provides a unified three-stage pipeline: (1) hierarchical coarsening that recursively groups atoms into motif-like clusters using graph-theoretic methods (spectral clustering, hierarchical agglomerative clustering, and motif-aware variants), (2) structured tokenization that serializes the resulting multi-level hierarchy into sequences that explicitly encode parent-child relationships, partition boundaries, and edge connectivity at every level, and (3) autoregressive generation with a standard Transformer decoder that learns to produce these structured sequences. We evaluate MOSAIC on the MOSES and COCONUT molecular benchmarks, comparing four tokenization schemes of increasing hierarchical expressiveness. Our experiments show that hierarchy-aware tokenizations improve chemical validity and structural diversity over flat baselines while enabling control over generated substructures. MOSAIC provides a principled, modular foundation for structure-aware molecular generation.

Introduction

Discovering new molecules — for drugs, materials, or chemical tools — is a slow and expensive process. Recent advances in generative AI offer a way to accelerate this: train a model on known molecules and let it propose entirely new ones. But molecules aren’t just sequences of characters. They have rich internal structure: rings, branches, and recurring building blocks that determine their chemical properties.

Most existing approaches flatten a molecular graph into a linear sequence of atoms and bonds, then generate one token at a time — much like a language model that can only produce text one letter at a time. It technically works, but the model has to rediscover higher-level patterns (like common ring structures) from scratch. MOSAIC is more like generating at the word level: it learns to compose molecules from meaningful building blocks rather than individual atoms.

Pipeline Overview

MOSAIC operates through a three-stage pipeline. First, molecular graphs are recursively coarsened into multi-level hierarchies where atoms are grouped into structurally meaningful clusters. Second, these hierarchies are serialized into structured token sequences that preserve parent-child relationships and inter-cluster connectivity. Third, a Transformer decoder is trained to autoregressively generate these structured sequences, enabling the model to compose molecules from coarse structure down to fine-grained atomic detail.

MOSAIC pipeline overview. A molecular graph (camptothecin) is hierarchically coarsened into multi-level clusters, tokenized into a structured sequence encoding the hierarchy, and generated autoregressively by a Transformer decoder.
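In code, the pipeline reduces to three composable steps. The sketch below is illustrative rather than MOSAIC's actual API; `coarsen`, `tokenize`, and the helper name are hypothetical stand-ins for the three stages.

```python
# Illustrative composition of the three MOSAIC stages; the function
# names are hypothetical stand-ins, not the actual MOSAIC API.
from rdkit import Chem

def build_training_sequences(smiles_corpus, coarsen, tokenize):
    """Stages 1-2: turn each molecule into a structured token sequence."""
    sequences = []
    for smi in smiles_corpus:
        mol = Chem.MolFromSmiles(smi)          # parse SMILES into a molecular graph
        if mol is None:                        # skip unparsable entries
            continue
        hierarchy = coarsen(mol)               # stage 1: multi-level clustering
        sequences.append(tokenize(hierarchy))  # stage 2: structured serialization
    return sequences
```

Stage 3 then trains a standard causal Transformer decoder on the resulting sequences, exactly as one would train a language model on text.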

Coarsening Strategies

MOSAIC supports multiple graph coarsening strategies that recursively partition molecular graphs into hierarchical clusters. Each strategy offers different trade-offs between preserving chemical motifs and computational efficiency.

Coarsening strategies. Different approaches to recursively partitioning a molecular graph. Motif-aware variants (right) preserve chemically meaningful substructures like rings and functional groups as intact clusters.
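To make the recursion concrete, here is a minimal sketch of spectral coarsening over a NetworkX graph. It illustrates the general recipe under stated assumptions and is not MOSAIC's implementation; motif-aware variants would first contract rings and functional groups (e.g., from RDKit's ring perception) into super-nodes so the partitioner cannot split them.

```python
# A minimal sketch of recursive spectral coarsening on a NetworkX graph;
# hyperparameters and structure are assumptions, not MOSAIC's actual code.
import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering

def spectral_coarsen(G, n_clusters=2, min_size=3, depth=0, max_depth=3):
    """Recursively partition G, returning a nested cluster hierarchy."""
    if depth == max_depth or G.number_of_nodes() <= min_size:
        return {"atoms": sorted(G.nodes), "children": []}
    # Adjacency as a precomputed affinity; tiny diagonal keeps it well-behaved.
    A = nx.to_numpy_array(G) + 1e-6 * np.eye(G.number_of_nodes())
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed",
        assign_labels="kmeans", random_state=0,
    ).fit_predict(A)
    nodes = list(G.nodes)
    children = []
    for k in range(n_clusters):
        members = [nodes[i] for i, lab in enumerate(labels) if lab == k]
        if members:
            children.append(spectral_coarsen(G.subgraph(members), n_clusters,
                                             min_size, depth + 1, max_depth))
    return {"atoms": sorted(G.nodes), "children": children}
```

Running this on a graph built from a molecule's atoms and bonds yields the nested cluster tree that the tokenizers below consume.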

Tokenization Schemes

The coarsened hierarchies are serialized into token sequences using one of four tokenization schemes, each capturing increasing levels of hierarchical information:

Tokenization schemes. From left to right: increasing hierarchical expressiveness. SENT provides a flat baseline, while H-SENT, HDT, and HDTC progressively encode more structural information about the molecular hierarchy.
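As a concrete, hedged illustration of what "encoding the hierarchy" means, the sketch below serializes the nested cluster tree from the coarsening step with explicit open/close and boundary tokens. The token names (`<enter>`, `<exit>`, `<sep>`) are hypothetical; MOSAIC's actual vocabularies, and the extra edge-connectivity tokens used by HDT and HDTC, are richer than this.

```python
# A hedged sketch of hierarchy serialization in the spirit of the
# hierarchical tokenizers; token names are illustrative, not MOSAIC's
# actual vocabulary.
def serialize(node, tokens=None):
    """Depth-first walk emitting explicit cluster-boundary tokens."""
    if tokens is None:
        tokens = []
    tokens.append("<enter>")                        # open a cluster (parent scope)
    if not node["children"]:                        # leaf: emit atom tokens
        tokens.extend(f"atom:{a}" for a in node["atoms"])
    for child in node["children"]:                  # recurse into sub-clusters
        serialize(child, tokens)
        tokens.append("<sep>")                      # partition boundary
    tokens.append("<exit>")                         # close the cluster
    return tokens
```

A flat SENT-style sequence, by contrast, would simply emit atom and bond tokens along a single walk, with no scope tokens for the model to exploit.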

Results on COCONUT

We evaluate unconditional generation on the COCONUT dataset of complex natural products (~5K molecules, 30–100 heavy atoms, ≥4 rings). All models use a GPT-2 backbone (12 layers, hidden size 768, 12 attention heads) trained with identical hyperparameters. We generate 500 molecules per model and compare against the full reference dataset.

Table 1. Unconditional generation quality on COCONUT (500 generated molecules, full reference). ↑ = higher is better, ↓ = lower is better.
| Metric | SENT | H-SENT MC | H-SENT SC | H-SENT HAC | HDT MC | HDT SC | HDT HAC | HDTC |
|---|---|---|---|---|---|---|---|---|
| Validity ↑ | 0.618 | 0.884 | 0.396 | 0.128 | 0.892 | 0.068 | 0.144 | 0.918 |
| Uniqueness ↑ | 1.000 | 0.876 | 0.934 | 1.000 | 0.946 | 1.000 | 1.000 | 0.858 |
| Novelty ↑ | 1.000 | 0.756 | 0.914 | 1.000 | 0.859 | 1.000 | 1.000 | 0.802 |
| FCD ↓ | 6.94 | 6.36 | 10.56 | 20.94 | 4.50 | 24.93 | 17.17 | 4.74 |
| SNN ↑ | 0.269 | 0.989 | 0.707 | 0.241 | 0.991 | 0.242 | 0.240 | 0.990 |
| Frag ↑ | 0.944 | 0.948 | 0.926 | 0.800 | 0.965 | 0.754 | 0.885 | 0.969 |
| Scaff ↑ | 0.000 | 0.139 | 0.099 | 0.000 | 0.111 | 0.000 | 0.000 | 0.150 |
| IntDiv ↑ | 0.887 | 0.876 | 0.883 | 0.879 | 0.881 | 0.865 | 0.888 | 0.882 |
| PGD ↓ | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

HDTC achieves the highest validity (91.8%) and fragment similarity (0.969), while HDT MC is a close second (89.2% validity). The flat-walk baseline SENT drops to 61.8% validity — a 30-point gap — showing that flat tokenization struggles with complex molecular graphs. Motif-community (MC) coarsening consistently outperforms both spectral (SC) and agglomerative (HAC) coarsening across all hierarchical tokenizers, confirming the importance of preserving known chemical motifs during coarsening.
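Validity, uniqueness, and novelty follow their standard MOSES-style definitions; a minimal RDKit sketch (assuming generated and training molecules are given as SMILES strings) makes them precise:

```python
# Minimal sketch of the three basic metrics under their standard
# MOSES-style definitions; assumes SMILES inputs.
from rdkit import Chem

def basic_metrics(generated, train_set):
    canon = []
    for smi in generated:
        mol = Chem.MolFromSmiles(smi)            # None if chemically invalid
        if mol is not None:
            canon.append(Chem.MolToSmiles(mol))  # canonical form for comparison
    validity = len(canon) / len(generated)
    unique = set(canon)
    uniqueness = len(unique) / len(canon) if canon else 0.0
    novelty = len(unique - set(train_set)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty
```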

Table 2. Motif-level fidelity and realistic proportions on COCONUT (500 generated molecules, full reference). All metrics ↓ except Motif Rate ↑.
| Metric | SENT | H-SENT MC | H-SENT SC | H-SENT HAC | HDT MC | HDT SC | HDT HAC | HDTC |
|---|---|---|---|---|---|---|---|---|
| FG MMD ↓ | 0.004 | 0.003 | 0.006 | 0.018 | 0.002 | 0.031 | 0.015 | 0.003 |
| SMARTS MMD ↓ | 0.005 | 0.006 | 0.014 | 0.039 | 0.002 | 0.066 | 0.028 | 0.004 |
| Ring MMD ↓ | 0.012 | 0.009 | 0.010 | 0.059 | 0.003 | 0.061 | 0.032 | 0.005 |
| BRICS MMD ↓ | 0.046 | 0.051 | 0.052 | 0.087 | 0.037 | 0.097 | 0.065 | 0.037 |
| Motif Rate ↑ | 0.553 | 0.643 | 0.556 | 0.438 | 0.612 | 0.176 | 0.389 | 0.704 |
| Subst. TV ↓ | 0.067 | 0.023 | 0.010 | 0.275 | 0.081 | 0.278 | 0.217 | 0.052 |
| Subst. KL ↓ | 0.018 | 0.002 | 0.000 | 0.362 | 0.016 | 2.509 | 0.179 | 0.008 |
| FG TV ↓ | 0.043 | 0.048 | 0.069 | 0.109 | 0.032 | 0.204 | 0.098 | 0.028 |
| FG KL ↓ | 0.093 | 0.116 | 0.105 | 0.942 | 0.090 | 2.480 | 0.143 | 0.086 |

HDT MC and HDTC dominate across all four MMD metrics. HDTC leads on motif rate (70.4% of valid molecules contain at least one recognized motif) and functional-group fidelity. SC and HAC variants lag significantly, reinforcing that the motif-community and compositional approaches excel at preserving complex substructure distributions in large molecules.
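The MMD rows measure the distance between generated and reference distributions of substructure features (functional groups, SMARTS patterns, rings, BRICS fragments). As a hedged reference point, the unbiased squared-MMD estimator with a Gaussian kernel looks like this; the featurization and kernel bandwidth here are assumptions, not necessarily the paper's exact choices:

```python
# Unbiased squared-MMD estimator with a Gaussian (RBF) kernel; the
# bandwidth sigma and the count-vector featurization are assumptions.
import numpy as np

def gaussian_mmd2(X, Y, sigma=1.0):
    """Squared MMD between samples X (n, d) and Y (m, d)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return np.exp(-d2 / (2 * sigma ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    # Exclude diagonal self-similarities for the unbiased estimate.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())
```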

Generation gallery. Reference molecules from MOSES (left column, gray background) paired with the closest-sized valid generation from each of six model variants. Each row targets a different molecular size, demonstrating that hierarchical tokenizers produce structurally diverse, chemically plausible molecules across the atom-count range.

Generation Demo

The following animation shows MOSAIC’s autoregressive generation process, where the model builds a molecule token-by-token, progressively assembling the hierarchical structure from coarse partitions down to individual atoms and bonds.

Autoregressive generation demo. MOSAIC generates a molecule by sequentially predicting tokens that encode hierarchical structure, atom types, and bond connectivity.
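Under the hood, the demo is an ordinary ancestral-sampling loop. The sketch below assumes a Hugging Face-style GPT-2 decoder (`model(seq).logits`) trained on MOSAIC sequences; detokenizing the output back into a molecular graph is a separate step not shown here.

```python
# A hedged sketch of the autoregressive sampling loop, assuming a
# Hugging Face-style causal decoder trained on MOSAIC sequences.
import torch

@torch.no_grad()
def sample(model, bos_id, eos_id, max_len=512, temperature=1.0):
    """Generate one structured token sequence, one token at a time."""
    seq = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(seq).logits[:, -1, :]           # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # sample one token
        seq = torch.cat([seq, nxt], dim=1)
        if nxt.item() == eos_id:                       # hierarchy fully closed
            break
    return seq.squeeze(0).tolist()
```

Because the tokens carry hierarchy markers, sampling can also be constrained at generation time, e.g., by fixing coarse-level tokens to control the generated substructures.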

BibTeX Citation

@article{bian2025mosaic,
  title  = {Beyond Flat Walks: Compositional Abstraction for Autoregressive Molecular Generation},
  author = {Bian, Kaiwen and Yang, Andrew H. and Parviz, Ali and Mishne, Gal and Wang, Yusu},
  year   = {2025},
  url    = {https://github.com/KevinBian107/MOSAIC},
}