Beyond Flat Walks: Compositional Abstraction for Autoregressive Molecular Generation

Halıcıoğlu Data Science Institute, UC San Diego

Abstract

Autoregressive models for molecular graph generation typically operate on flattened sequences of atoms and bonds, discarding the rich multi-scale structure inherent to molecules. We introduce MOSAIC (Multi-scale Organization via Structural Abstraction In Composition), a framework that lifts autoregressive generation from flat token walks to compositional, hierarchy-aware sequences. MOSAIC provides a unified three-stage pipeline: (1) hierarchical coarsening that recursively groups atoms into motif-like clusters using graph-theoretic methods (spectral clustering, hierarchical agglomerative clustering, and motif-aware variants), (2) structured tokenization that serializes the resulting multi-level hierarchy into sequences that explicitly encode parent-child relationships, partition boundaries, and edge connectivity at every level, and (3) autoregressive generation with a standard Transformer decoder that learns to produce these structured sequences. We evaluate MOSAIC on the MOSES and COCONUT molecular benchmarks, comparing four tokenization schemes of increasing hierarchical expressiveness. Our experiments show that hierarchy-aware tokenizations improve chemical validity and structural diversity over flat baselines while enabling control over generated substructures. MOSAIC provides a principled, modular foundation for structure-aware molecular generation.

Introduction

Molecular graphs are a natural representation for drug discovery and property prediction: nodes are atoms, edges are bonds, and validity and function depend heavily on recurring substructures — rings, functional groups, and their connectivity. While discriminative graph neural networks excel at predicting properties from such graphs, generating novel, valid molecules remains challenging. Autoregressive transformers, which have scaled successfully in language and code, offer an attractive backbone for graph generation when graphs are first tokenized into sequences. Existing tokenization schemes typically flatten the graph into a single linear order — a random walk or depth-first traversal — so that a causal transformer can predict the next token. These flat walks capture local adjacency and some long-range edges via back-references, but they do not expose the graph’s compositional hierarchy: which atoms belong to which ring or functional group, and how those units connect. The model must therefore infer motif structure implicitly from the token stream, which can lead to broken rings, inconsistent motif statistics, and limited control over high-level chemistry.

MOSAIC addresses this gap by introducing an explicit hierarchical graph (H-graph) abstraction and a three-stage pipeline for autoregressive molecular generation. First, we coarsen the molecular graph into a set of communities and inter-community edges, optionally using domain knowledge (ring and functional-group detection) to guarantee motif cohesion. Second, we tokenize the H-graph into a flat sequence using a coarse-before-fine ordering: the model sees community-level structure before atom-level detail, mirroring the compositional nature of chemistry. Third, we decode generated token sequences back into molecular graphs. All tokenization schemes support a lossless roundtrip from graph to tokens and back.
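To make the coarse-before-fine ordering and the lossless roundtrip concrete, here is a minimal toy sketch in Python. The token names (`<bos>`, `<part>`, `<edges>`) and helper functions are illustrative inventions for this example, not MOSAIC's actual vocabulary or code:

```python
# Toy sketch (NOT the released MOSAIC tokenizer): serialize a partitioned
# graph coarse-before-fine, then reconstruct it exactly from the tokens.

def tokenize(nodes, edges, partition):
    """Communities first, atoms second: emit one <part> block per community
    (atom labels in global order), then all edges as index pairs."""
    toks = ["<bos>"]
    for comm in partition:                       # coarse level first
        toks += ["<part>"] + [nodes[i] for i in comm]
    toks.append("<edges>")
    for u, v in edges:                           # fine-level connectivity
        toks += [str(u), str(v)]
    toks.append("<eos>")
    return toks

def detokenize(toks):
    """Invert tokenize(): recover nodes, edges, and the partition.
    Assumes each community lists its atoms in global index order."""
    body = toks[1:toks.index("<edges>")]
    partition, nodes = [], []
    for t in body:
        if t == "<part>":
            partition.append([])
        else:
            partition[-1].append(len(nodes))
            nodes.append(t)
    tail = toks[toks.index("<edges>") + 1:-1]
    edges = [(int(tail[i]), int(tail[i + 1])) for i in range(0, len(tail), 2)]
    return nodes, edges, partition

# Toy "molecule": 4 atoms, two communities, three bonds.
nodes = ["C", "C", "O", "N"]
edges = [(0, 1), (1, 2), (2, 3)]
partition = [[0, 1], [2, 3]]
toks = tokenize(nodes, edges, partition)
assert detokenize(toks) == (nodes, edges, partition)   # lossless roundtrip
```

Because the community blocks precede all edge detail, a causal model trained on such sequences must commit to the decomposition before predicting connectivity, which is the property the real tokenizers share.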

Pipeline Overview

(a) Encoding. A molecular graph (camptothecin) is decomposed into typed communities (rings, functional groups, singletons) and serialized into a hierarchical token sequence via HDTC.
(b) Decoding. A causal GPT-2 transformer autoregressively generates a token sequence, which is decoded into a functional decomposition and reconstructed into a valid molecular graph.

Coarsening Strategies

The coarsening step determines how atoms are grouped into communities. We explore two families: flexible methods that optimize a graph-theoretic objective without domain knowledge, and constraint-based methods that leverage chemical structure to define communities. Spectral clustering uses the eigenvectors of the graph Laplacian to identify natural clusters, selecting the number of partitions that maximizes Newman–Girvan modularity. Hierarchical agglomerative clustering (HAC) takes a bottom-up approach, iteratively merging the most similar adjacent nodes based on bond-type-weighted distances, which tends to preserve local chemical neighborhoods. Motif-aware coarsening (MC) first identifies known chemical motifs (rings, functional groups) via SMARTS pattern matching, then clusters the remaining atoms with spectral or HAC methods. This ensures that chemically meaningful substructures are preserved as intact units. MC with functional groups (MC+FG) extends this with an expanded library of functional-group patterns for finer-grained control over which substructures remain intact during coarsening.
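The flexible spectral variant can be sketched as follows. This is a simplified stand-in rather than the released implementation: edges are unweighted instead of bond-type-weighted, a sign-pattern bucketing of the Laplacian eigenvectors substitutes for a full k-means step, and the function name `spectral_coarsen` is ours:

```python
# Sketch of spectral coarsening with modularity-based model selection
# (simplified; assumes unweighted edges and uses networkx + numpy).
import numpy as np
import networkx as nx

def spectral_coarsen(G, k_max=4):
    """Embed nodes with Laplacian eigenvectors, form candidate partitions,
    and keep the one maximizing Newman-Girvan modularity."""
    nodes = list(G.nodes)
    L = nx.laplacian_matrix(G, nodelist=nodes).toarray().astype(float)
    _, vecs = np.linalg.eigh(L)                  # eigenvalues ascending
    best = (-1.0, [set(nodes)])
    for k in range(2, min(k_max, len(nodes)) + 1):
        emb = vecs[:, 1:k]                       # skip the constant eigenvector
        # lightweight stand-in for k-means: bucket nodes by sign pattern
        labels = [tuple(x > 0 for x in row) for row in emb]
        comms = {}
        for n, lab in zip(nodes, labels):
            comms.setdefault(lab, set()).add(n)
        q = nx.community.modularity(G, comms.values())
        if q > best[0]:
            best = (q, list(comms.values()))
    return best

# Two triangles joined by a single bridge bond: modularity favors the
# split into the two three-cycles.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)])
q, comms = spectral_coarsen(G)
```

On the bridged-triangles example the Fiedler vector separates the two rings, mirroring how the method isolates ring systems in real molecules.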

Unconditional MOSES generations from each tokenization. The reference molecule is shown alongside outputs from all eight model variants: SENT (flat walk), H-SENT-SC and HDT-SC (spectral coarsening), H-SENT-HAC and HDT-HAC (hierarchical agglomerative coarsening), H-SENT-MC and HDT-MC (motif-constrained coarsening), and HDTC (motif + functional group typing). As the tokenization imposes progressively stronger chemical priors, the generated molecules become more structurally coherent and closer to real drug-like molecules.

Tokenization Schemes

Given a coarsened hierarchy, we serialize it into a flat token sequence for autoregressive modeling. A shared design principle across all tokenizers is coarse-before-fine ordering: community-level structure always precedes atom-level detail, so the model first commits to the high-level decomposition before filling in local connectivity. H-SENT (Hierarchical SENT) encodes multi-level partition structure with explicit partition boundaries and bipartite cross-community edge blocks; each community’s atoms are serialized via a SENT walk with back-edge brackets. HDT (Hierarchical DFS Tokenization) encodes the full tree structure using depth-first traversal with ENTER/EXIT nesting tokens, capturing both intra- and inter-community edges as back-edges during the DFS, with no separate bipartite reconstruction needed. HDTC (HDT with Composition) is the most expressive scheme: it types each community node (ring, functional group, singleton), encodes a super-graph of inter-community bonds, and serializes atom-level detail within each typed community block.
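The ENTER/EXIT nesting at the heart of HDT can be illustrated on a plain graph; in MOSAIC the same traversal runs over the two-level hierarchy, and the token names below are illustrative rather than the exact vocabulary:

```python
# Sketch of HDT-style DFS serialization (illustrative token names):
# tree edges produce ENTER/EXIT nesting, non-tree edges produce BACK
# tokens that close rings without revisiting atoms.

def dfs_tokens(adj, root):
    """Serialize a connected graph as a DFS walk; the result is
    losslessly invertible because nesting recovers the tree and
    BACK tokens recover the remaining (ring-closing) edges."""
    toks, seen, done = [], {root}, set()
    def visit(u):
        toks.append(f"ENTER:{u}")
        for v in sorted(adj[u]):
            e = frozenset((u, v))
            if v not in seen:                # tree edge: descend
                seen.add(v)
                done.add(e)
                visit(v)
            elif e not in done:              # back edge: emit once
                done.add(e)
                toks.append(f"BACK:{v}")
        toks.append(f"EXIT:{u}")
    visit(root)
    return toks

# Three-membered ring (atoms 0-1-2) with a substituent atom 3 on atom 0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
toks = dfs_tokens(adj, 0)
# -> ENTER:0 ENTER:1 ENTER:2 BACK:0 EXIT:2 EXIT:1 ENTER:3 EXIT:3 EXIT:0
```

The balanced ENTER/EXIT pairs are what let HDT encode both levels in one continuous walk, with no separate bipartite edge blocks to reconcile at decode time.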

Hierarchical tokenization schemes illustrated on cholesterol. (A) H-SENT partitions the molecular graph into community blocks (colored regions) with explicit cross-community edges. (B) HDT arranges atoms under partition nodes in a DFS tree; orange curves show graph edges, gray arrows trace the traversal. (C) HDTC adds typed community nodes (R = ring, F = functional group, S = singleton) and blue super-edges between communities.
Coarsening strategies. Visualization of how different coarsening methods (spectral clustering, HAC, motif-aware) decompose molecular graphs into hierarchical communities. Each colored region represents a community of atoms grouped together during the coarsening step.

Results

We evaluate unconditional generation on MOSES (~1M drug-like molecules, 10–26 heavy atoms) and COCONUT (~5K complex natural products, 30–100 heavy atoms, ≥4 rings). All models use a GPT-2 backbone (12 layers, 768 hidden, 12 heads, ~85M parameters) trained with identical hyperparameters. We generate 500 molecules per model and evaluate against 5,000 randomly sampled references from the combined train+test set. We report four representative tokenizers: SENT (flat baseline), H-SENT MC (hierarchical flat-walk), HDT MC (motif-community hierarchy), and HDTC (typed compositional).

Table 1. MOSES and COCONUT full-reference evaluation (500 generated molecules, 5,000 full-reference samples). ↑ = higher is better, ↓ = lower is better. Metric sources: ^a MOSES benchmark, ^b Fréchet ChemNet Distance, ^c PolyGraph Discrepancy, ^d augmented from AutoGraph, ^e ours.

MOSES (full ref.)

| Metric | SENT | H-SENT MC | HDT MC | HDTC |
|---|---|---|---|---|
| Validity^a ↑ | 0.868 | 0.851 | 0.891 | 0.835 |
| Uniqueness^a ↑ | 1.000 | 1.000 | 1.000 | 1.000 |
| Novelty^a ↑ | 0.986 | 0.938 | 0.936 | 0.955 |
| FCD^b ↓ | 2.35 | 2.44 | 2.30 | 2.44 |
| SNN^a ↑ | 0.396 | 0.439 | 0.445 | 0.435 |
| Frag^a ↑ | 0.995 | 0.995 | 0.995 | 0.992 |
| Scaff^a ↑ | 0.860 | 0.896 | 0.864 | 0.828 |
| IntDiv^a ↑ | 0.864 | 0.859 | 0.855 | 0.858 |
| PGD^c ↓ | 0.000 | 0.000 | 0.000 | 0.051 |
| FG MMD^d ↓ | 0.002 | 0.002 | 0.002 | 0.002 |
| SMARTS MMD^d ↓ | 0.002 | 0.003 | 0.002 | 0.002 |
| Ring MMD^d ↓ | 0.004 | 0.006 | 0.004 | 0.006 |
| BRICS MMD^d ↓ | 0.016 | 0.018 | 0.018 | 0.022 |
| Motif Rate^e ↑ | 0.790 | 0.832 | 0.806 | 0.836 |

COCONUT (full ref.)

| Metric | SENT | H-SENT MC | HDT MC | HDTC |
|---|---|---|---|---|
| Validity^a ↑ | 0.581 | 0.881 | 0.878 | 0.893 |
| Uniqueness^a ↑ | 1.000 | 0.798 | 0.855 | 0.772 |
| Novelty^a ↑ | 1.000 | 0.887 | 0.927 | 0.925 |
| FCD^b ↓ | 5.40 | 4.60 | 3.25 | 3.72 |
| SNN^a ↑ | 0.264 | 0.783 | 0.779 | 0.767 |
| Frag^a ↑ | 0.963 | 0.954 | 0.969 | 0.972 |
| Scaff^a ↑ | 0.000 | 0.143 | 0.114 | 0.116 |
| IntDiv^a ↑ | 0.884 | 0.880 | 0.881 | 0.879 |
| PGD^c ↓ | 0.308 | 0.317 | 0.000 | 0.125 |
| FG MMD^d ↓ | 0.002 | 0.002 | 0.002 | 0.002 |
| SMARTS MMD^d ↓ | 0.003 | 0.004 | 0.002 | 0.003 |
| Ring MMD^d ↓ | 0.011 | 0.004 | 0.002 | 0.003 |
| BRICS MMD^d ↓ | 0.038 | 0.045 | 0.036 | 0.036 |
| Motif Rate^e ↑ | 0.575 | 0.647 | 0.645 | 0.708 |

On MOSES (simple drug-like molecules), hierarchy provides modest but consistent improvements: SENT achieves 86.8% validity while HDT MC reaches 89.1% with the best FCD (2.30) and PGD (0.000). H-SENT MC achieves the best scaffold similarity (0.896). HDTC leads on motif rate (0.836). All four achieve near-identical motif-level fidelity (FG MMD ≤ 0.002). On COCONUT (complex natural products), the gap is dramatic: SENT drops to 58.1% validity while all hierarchical tokenizers achieve 87–90%, a 31-point improvement. No single model dominates distributional metrics: H-SENT MC leads SNN (0.783) and scaffold similarity (0.143), HDT MC achieves perfect PGD (0.000) and the best FCD (3.25), and HDTC leads motif rate (0.708) and fragment similarity (0.972). All hierarchical models achieve comparable distributional fidelity (FCD 3.1–4.6, all MMDs ≤ 0.004), suggesting that the choice of tokenizer and coarsening is less about overall quality than about which specific aspects of the distribution to optimize.

Full Results

MOSES Full Results (all 8 model variants)
Table 2. MOSES: unconditional generation quality and motif-level fidelity (500 generated molecules, evaluated against 5,000 randomly sampled references from the combined train+test set).

| Metric | SENT | H-SENT MC | H-SENT SC | H-SENT HAC | HDT MC | HDT SC | HDT HAC | HDTC |
|---|---|---|---|---|---|---|---|---|
| Validity ↑ | 0.868 | 0.851 | 0.842 | 0.209 | 0.891 | 0.220 | 0.213 | 0.835 |
| Uniqueness ↑ | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Novelty ↑ | 0.986 | 0.938 | 0.811 | 0.876 | 0.936 | 0.882 | 0.859 | 0.955 |
| FCD ↓ | 2.35 | 2.44 | 2.57 | 10.81 | 2.30 | 9.27 | 10.89 | 2.44 |
| SNN ↑ | 0.396 | 0.439 | 0.436 | 0.462 | 0.445 | 0.454 | 0.459 | 0.435 |
| Frag ↑ | 0.995 | 0.995 | 0.993 | 0.951 | 0.995 | 0.962 | 0.950 | 0.992 |
| Scaff ↑ | 0.860 | 0.896 | 0.710 | 0.669 | 0.864 | 0.603 | 0.687 | 0.828 |
| IntDiv ↑ | 0.864 | 0.859 | 0.861 | 0.846 | 0.855 | 0.853 | 0.855 | 0.858 |
| PGD ↓ | 0.000 | 0.000 | 0.045 | 0.710 | 0.000 | 0.719 | 0.763 | 0.051 |
| FG MMD ↓ | 0.002 | 0.002 | 0.002 | 0.006 | 0.002 | 0.005 | 0.006 | 0.002 |
| SMARTS MMD ↓ | 0.002 | 0.003 | 0.003 | 0.013 | 0.002 | 0.007 | 0.013 | 0.002 |
| Ring MMD ↓ | 0.004 | 0.006 | 0.002 | 0.023 | 0.004 | 0.012 | 0.019 | 0.006 |
| BRICS MMD ↓ | 0.016 | 0.018 | 0.020 | 0.061 | 0.018 | 0.055 | 0.061 | 0.022 |
| Motif Rate ↑ | 0.790 | 0.832 | 0.725 | 0.900 | 0.806 | 0.868 | 0.883 | 0.836 |
COCONUT Full Results (all 8 model variants)
Table 3. COCONUT: unconditional generation quality and motif-level fidelity (500 generated molecules, evaluated against 5,000 randomly sampled references from the combined train+test set).

| Metric | SENT | H-SENT MC | H-SENT SC | H-SENT HAC | HDT MC | HDT SC | HDT HAC | HDTC |
|---|---|---|---|---|---|---|---|---|
| Validity ↑ | 0.581 | 0.881 | 0.875 | 0.889 | 0.878 | 0.885 | 0.899 | 0.893 |
| Uniqueness ↑ | 1.000 | 0.798 | 0.878 | 0.771 | 0.855 | 0.875 | 0.872 | 0.772 |
| Novelty ↑ | 1.000 | 0.887 | 0.919 | 0.938 | 0.927 | 0.913 | 0.939 | 0.925 |
| FCD ↓ | 5.40 | 4.60 | 3.12 | 4.28 | 3.25 | 3.40 | 3.18 | 3.72 |
| SNN ↑ | 0.264 | 0.783 | 0.771 | 0.747 | 0.779 | 0.777 | 0.782 | 0.767 |
| Frag ↑ | 0.963 | 0.954 | 0.972 | 0.968 | 0.969 | 0.964 | 0.969 | 0.972 |
| Scaff ↑ | 0.000 | 0.143 | 0.141 | 0.125 | 0.114 | 0.151 | 0.115 | 0.116 |
| IntDiv ↑ | 0.884 | 0.880 | 0.882 | 0.881 | 0.881 | 0.880 | 0.880 | 0.879 |
| PGD ↓ | 0.308 | 0.317 | 0.054 | 0.377 | 0.000 | 0.027 | 0.000 | 0.125 |
| FG MMD ↓ | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 |
| SMARTS MMD ↓ | 0.003 | 0.004 | 0.002 | 0.004 | 0.002 | 0.002 | 0.002 | 0.003 |
| Ring MMD ↓ | 0.011 | 0.004 | 0.002 | 0.004 | 0.002 | 0.002 | 0.002 | 0.003 |
| BRICS MMD ↓ | 0.038 | 0.045 | 0.034 | 0.036 | 0.036 | 0.040 | 0.036 | 0.036 |
| Motif Rate ↑ | 0.575 | 0.647 | 0.616 | 0.603 | 0.645 | 0.632 | 0.634 | 0.708 |
Generation Gallery
Generation gallery (MOSES). Reference molecules from the MOSES dataset (left column) paired with the closest-sized valid generation from each model variant. Each row targets a different molecular size.
Generation gallery (COCONUT). Reference molecules from the COCONUT dataset (left column) paired with the closest-sized valid generation from each model variant. Rows span the atom-count range of the dataset, from small fragments to large natural products.

Discussion

Hierarchy is necessary for complex molecules. On MOSES (~20 atoms), flat walks achieve 86.8% validity and hierarchy adds only 2.3 points. On COCONUT (30–100 atoms), flat walks drop to 58% validity while all hierarchical tokenizers reach 87–90%, a 31-point gap. The benefit scales with molecular complexity: larger molecules have real compositional structure that hierarchy can exploit.

Enumeration and traversal are complementary. H-SENT serializes molecules as a catalog of parts plus a wiring diagram (explicit index blocks and bipartite edge lists), excelling at distributional precision (best FCD 3.12, best Subst TV/KL). HDT serializes molecules as one continuous DFS walk with nesting, excelling at valid and diverse generation (best validity 89.9%, best PGD 0.000). The coarsening strategy amplifies this: spectral coarsening pairs with H-SENT’s enumeration, HAC pairs with HDT’s traversal.

Coarsening robustness scales with molecular size. On small MOSES molecules, only chemistry-aware MC coarsening works (generic methods drop to ≤22% validity). On complex COCONUT molecules, all coarsening methods achieve 87–90% validity, because larger molecules contain genuine hierarchical motifs that even naive algorithms can discover.

Typed decomposition gives motif precision, not distributional dominance. HDTC’s R/F/S type labels yield the highest motif rate (0.708) and fragment similarity (0.972), but untyped models match or beat it on FCD, SNN, and PGD with greater diversity. The diversity cost of hierarchy is a one-time flat-vs-hierarchical penalty, not a gradient that worsens with more constraint.

Substructure metrics generalize better than molecule-level similarity. SNN and scaffold similarity drop 2–5× between full-reference and test-only evaluation, while fragment similarity and MMD metrics remain stable. This gap is uniform across all models, driven by the small training set (5K molecules) rather than tokenizer-specific memorization.

BibTeX Citation

@article{bian2025mosaic,
  title  = {Beyond Flat Walks: Compositional Abstraction for Autoregressive Molecular Generation},
  author = {Bian, Kaiwen and Yang, Andrew H. and Parviz, Ali and Mishne, Gal and Wang, Yusu},
  year   = {2025},
  url    = {https://github.com/KevinBian107/MOSAIC},
}