Drug repurposing — identifying new clinical indications for existing approved drugs — reduces development cost and timeline by leveraging established safety profiles. The principal computational challenge is hypothesis generation: predicting which drug-disease pairs are therapeutically relevant from heterogeneous biological data, without positive labels for unseen pairs.
Existing machine learning approaches (knowledge graph embedding, GNNs) achieve Recall@10 of 31–65% but are transductive: they train on 80% of known drug-disease pairs and evaluate on the remaining 20%. They cannot generalize to novel drugs or disease contexts absent from their training set, and they produce scalar scores with no mechanistic interpretability.
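The difference between a transductive split and an inductive one can be made concrete with a small sketch. The function names and the toy `pairs` list below are illustrative assumptions, not part of any cited method; note also that EasyAtom itself uses no drug-disease training pairs at all, so neither split describes its protocol exactly.

```python
import random

def transductive_split(pairs, test_frac=0.2, seed=0):
    """Hold out a random fraction of pairs; held-out drugs usually also occur in training."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def inductive_split(pairs, test_frac=0.2, seed=0):
    """Hold out whole drugs, so no test drug contributes any training pair."""
    rng = random.Random(seed)
    drugs = sorted({drug for drug, _ in pairs})
    rng.shuffle(drugs)
    n_test = max(1, int(len(drugs) * test_frac))
    held_out = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in held_out]
    test = [p for p in pairs if p[0] in held_out]
    return train, test

# Toy data (hypothetical): 10 drugs, 5 diseases each.
pairs = [(f"drug{i % 10}", f"disease{i}") for i in range(50)]
```

Under the transductive split a model can memorize per-drug embeddings during training; under the inductive split every test drug is unseen, which is the harder setting.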
The EasyAtom corpus was frozen in 2023. It integrates five public biomedical databases:
| Source | Content | Version | Contribution |
|---|---|---|---|
| DrugBank | Drug → gene target interactions, pharmacology | v5.x | Drug-gene edges (L0) |
| OMIM | Gene → disease Mendelian associations | 2023 | Gene-disease edges (L0) |
| CTD | Chemical → gene curated interactions | 2023 | Causal chemical-gene (L1) |
| Hetionet v1.0 | Integrated biomedical KG (11 node types) | v1.0 | PPI, pathway context (L0–L3) |
| STRING v11 | Protein-protein interaction network | v11 | Protein interaction (L3) |
Total corpus: 2.56M triples (corpus_1M_3col.tsv) + 3.9M derived hypotheses ≈ 6.5M total. SHA-256 fingerprint: 0ff11993fb8746a9f1eb3dcf241e074c486acae5c27d77f0a4a0dd17a6fb9997. No "treats" drug-disease edges are included at any stage.
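A reader who obtains the corpus can check it against the published fingerprint with a few lines of standard-library code. This is a minimal sketch; it assumes the fingerprint covers a single file (for example the TSV named above), which the text does not state explicitly.

```python
import hashlib

# Fingerprint quoted in the text above.
EXPECTED = "0ff11993fb8746a9f1eb3dcf241e074c486acae5c27d77f0a4a0dd17a6fb9997"

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_fingerprint(path, expected=EXPECTED):
    """Return True if the file's SHA-256 hex digest equals the expected fingerprint."""
    return sha256_of_file(path) == expected.lower()
```

Usage would be `matches_fingerprint("corpus_1M_3col.tsv")`, with the path adjusted to wherever the file is stored.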
The pipeline processes the corpus through 16 sequential algebraic layers (L0–L15). No layer is trained; all operations are deterministic algebraic transformations.
| Layer | Operation | Output |
|---|---|---|
| L0 — HDC | Hyperdimensional encoding of all entities into D=1024 binary vectors via XOR/permutation algebra | Entity vector space |
| L1 — Causal | Do-calculus symbolic inference: drug→gene→disease transitive closure with confounding removal | Causal chains per pair |
| L2 — HAM | Hamiltonian energy scoring via RK4 integration of simulated quantum Hamiltonian H_D | Energy scores per pair |
| L3 — ATT | Attractor condensation: fixed-point iteration on disease state space | Disease attractors |
| L4 — SPE | Born-rule spectral simulation O(N·D²) on classical hardware | Probability amplitudes |
| L5 — PRI | Causal prime factorization of gene pathways | Prime gene sets |
| L6 — EMB | Semantic embedding via Jaccard similarity over gene-set overlap | Drug-disease similarity matrix |
| L7 — GAP | Gap detection: drugs with known targets for a disease but no confirmed association | 41,396 novel candidates |
| L8 — KO | DWPC knockout perturbation: score impact of gene silencing on drug-disease paths | 6,397 candidates; 50 evaluated |
| L9 — INT8 | Int8 distillation into 10 domain shards for mobile deployment | 10 × compressed shards |
| L10 — WM | World model forward chaining: urgency scoring via knowledge gap propagation | Urgency-ranked candidates |
| L11 — COM | Combination synergy scoring (drug cocktails) | 47 DDI-safe cocktails |
| L12 — REP | Full repurposing matrix cross-product | 266,561 candidates |
| L13–L15 | DDI safety filter, N-of-1 protocol generation, index | 20 N-of-1 protocols |
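
To illustrate the kind of algebra the L0 row refers to, the sketch below shows generic hyperdimensional computing with D=1024 binary vectors, XOR binding, and cyclic permutation. It is not EasyAtom's actual encoder: the triple-encoding scheme, the function names, and the use of Python integers as bit vectors are all assumptions made for illustration.

```python
import random

D = 1024                 # dimensionality stated for L0
MASK = (1 << D) - 1      # keeps values within D bits

def random_hv(rng):
    """Draw a random D-bit hypervector, stored as a Python int."""
    return rng.getrandbits(D)

def bind(a, b):
    """XOR binding; it is its own inverse, so bind(bind(a, b), b) == a."""
    return a ^ b

def permute(v, k=1):
    """Cyclic left rotation by k bits, used to mark the role or position of a vector."""
    k %= D
    return ((v << k) | (v >> (D - k))) & MASK

def hamming_similarity(a, b):
    """Fraction of matching bits: 1.0 for identical vectors, about 0.5 for unrelated ones."""
    return 1.0 - bin(a ^ b).count("1") / D

rng = random.Random(0)
drug, relation, gene = (random_hv(rng) for _ in range(3))

# Hypothetical encoding of one (drug, relation, gene) triple.
edge = bind(permute(drug), bind(relation, gene))

# Unbinding with the known relation and gene recovers the permuted drug vector.
recovered = bind(edge, bind(relation, gene))
```

Because every operation is a fixed bitwise transformation, the encoding is deterministic given the seed, which is consistent with the claim that no layer is trained.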
The benchmark evaluates whether the engine can recover known drug-disease associations (from the corpus) when those associations are excluded as inputs. This is a strict zero-shot inductive protocol:
| Metric | Value | Interpretation |
|---|---|---|
| Recall@1 | 4.0% | Known disease ranked #1 for that drug |
| Recall@5 | 17.4% | |
| Recall@10 | 28.6% | Main reported metric (zero-shot) |
| Recall@50 | 54.7% | Just over half of known pairs recovered in the top 50 |
| NDCG@10 | 0.822 | High rank quality — hits rank 1–3, not 8–10 |
| MRR | 0.113 | Mean reciprocal rank |
| Causal enrichment R₇ | 2.68× | Pairs with causal chain 2.68× more likely at rank 1 |
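
For reference, the ranking metrics in the table can be computed as sketched below. The sketch assumes one held-out disease per drug query and takes `ranks` as the 1-based rank of that disease (or `None` if it was never retrieved); the text does not specify the exact averaging convention used for the reported figures.

```python
import math

def recall_at_k(ranks, k):
    """Fraction of queries whose held-out item appears within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank; queries with no retrieved item contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def ndcg_at_k(ranks, k):
    """NDCG with a single relevant item per query: the ideal DCG is 1,
    so each query contributes 1 / log2(1 + rank) if the rank is within k."""
    gains = [
        1.0 / math.log2(1 + r) if r is not None and r <= k else 0.0
        for r in ranks
    ]
    return sum(gains) / len(ranks)

# Toy ranks for five hypothetical queries.
ranks = [1, 3, 12, None, 7]
```

Under this particular definition NDCG@10 can never exceed Recall@10, so a reported NDCG@10 above Recall@10 implies a different convention, for example averaging only over queries with at least one top-10 hit or allowing several relevant diseases per drug.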
| Method | Recall@10 | Setting | Trains on drug-disease? |
|---|---|---|---|
| Random baseline (our corpus) | 4.7% | Zero-shot | No |
| Popularity baseline (our corpus) | 11.2% | Zero-shot | No |
| EasyAtom v4.3 (internal corpus) | 28.6% | Zero-shot inductive | No |
| EasyAtom v4.3 (Broad Hub ext.) | 21.3% | Zero-shot inductive | No |
| Hetionet Rephetio 2017 | ~27% | Supervised (different dataset) | Yes |
| TransE (RepoDB) | ~31% | Transductive | Yes — 80% training split |
| RotatE / DRKG | 38–42% | Transductive | Yes — 80% training split |
| CompGCN / NBFNet | 45–65% | Transductive | Yes — 80% training split |
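Transductive methods are trained on a held-in portion (typically 80%) of the same drug-disease pairs they are evaluated on. EasyAtom sees zero drug-disease labels at any stage. On the internal corpus, its Recall@10 of 28.6% is about 6.1× the random baseline (4.7%) and about 2.6× the popularity baseline (11.2%); the Broad Hub result of 21.3% is about 4.5× the same random baseline. Since no drug-disease labels are used, we attribute this margin to causal signal from the drug→gene→disease algebra.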
| Validation | Dataset | Result | Note |
|---|---|---|---|
| A — PubMed | NCBI PubMed (post-2023) | 36% support (9/25 candidates) | Independent post-corpus evidence for novel candidates |
| B — Broad Hub | Broad Repurposing Hub 2020 | 24/2,222 exact matches | Limited by text-normalization; audit file public |
| C — Hetionet | Hetionet v1.0 CtD edges | 75% Prec@5 (4 mapped pairs) | Low coverage expected: engine outputs novel candidates only |
| D — Broad Hub (mapped) | Broad Hub + INN alias table | R@10=21.3%, Prec@10=90% | 100 unambiguously mapped pairs; primary external benchmark |
325 drug-disease pairs satisfy all three convergence criteria simultaneously: L2 Hamiltonian top-quartile ∩ L7 gap score top-500 ∩ L10 urgency CRITICAL. These represent the highest-confidence novel repurposing hypotheses.
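The three-way convergence filter amounts to a set intersection, sketched below. The input dictionaries, their key format, and the assumption that a higher Hamiltonian or gap score is better are illustrative; the text does not state the score orientation.

```python
def convergent_pairs(ham_scores, gap_scores, urgency, top_gap=500):
    """Return pairs that pass all three criteria.

    ham_scores, gap_scores: dict mapping (drug, disease) -> float (higher assumed better)
    urgency: dict mapping (drug, disease) -> urgency label such as "CRITICAL"
    """
    # Criterion 1: top quartile by L2 Hamiltonian score.
    by_ham = sorted(ham_scores, key=ham_scores.get, reverse=True)
    top_quartile = set(by_ham[: max(1, len(by_ham) // 4)])

    # Criterion 2: top-N by L7 gap score.
    by_gap = sorted(gap_scores, key=gap_scores.get, reverse=True)
    top_gap_set = set(by_gap[:top_gap])

    # Criterion 3: L10 urgency label is CRITICAL.
    critical = {pair for pair, level in urgency.items() if level == "CRITICAL"}

    return top_quartile & top_gap_set & critical

# Toy example with hypothetical pairs and scores.
ham = {("a", "x"): 1.5, ("b", "y"): 0.2, ("c", "z"): 0.9, ("d", "w"): 0.1}
gap = {("a", "x"): 5.0, ("c", "z"): 9.0}
urg = {("a", "x"): "CRITICAL", ("c", "z"): "CRITICAL", ("b", "y"): "LOW"}
selected = convergent_pairs(ham, gap, urg, top_gap=2)
```

Requiring agreement between independent layers trades recall for precision: a pair that scores well on only one layer is dropped.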
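Loratadine → PDE4B → Alzheimer's Disease. Loratadine (a second-generation antihistamine and H1 antagonist) shows an anomalously strong association with PDE4B (L2 Hamiltonian score = 1.460, Jaccard gene overlap = 1.00). PDE4B inhibition has been investigated as a mechanism for reducing neuroinflammation and amyloid-β accumulation. No post-2023 PubMed evidence was found for this pair, so the hypothesis appears to be novel, although absence of literature is not evidence of efficacy. Loratadine is inexpensive, available over the counter, and has an established safety profile; as a second-generation antihistamine it is generally regarded as having limited CNS penetration, so adequate brain exposure would need to be confirmed.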
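The Jaccard gene overlap quoted above is the size of the intersection of two gene sets divided by the size of their union. A minimal sketch follows; the gene sets in the example are hypothetical apart from PDE4B, which is named in the text.

```python
def jaccard(genes_a, genes_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Identical gene sets give the maximum score of 1.0.
identical = jaccard({"PDE4B"}, {"PDE4B"})

# Partial overlap (hypothetical gene names): one shared gene out of three distinct genes.
partial = jaccard({"GENE_A", "GENE_B"}, {"GENE_B", "GENE_C"})
```

A score of 1.00 means the two gene sets are identical. This can arise from very small sets, including a single shared gene, so the value is best read alongside the set sizes.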
Every EasyAtom output includes a step-by-step audit trace: drug → target gene(s) → pathway → biological process → disease. Each hop is backed by a triple from the corpus with its source database cited. The complete audit dataset is publicly available at easyatom-engine.web.app/audit/.
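One way to represent such a trace in code is as an ordered list of sourced triples that must chain from the drug to the disease. The sketch below is an assumption about a reasonable data shape, not the published audit format; the class name, field names, and placeholder entities are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    subject: str
    predicate: str
    obj: str
    source: str  # database the supporting triple came from

def validate_trace(hops, drug, disease):
    """Check that the hops form an unbroken chain from drug to disease
    and that every hop cites a non-empty source."""
    if not hops or hops[0].subject != drug or hops[-1].obj != disease:
        return False
    chained = all(a.obj == b.subject for a, b in zip(hops, hops[1:]))
    return chained and all(h.source for h in hops)

# Placeholder trace; entity names and sources are illustrative only.
trace = [
    Hop("DRUG_X", "targets", "GENE_Y", "SOURCE_DB_1"),
    Hop("GENE_Y", "participates_in", "PATHWAY_Z", "SOURCE_DB_2"),
    Hop("PATHWAY_Z", "associated_with", "DISEASE_W", "SOURCE_DB_3"),
]
```

A validator of this kind makes the "each hop is backed by a triple" property machine-checkable rather than a matter of inspection.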
Example trace for loratadine → Alzheimer's:
The pipeline produces int8-quantized knowledge shards (L9) deployable on Android via a React Native module. Query time is ~40 ms on a Samsung Galaxy A16 (no network required). The full C++20 pipeline runs on any x86-64 CPU with 16 GB RAM and no GPU.
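As an illustration of what int8 quantization involves, the sketch below applies symmetric per-tensor quantization to a list of floating-point scores. This is a common generic scheme and is assumed here for illustration; the text does not describe the quantization method EasyAtom actually uses.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]
    using a single scale derived from the largest absolute value."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats; the error per value is at most scale / 2."""
    return [q * scale for q in quantized]

# Toy scores (hypothetical, apart from 1.460 which appears in the text).
scores = [1.460, -0.5, 0.0, 0.73]
q, scale = quantize_int8(scores)
restored = dequantize(q, scale)
```

Each value then occupies one byte instead of four or eight, which is what makes on-device storage and lookup practical, at the cost of a bounded rounding error.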