EasyAtom — Technical Whitepaper v4.3

Abstract

EasyAtom is a 16-layer algebraic causal reasoning pipeline that generates drug repurposing hypotheses from a 6-million-triple biomedical knowledge graph without ever using drug-disease "treats" associations as input. The engine combines hyperdimensional computing, causal symbolic inference, Hamiltonian energy scoring, spectral simulation, and world-model forward chaining. On a benchmark of 22,380 known drug-disease pairs spanning 922 drugs and 108 diseases, it achieves Recall@10 = 28.6% in a fully zero-shot inductive setting, against a random baseline of 4.7%. That figure is comparable to the ~27% reported for the supervised Hetionet Rephetio method on a different dataset, although it remains below transductive methods that train on drug-disease data (31–65%). An external evaluation on the Broad Drug Repurposing Hub yields Recall@10 = 21.3% zero-shot. Post-corpus PubMed validation finds independent post-2023 evidence for 36% (9 of 25) of the top-25 novel candidates. The complete pipeline runs in 113 minutes on a standard desktop PC (no GPU required).

1. Problem Statement

Drug repurposing — identifying new clinical indications for existing approved drugs — reduces development cost and timeline by leveraging established safety profiles. The principal computational challenge is hypothesis generation: predicting which drug-disease pairs are therapeutically relevant from heterogeneous biological data, without positive labels for unseen pairs.

Existing machine learning approaches (knowledge graph embedding, GNNs) achieve Recall@10 of 31–65% but are transductive: they train on 80% of known drug-disease pairs and evaluate on the remaining 20%. They cannot generalize to novel drugs or disease contexts absent from their training set, and they produce scalar scores with no mechanistic interpretability.

2. Corpus & Data Sources

The EasyAtom corpus was frozen in 2023. It integrates five public biomedical databases:

Source	Content	Version	Contribution
DrugBank	Drug → gene target interactions, pharmacology	v5.x	Drug-gene edges (L0)
OMIM	Gene → disease Mendelian associations	2023	Gene-disease edges (L0)
CTD	Chemical → gene curated interactions	2023	Causal chemical-gene (L1)
Hetionet v1.0	Integrated biomedical KG (11 node types)	v1.0	PPI, pathway context (L0–L3)
STRING v11	Protein-protein interaction network	v11	Protein interaction (L3)

Total corpus: 2.56M triples (corpus_1M_3col.tsv) + 3.9M derived hypotheses = ~6M total. SHA-256 fingerprint: 0ff11993fb8746a9f1eb3dcf241e074c486acae5c27d77f0a4a0dd17a6fb9997. No "treats" drug-disease edges are included at any stage.
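
The fingerprint can be checked locally after downloading the corpus. The following is a minimal sketch, assuming the published digest covers the raw bytes of a single file such as corpus_1M_3col.tsv (the text does not state exactly which bytes were hashed). The helper names `sha256_of_file` and `corpus_matches` are illustrative.

```python
import hashlib

# Digest published in this whitepaper.
EXPECTED_SHA256 = "0ff11993fb8746a9f1eb3dcf241e074c486acae5c27d77f0a4a0dd17a6fb9997"

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a large corpus need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def corpus_matches(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected fingerprint."""
    return sha256_of_file(path) == expected.lower()
```

Calling `corpus_matches("corpus_1M_3col.tsv")` would then return True only for a byte-identical copy, under the assumption above.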

3. Architecture — 16-Layer Pipeline

The pipeline processes the corpus through 16 sequential algebraic layers (L0–L15). No layer is trained; all operations are deterministic algebraic transformations.

Layer	Operation	Output
L0 — HDC	Hyperdimensional encoding of all entities into D=1024 binary vectors via XOR/permutation algebra	Entity vector space
L1 — Causal	Do-calculus symbolic inference: drug→gene→disease transitive closure with confounding removal	Causal chains per pair
L2 — HAM	Hamiltonian energy scoring via RK4 integration of simulated quantum Hamiltonian H_D	Energy scores per pair
L3 — ATT	Attractor condensation: fixed-point iteration on disease state space	Disease attractors
L4 — SPE	Born-rule spectral simulation O(N·D²) on classical hardware	Probability amplitudes
L5 — PRI	Causal prime factorization of gene pathways	Prime gene sets
L6 — EMB	Semantic embedding via Jaccard similarity over gene-set overlap	Drug-disease similarity matrix
L7 — GAP	Gap detection: drugs with known targets for a disease but no confirmed association	41,396 novel candidates
L8 — KO	DWPC knockout perturbation: score impact of gene silencing on drug-disease paths	6,397 candidates; 50 evaluated
L9 — INT8	Int8 distillation into 10 domain shards for mobile deployment	10 × compressed shards
L10 — WM	World model forward chaining: urgency scoring via knowledge gap propagation	Urgency-ranked candidates
L11 — COM	Combination synergy scoring (drug cocktails)	47 DDI-safe cocktails
L12 — REP	Full repurposing matrix cross-product	266,561 candidates
L13–L15	DDI safety filter, N-of-1 protocol generation, index	20 N-of-1 protocols
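
For readers unfamiliar with hyperdimensional computing, the L0 row can be illustrated with a generic sketch of binary HDC using XOR binding and cyclic-shift permutation at D=1024. This is not EasyAtom's actual encoder: the seeding scheme, the entity and relation names, and the way a triple is composed in `encode_triple` are all assumptions made for illustration.

```python
import hashlib
import random

D = 1024  # dimensionality stated for L0
MASK = (1 << D) - 1

def random_hv(name):
    """Deterministic pseudo-random D-bit hypervector for an entity name."""
    seed = int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")
    return random.Random(seed).getrandbits(D)

def bind(a, b):
    """XOR binding: self-inverse, so bind(bind(a, b), b) == a."""
    return a ^ b

def permute(v, shift=1):
    """Cyclic left rotation, commonly used to mark role or position."""
    shift %= D
    return ((v << shift) | (v >> (D - shift))) & MASK

def similarity(a, b):
    """1 - normalized Hamming distance; close to 0.5 for unrelated vectors."""
    return 1.0 - bin(a ^ b).count("1") / D

def encode_triple(subject, relation, obj):
    """Hypothetical triple encoding: subject XOR relation XOR rotated object."""
    return bind(bind(random_hv(subject), random_hv(relation)), permute(random_hv(obj)))
```

Because XOR is its own inverse, binding an encoded triple with its subject and relation vectors recovers the rotated object vector exactly, which is what makes this style of algebra queryable without training.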

4. Benchmark Protocol

The benchmark evaluates whether the engine can recover known drug-disease associations (from the corpus) when those associations are excluded as inputs. This is a strict zero-shot inductive protocol:

Pairs: 22,380 known drug-disease pairs across 922 drugs and 108 diseases (a subset of the 99,576 possible combinations)
Input restriction: No "treats" edge is used at any layer
Task: For each drug, rank all 108 diseases; a "hit" = known disease in top-K
Exclusions: 47 pairs with <2 corpus entries excluded (low-evidence); 22,333 used
Metric: Recall@K (fraction of known pairs recovered in top-K), NDCG@10 (rank quality)

R@K = |{(drug,disease) : rank(disease|drug) ≤ K, is_known=1}| / |{known pairs}|
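
The formula above can be sketched in a few lines. This is an illustrative implementation of the per-pair definition, not the benchmark script itself; the input layout (a dict of ranked disease lists per drug and a set of known pairs) and the toy names are assumptions.

```python
def recall_at_k(rankings, known_pairs, k):
    """Fraction of known (drug, disease) pairs whose disease ranks in the top k for that drug."""
    if not known_pairs:
        return 0.0
    hits = 0
    for drug, disease in known_pairs:
        top_k = rankings.get(drug, [])[:k]  # ranked best-first
        if disease in top_k:
            hits += 1
    return hits / len(known_pairs)

# Toy data (hypothetical names, not taken from the benchmark).
rankings = {
    "drugA": ["d3", "d1", "d2"],
    "drugB": ["d2", "d3", "d1"],
}
known_pairs = {("drugA", "d1"), ("drugA", "d2"), ("drugB", "d1")}
```

On the toy data, one of the three known pairs falls in the top 2, so `recall_at_k(rankings, known_pairs, 2)` is one third. The denominator is the number of known pairs, not the number of drugs, which matches the definition given above.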

5. Results

Metric	Value	Interpretation
Recall@1	4.0%	Known disease ranked #1 for that drug
Recall@5	17.4%	Known disease ranked in top 5 for that drug
Recall@10	28.6%	Main reported metric (zero-shot)
Recall@50	54.7%	Just over half of the known pairs recovered in top 50
NDCG@10	0.822	High rank quality — hits rank 1–3, not 8–10
MRR	0.113	Mean reciprocal rank
Causal enrichment R₇	2.68×	Pairs with causal chain 2.68× more likely at rank 1

5.1 Comparison with Published Methods

Method	Recall@10	Setting	Trains on drug-disease?
Random baseline (our corpus)	4.7%	Zero-shot	No
Popularity baseline (our corpus)	11.2%	Zero-shot	No
EasyAtom v4.3 (internal corpus)	28.6%	Zero-shot inductive	No
EasyAtom v4.3 (Broad Hub ext.)	21.3%	Zero-shot inductive	No
Hetionet Rephetio 2017	~27%	Supervised (different dataset)	Yes
TransE (RepoDB)	~31%	Transductive	Yes — 80% training split
RotatE / DRKG	38–42%	Transductive	Yes — 80% training split
CompGCN / NBFNet	45–65%	Transductive	Yes — 80% training split

Transductive methods train on a held-in split (typically 80%) of the same dataset they are evaluated on, whereas EasyAtom sees no drug-disease labels at any stage. Its gain over the 4.7% random baseline and the 11.2% popularity baseline therefore comes from drug→gene→disease structure in the corpus rather than from drug-disease supervision.
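
The fold changes implied by the Recall@10 column can be recomputed directly. The sketch below uses only figures copied from the comparison table; the dictionary keys are shorthand labels introduced here.

```python
# Recall@10 values (percent) copied from the comparison table.
RECALL_AT_10 = {
    "random": 4.7,
    "popularity": 11.2,
    "easyatom_internal": 28.6,
    "easyatom_broad_hub": 21.3,
}

def fold_over(method, baseline, table=RECALL_AT_10):
    """Ratio of a method's Recall@10 to a baseline's Recall@10."""
    return table[method] / table[baseline]

for method in ("easyatom_internal", "easyatom_broad_hub"):
    for baseline in ("random", "popularity"):
        print(f"{method} vs {baseline}: {fold_over(method, baseline):.1f}x")
```

With these inputs, the internal benchmark comes out at about 6.1× the random baseline and the Broad Hub evaluation at about 4.5×.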

5.2 External Validations

Validation	Dataset	Result	Note
A — PubMed	NCBI PubMed (post-2023)	36% support (9/25 candidates)	Independent post-corpus evidence for novel candidates
B — Broad Hub	Broad Repurposing Hub 2020	24/2,222 exact matches	Limited by text-normalization; audit file public
C — Hetionet	Hetionet v1.0 CtD edges	75% Prec@5 (4 mapped pairs)	Low coverage expected: engine outputs novel candidates only
D — Broad Hub (mapped)	Broad Hub + INN alias table	R@10=21.3%, Prec@10=90%	100 unambiguously mapped pairs; primary external benchmark

6. Top Candidates

6.1 Platinum Standard (325 pairs)

325 drug-disease pairs satisfy all three convergence criteria simultaneously: L2 Hamiltonian top-quartile ∩ L7 gap score top-500 ∩ L10 urgency CRITICAL. These represent the highest-confidence novel repurposing hypotheses.
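
The three-way intersection can be sketched with plain set operations. This is an assumed reading of the criteria: it treats higher Hamiltonian and gap scores as better (the text does not state the direction), and the function and parameter names are hypothetical.

```python
def top_fraction(scores, fraction):
    """Keys whose score falls in the top `fraction` of all entries."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction)) if ranked else 0
    return set(ranked[:keep])

def top_n(scores, n):
    """Keys with the n highest scores."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def platinum_pairs(hamiltonian, gap_score, urgency, gap_top_n=500):
    """Pairs in the top Hamiltonian quartile, the top-N gap scores, and urgency CRITICAL."""
    top_ham = top_fraction(hamiltonian, 0.25)
    top_gap = top_n(gap_score, gap_top_n)
    critical = {pair for pair, level in urgency.items() if level == "CRITICAL"}
    return top_ham & top_gap & critical
```

Each argument maps a (drug, disease) pair to its layer output, so a pair survives only if all three layers independently flag it. Since the gap criterion admits at most 500 pairs, the intersection can never exceed that size.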

6.2 Priority Hypothesis

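Loratadine → PDE4B → Alzheimer's Disease. Loratadine, a second-generation antihistamine (H1 antagonist), shows an unexpectedly strong association with PDE4B (L2 Hamiltonian score = 1.460, Jaccard gene overlap = 1.00). PDE4B inhibition has been studied as a mechanism for reducing neuroinflammation and amyloid-β accumulation. No post-2023 PubMed evidence was found for this pair, so the hypothesis appears novel but is also not yet independently supported. Loratadine has an established safety profile, is inexpensive, and is available over the counter; as a second-generation antihistamine, however, it has limited central nervous system penetration, so adequate brain exposure would need to be demonstrated.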
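
To make the "Jaccard gene overlap" figure concrete, here is a minimal sketch assuming the standard Jaccard index over two gene sets. The gene sets in the usage note are hypothetical and do not come from the corpus.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two gene sets."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)
```

A value of 1.00 means the two sets are identical. That also holds when both sets contain a single shared gene, for example `jaccard({"PDE4B"}, {"PDE4B"})`, so the score is best read together with the set sizes.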

7. Limitations

Corpus completeness: Limited to the 108 diseases and 922 drugs in our normalized ontology. Many diseases and drugs in the Broad Hub do not map to this vocabulary (19.1% mapping rate), which likely understates external benchmark performance.
Unlabeled negatives: Pairs outside the known set are treated as unknown rather than as true negatives, so the false-positive rate cannot be estimated without experimental validation.
No experimental validation: No prediction has been tested in vitro or in vivo. All results are computational. TRL 4.
Gene annotation gaps: Some drug-gene interactions in DrugBank are indirect or predicted. The causal chain quality depends on source annotation quality.
Static corpus: Frozen in 2023. New literature published after that date is not incorporated.

8. Causal Traceability

Every EasyAtom output includes a step-by-step audit trace: drug → target gene(s) → pathway → biological process → disease. Each hop is backed by a triple from the corpus with its source database cited. The complete audit dataset is publicly available at easyatom-engine.web.app/audit/.

Example trace for loratadine → Alzheimer's:

loratadine → HRH1 (H1 receptor antagonism, DrugBank DB00455)
HRH1 → PDE4B (co-expression, STRING v11, score=0.89)
PDE4B → cAMP signaling → neuroinflammation suppression (CTD, OMIM:104300)
neuroinflammation → Alzheimer's Disease (OMIM:104300)
L2 Hamiltonian: 1.460 | Jaccard gene overlap: 1.00 | Gap score: top-3%
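
An audit trace of this kind can be represented as a list of hops, each carrying its own citation. The sketch below is a hypothetical schema, not the format of the published audit files; it encodes the first two hops of the trace above and checks that consecutive hops link up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hop:
    source: str
    target: str
    relation: str
    evidence: str  # source database citation for this hop

def is_connected(hops):
    """True when each hop starts where the previous one ended."""
    return all(prev.target == nxt.source for prev, nxt in zip(hops, hops[1:]))

def render(hops):
    """Render a chain as 'a → b → c'."""
    if not hops:
        return ""
    return " → ".join([hops[0].source] + [hop.target for hop in hops])

# First two hops of the loratadine trace, as given in the text.
EXAMPLE = [
    Hop("loratadine", "HRH1", "H1 receptor antagonism", "DrugBank DB00455"),
    Hop("HRH1", "PDE4B", "co-expression", "STRING v11, score=0.89"),
]
```

A connectivity check like `is_connected(EXAMPLE)` is a cheap way to confirm that every step in a trace is backed by a triple whose endpoint matches the next step.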

9. Deployment

The pipeline produces int8-quantized knowledge shards (L9) deployable on Android via a React Native module. Query time is ~40ms on Samsung Galaxy A16 (no network required). The full C++20 pipeline runs on any x86-64 CPU with 16GB RAM and no GPU.
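
The text does not specify the quantization scheme used for the L9 shards. As a point of reference only, the sketch below shows symmetric linear int8 quantization, one common choice; the function names and the per-list scale are assumptions.

```python
def quantize_int8(values):
    """Symmetric linear quantization to [-127, 127]; returns (codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate floating-point values."""
    return [code * scale for code in codes]
```

Each score is stored in one byte plus a shared scale factor, and the reconstruction error per value is bounded by half the scale. That trade-off is what makes shards small enough for on-device lookup.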

Total pipeline: 113 minutes (batch of 22,380 pairs)
Single query at inference: ~40ms (int8 shards)
Zero external dependencies at runtime
License: BSL 1.1 (free for research; commercial use requires license)

10. Availability

Web: easyatom-engine.web.app — live pipeline overview, audit data, validation results
GitHub: github.com/Adrian27791/easyatom-engine — paper (arXiv preprint), validation scripts (A–D), BSL 1.1 license
Audit data: All benchmark TSVs, validation outputs, and the knowledge manifest publicly downloadable
Contact: info@easyhelpcare.com
