Implementation Plan
Research prototype — proving the core thesis with real code, real payments, and real numbers.
Project Philosophy
This is a research prototype, not a production system. The goal is to prove the core thesis — "Lightning micropayments can coordinate quality-verified compute contributions" — with real code, real payments, and real numbers. Start small, validate incrementally, publish results.
Together AI proved decentralized training could work at meaningful scale — then abandoned it because centralized infrastructure was a better business. The thesis of this project is that Lightning micropayments change the equation: per-contribution payment granularity, near-zero transaction costs, and no token overhead make the economics work where token-based systems failed.
The protocol has two modes that share the same L402 infrastructure: training coordination (gradient exchange with quality-proportional payment) and autoresearch bounties (AI agents compete to optimize any quantifiable metric, paid per validated improvement). Training is the hard technical problem that proves the protocol. Autoresearch bounties are the scalable product — they require no GPU, run on any hardware, and have an essentially unbounded addressable market. Both are developed in parallel.
Two Tracks, Shared Infrastructure
| | Track A: Training | Track B: Autoresearch |
|---|---|---|
| What | Decentralized model training with gradient exchange | AI agents compete to optimize anything with a metric |
| Hardware | GPU / Apple Silicon (16+ GB VRAM) | Any computer that can run a coding agent |
| Coordination | Synchronized ~70s rounds, SparseLoCo compression | Fully independent — agents never coordinate |
| Verification | Gradient quality scoring (loss delta) | Deterministic: did the held-out metric improve? |
| Shared infra | L402 payment gating, hold invoice escrow, coordinator validation, Lightning settlement (shared by both tracks) | |
| Phases | 0 → 1 → 2 → 3 | B0 → B1 → B2 (starts at Phase 1) |
Track B starts as soon as Phase 1’s L402 infrastructure is working. The bounty coordinator is a simpler application of the same payment flow — no gradient compression, no model checkpoints, just "submit a diff, validate against held-out eval, pay for improvements." This means the autoresearch product can ship months before multi-peer training is battle-tested.
INFRASTRUCTURE
Agent Collaboration: l402-hub
l402-train is the first project where agents build the protocol that pays them.
l402-hub is the development infrastructure for the project itself — a “GitHub for Agents.” Inspired by Karpathy’s AgentHub (git DAG + message board + per-agent identity), but adds what AgentHub lacks: validation before merge and payment for accepted work.
The key insight: the task format maps 1:1 to the bounty specification. Tasks = bounties. Validation = coordinator eval. Merge = hold invoice settlement. Using the bounty protocol to build itself provides direct feedback on the protocol design.
| l402-hub | Bounty Protocol |
|---|---|
| `hub task add` | `POST /bounties` |
| `hub task claim` | `GET /bounty/{id}` (L402-gated) |
| `hub task submit` | `POST /bounty/{id}/submit` |
| `hub validate` | Coordinator runs held-out eval |
| `hub merge` | Hold invoice settles |
| `hub reject` | Hold invoice cancels |
Any AI agent can participate: discover tasks, claim work in isolated git worktrees, submit contributions, pass deterministic validation, and merge to main. No accounts, no permissions — just verified contributions and sats. See l402-hub.ai or the agent collaboration research for the full architecture.
TRACK A: TRAINING
Phase 0: Local End-to-End Loop ✓ COMPLETE
Goal: Single-machine simulation running the complete protocol loop: local training → gradient compression → validation scoring → payment settlement. All on the MacBook with regtest Lightning.
Why this first: Before involving any networking, peers, or real money, prove the software architecture works end-to-end. Get a tight eval loop running fast.
Components
- `sparseloco.py` — SparseLoCo compression in MLX
  - Top-k sparsification (k=64 per chunk of 4096)
  - 2-bit quantization of selected values
  - Index encoding (uint16 chunk-local indices)
  - Error feedback buffer (decay=0.95)
  - Port from PyTorch reference (github.com/tplr-ai/SparseLoCo)
  - Measured: 56× compression ratio (uint16 indices + 2-bit codes + float16 scales per chunk). Lower than Covenant-72B’s 146× due to uint16 vs 12-bit packed indices at 0.5B scale.
- `data.py` — Dataset loading
  - Download TinyStories (`roneneldan/TinyStories`) and convert to JSONL
  - Split: 2.1M train rows, 1K–2K held-out for validation
- `validator.py` — Gauntlet-style loss scoring
  - Take compressed gradient, decompress, apply to model checkpoint
  - Measure loss on a held-out validation batch before and after
  - Output: quality score (loss delta) normalized against baseline
  - Score on 2–3 disjoint batches to detect validation-set overfitting
  - Note: Metal GPU non-determinism causes ~1e-5 variance in forward passes — use a 1e-4 tolerance threshold
- `economics.py` — Reward calculation
  - `reward = base_rate × quality_score × normalization_factor`
  - Maps validation score to sats payment amount
- Regtest Lightning — `docker-compose.yaml` with `bitcoin/bitcoin:28.1` + two `lightninglabs/lnd:v0.20.0-beta` nodes
  - Coordinator node + simulated peer node
  - Channel setup script: fund wallets, open channel (1M sats), mine confirmations
  - Test: issue hold invoice → pay → settle on validation pass / cancel on fail
- `lnd_client.py` — Python LND REST client
  - REST API via urllib (no compiled protos needed — simpler for Phase 0)
  - Hold invoice lifecycle: `AddHoldInvoice`, `SettleInvoice`, `CancelInvoice`
  - Use `SendPaymentV2` (not `payinvoice`, which hangs for hold invoices)
- `protocol_sim.py` — Single-machine protocol loop. Each round (`for round in range(N)`):
  1. Peer trains locally for K steps (MLX, K=10 to start)
  2. Peer compresses pseudo-gradient (sparseloco.py)
  3. Coordinator creates hold invoice (preimage kept secret)
  4. Peer pays hold invoice (funds locked)
  5. Coordinator validates (validator.py) → quality_score
  6. If quality_score > threshold: settle hold invoice (preimage revealed)
  7. Else: cancel hold invoice (funds return to peer immediately)
  8. Log: round, loss, quality_score, payment_settled, compression_ratio
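The round loop can be sketched in a few lines of Python. `FakeLN` and the helper names here are illustrative assumptions, not the project's actual API — the point is the conditional-settlement shape: funds lock on payment and only the validation outcome decides whether the preimage is revealed.

```python
import hashlib
import secrets

class FakeLN:
    """In-memory stand-in for the real LND client (illustrative only)."""
    def __init__(self):
        self.invoices = {}
    def add_hold_invoice(self, payment_hash, amount_sats):
        self.invoices[payment_hash] = "open"
    def pay(self, payment_hash):
        # Paying a hold invoice locks funds but does not settle (IN_FLIGHT).
        self.invoices[payment_hash] = "in_flight"
    def settle(self, preimage):
        # Revealing the preimage moves the locked funds to the coordinator.
        self.invoices[hashlib.sha256(preimage).digest()] = "settled"
    def cancel(self, payment_hash):
        # Cancelling returns the locked funds to the peer immediately.
        self.invoices[payment_hash] = "canceled"

def run_round(train, compress, validate, ln, base_rate=100, threshold=0.0):
    """One protocol round: train -> compress -> escrow -> validate -> settle/cancel."""
    grad = compress(train())                      # steps 1-2
    preimage = secrets.token_bytes(32)            # coordinator keeps this secret
    payment_hash = hashlib.sha256(preimage).digest()
    ln.add_hold_invoice(payment_hash, base_rate)  # step 3
    ln.pay(payment_hash)                          # step 4: peer's sats lock
    score = validate(grad)                        # step 5: held-out loss delta
    if score > threshold:
        ln.settle(preimage)                       # step 6
    else:
        ln.cancel(payment_hash)                   # step 7
    return score, ln.invoices[payment_hash]
```

A gradient with a positive held-out score settles; a noise gradient cancels and the peer's deposit returns.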
Economic Benchmarking
Phase 0 also establishes baseline economics. Measure actual performance and power draw against the break-even analysis:
- Training throughput (tok/s) on Apple Silicon (Phase 0 dev hardware); validate against Mac Mini M4 Pro targets (150–200 tok/s on 3B) in Phase 2
- Real power draw during sustained training
- Sats/hr break-even at measured power
- Validation compute overhead as % of training compute (target: <5%)
Validates
- SparseLoCo compression works on MLX (not just PyTorch/CUDA)
- Validation oracle produces meaningful quality scores
- Hold invoice conditional settlement works mechanically
- Real numbers for: compression ratio, validation compute cost, payment latency
- Economic viability: are the break-even numbers realistic?
Dependencies
- Docker (for regtest bitcoind + LND nodes)
- MLX + mlx-lm ≥0.31 (Python API for training)
- datasets (HuggingFace TinyStories)
Results
Full protocol loop runs end-to-end on Apple Silicon. 10 rounds of train → compress → validate → settle/cancel completed successfully.
| Metric | Target | Measured |
|---|---|---|
| Compression ratio | 73–100× | 56× (uint16 index overhead at 0.5B scale) |
| Round time | < 30s | 31s avg (K=10, batch=4, seq=512) |
| Acceptance rate | — | 8/10 rounds accepted |
| Training loss | Measurable decrease | 1.94 → 1.84 over 10 rounds |
| Validation scores | Positive for good gradients | 0.001–0.047 (diminishing returns) |
| Total sats earned | — | 141 sats / 10 rounds |
Key Findings
- Inner LR = 1e-5 (not 2e-4). AdamW without warmup on a pre-trained 0.5B model causes catastrophic loss explosion at higher rates. First-step Adam normalizes gradients, so effective step ≈ lr regardless of gradient magnitude.
- Validation must use outer_lr (1.0), not inner_lr. The pseudo-gradient is an accumulated K-step change; validating at inner_lr produces near-zero scores indistinguishable from noise.
- Per-round data shuffling is critical. Without it, every round trains on identical batches → overfitting after 2 rounds (2/10 accepted). With shuffling: 8/10 accepted.
- Validator correctly rejects overfitting. When compressed gradient worsens validation loss, score goes negative and the gradient is reverted. Core safety mechanism works as designed.
- 56× (not 73–100×) because uint16 indices add more overhead per chunk than Covenant-72B’s 12-bit packed encoding. Ratio improves at larger model scales.
- MLX parameter trees contain lists, not just dicts. All tree-walking code must handle both — an undocumented gotcha.
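The last gotcha is worth a concrete sketch. MLX ships its own tree utilities (`mlx.utils.tree_map`), but any hand-rolled walk over the parameter tree must handle both container types, because transformer layers live in a list of dicts — a dict-only walk silently skips the entire layer stack. A minimal standalone version:

```python
def tree_map(fn, tree):
    """Apply fn to every leaf of an MLX-style parameter tree.

    MLX parameter trees mix dicts AND lists (e.g. the layer stack is a
    list of per-layer dicts), so both branches are required.
    """
    if isinstance(tree, dict):
        return {k: tree_map(fn, v) for k, v in tree.items()}
    if isinstance(tree, list):
        return [tree_map(fn, v) for v in tree]
    return fn(tree)  # leaf (an mx.array in the real tree; plain floats here)
```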
Phase 1: L402-Gated HTTP Exchange ✓ COMPLETE
Goal: Split coordinator and peer into separate processes communicating over HTTP with L402 payment gating. Still on one machine, but real HTTP and real L402 flows.
Progress
- ✓ L402 middleware (`l402_train/l402.py`) — `L402Manager` (server), `L402Client` (peer), `SimpleMacaroon` (HMAC-SHA256), `require_l402()` FastAPI dependency. Standard invoice auth (preimage proof) and hold invoice auth (macaroon-only, LND status verification). 27 tests passing.
- ✓ LND client extensions — `add_invoice()` for standard invoices, `send_payment_stream()` for hold invoice payments that return on IN_FLIGHT (fixes the Phase 0 deadlock where `send_payment_v2` blocks until settlement).
- ✓ Coordinator service (`l402_train/coordinator.py`) — FastAPI app on port 8402. `GET /checkpoint` (standard L402), `PUT /gradient` (hold L402 + Gauntlet validation + settlement). `asyncio.Lock` with HTTP 409 for concurrent submissions. Safetensors checkpoints. Gradient deserialization via `BytesIO`. 9 tests passing (real MLX model + real compression).
- ✓ Peer client (`l402_train/peer.py`) — `PeerClient` class with status-first checkpoint sync, K-step training, SparseLoCo compression, L402 gradient submission. `compressed_to_bytes()` for BytesIO serialization. CLI: train/status/balance subcommands. 11 tests passing.
- ✓ Integration tests — 10 payment flow tests (standard + hold invoices, L402 standard + hold flows) + 6 E2E tests (status, reward schedule, checkpoint download with L402, gradient submission with hold invoice + validation). All verified against real regtest LND. Found and fixed 3 bugs: `lookup_invoice` base64 encoding, `send_payment_stream` hold vs standard timing, LND American spelling ("CANCELED").
Components
- `coordinator.py` — FastAPI service with L402 middleware
  - `PUT /gradient` — L402-gated gradient submission (peer pays submission fee)
  - `GET /checkpoint` — L402-gated checkpoint download
  - `GET /reward-schedule` — public endpoint showing current bounty rates
  - L402 verification is local: `sha256(preimage) == payment_hash` — no LND call during verification
  - Validation runs server-side after gradient upload
  - Hold invoice settled on validation pass, cancelled on fail (funds return immediately)
- `peer.py` — Client with native L402 payment handling
  - Training loop → compress → submit gradient → receive payment (or not)
  - Built-in `L402Client`: detects 402 → pays invoice via LND → retries with `Authorization: L402` header
  - Status-first checkpoint sync (only downloads if coordinator round advanced)
  - CLI: `train`, `status`, `balance` subcommands
- L402 implementation (complete)
  - Native FastAPI dependency injection — no Aperture proxy needed
  - Two auth modes: standard (preimage proof) for access fees, hold (macaroon-only + LND status check) for submission deposits
  - Pricing: ~100 sats submission fee for `PUT /gradient`, ~50 sats for `GET /checkpoint`
  - HMAC-SHA256 signed JSON macaroons with round + endpoint + expiry caveats
L402 Implementation Notes
- Native L402 in FastAPI — no Aperture reverse proxy. `require_l402()` dependency injection with HMAC-SHA256 macaroons. Simpler deployment, no extra process.
- Hold invoice auth — hold invoices don’t reveal the preimage to the payer until settlement. The peer proves payment with macaroon-only auth; the coordinator verifies via an LND `lookup_invoice()` status check. This was validated against real LND and is the key insight that three independent architecture reviews missed.
- lightning-mcp-server provides 18 read-only monitoring tools (check balance, list channels, query invoices) — useful for coordinator observability.
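The standard-mode check boils down to two steps: verify the macaroon's HMAC, then check `sha256(preimage) == payment_hash`. A simplified sketch of the `SimpleMacaroon` idea — the key name, caveat shape, and encoding here are assumptions, not the project's exact wire format:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"coordinator-macaroon-key"  # assumption: server-side signing key

def mint_macaroon(caveats: dict) -> str:
    """HMAC-SHA256 signed JSON macaroon (simplified SimpleMacaroon shape)."""
    body = json.dumps(caveats, sort_keys=True).encode()
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    token = json.dumps({"caveats": caveats, "sig": sig}).encode()
    return base64.b64encode(token).decode()

def verify_l402(macaroon_b64: str, preimage_hex: str, payment_hash_hex: str) -> bool:
    """Standard-mode auth: valid macaroon signature AND sha256(preimage) == payment_hash.

    Both checks are local -- no LND call needed during verification.
    """
    mac = json.loads(base64.b64decode(macaroon_b64))
    body = json.dumps(mac["caveats"], sort_keys=True).encode()
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac["sig"], expected):
        return False
    preimage = bytes.fromhex(preimage_hex)
    return hashlib.sha256(preimage).hexdigest() == payment_hash_hex
```

Hold-mode auth skips the preimage check (the payer never has it) and substitutes the LND `lookup_invoice()` status query described above.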
Validates
- L402 works for gradient exchange (the core protocol interaction)
- Payment latency is acceptable within the training round window (~30s for 0.5B, ~70s target at scale)
- L402 middleware handles the full payment flow end-to-end
Phase 2: Two-Machine Proof of Concept
Goal: Run the protocol across two separate machines over the real internet with real (small) Lightning payments.
Components
- Coordinator on Hetzner VPS
- Deploy coordinator service (FastAPI + native L402) + LND (Neutrino light client)
- Channel capacity: minimal for testing (100K–1M sats, ~$100–$1000)
- Primary test peer: Mac Mini M4 Pro 24 GB
- MLX training, LND light client, direct payment channel to coordinator
- The sweet spot hardware: $799, 30–50W, 150–200 tok/s on 3B model
- Real Lightning payments: submit gradients, receive rewards
- Stretch: RTX 4090 peer (CUDA path)
- PyTorch + CUDA training, validates cross-framework gradient exchange
- 500+ tok/s on 3B, 450W — tests the power/performance tradeoff
- Testnet → Mainnet
- Start on Bitcoin testnet (free, no real money)
- Move to mainnet when stable (budget: ~$100–500)
Economic Validation
- Measure real sats earned per hour per hardware tier
- Compare to Vast.ai market rates (RTX 4090 hosts earn 158–243 sats/hr equivalent)
- Calculate coordinator cost per peer per day at target payment rates
- Answer: "At 200 sats/hr per peer, 100 peers = $14/hr. Is this sustainable for the training value produced?"
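The arithmetic behind that question is a one-liner worth making explicit (using the plan's working assumption of BTC at $70,000):

```python
BTC_USD = 70_000          # assumption used throughout this plan
SATS_PER_BTC = 100_000_000

def coordinator_cost_usd_per_hr(sats_per_peer_hr: int, peers: int) -> float:
    """Total USD/hr the coordinator pays out across all peers."""
    return sats_per_peer_hr * peers / SATS_PER_BTC * BTC_USD

# 200 sats/hr x 100 peers = 20,000 sats/hr = $14/hr at $70k BTC
```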
Validates
- Protocol works over real internet
- Real Lightning payment latency over real network hops
- Gradient upload/download times at realistic bandwidth
- Channel management and rebalancing with real channels
- Economics: are actual sats/hr in the "worth my time" range (300+ sats/hr)?
Deliverable: conference demo
Phase 3: Multi-Peer Simulation + Byzantine Testing
Goal: Simulate 3–5 peers submitting varying quality gradients + 1 real peer on MacBook. Test incentive mechanics and Byzantine resistance.
Verification of untrusted computation is the hardest unsolved problem in decentralized training. Gensyn's Verde (probabilistic proof-of-learning) has been in development since 2022 and remains in testnet. Prime Intellect's TOPLOC works but is narrow (RL rollouts only). l402-train's approach — deterministic loss scoring on held-out data — is simpler and immediately testable, but must prove it catches real attack vectors.
Simulated Peer Profiles
- Honest peer — real gradients from actual training
- Free-rider — random/noise gradients (zero compute)
- Plagiarist — copies another peer's gradient
- Poisoner — adversarial gradients designed to degrade model
- Mediocre — real gradients from undertrained model (low quality but honest)
- Stale — submits gradients computed on an outdated checkpoint (desync attack from Gauntlet analysis)
Test Questions
- Does quality-proportional payment correctly reward good and reject bad?
- Do submission fees effectively prevent spam?
- Does validation catch free-riders and poisoning?
- What is the validation compute overhead relative to training?
Deliverable: technical paper with empirical results — real Lightning payments + real gradient validation + Byzantine resistance is novel. Nobody has demonstrated this.
TRACK B: AUTORESEARCH BOUNTIES
Phase B0: Bounty Runner Framework
Goal: Build the bounty coordinator as a second mode of the existing coordinator service. Same L402 infrastructure, different task type.
Components
- `bounty_coordinator.py` — FastAPI endpoints with same L402 middleware
  - `GET /bounties` — public listing of active bounties
  - `GET /bounty/{id}` — L402-gated baseline download (code + public eval set)
  - `POST /bounty/{id}/submit` — submit improvement (diff + claimed score)
  - Validation: apply diff to baseline, run eval on held-out set, score improvement
  - Hold invoice created at submission, settled proportional to improvement
- `bounty_agent.py` — Reference agent client
  - Downloads bounty baseline via L402
  - Runs autoresearch loop locally (Karpathy pattern: edit → eval → keep/discard)
  - Submits improvements to coordinator
  - Works with any coding agent backend (Claude Code, Codex, local models)
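The coordinator-side validation step is small enough to sketch whole. The proportional payout rule (`sats_per_point`) and helper callables here are illustrative assumptions — the real schedule would be set per bounty:

```python
def validate_submission(baseline_score, apply_diff, run_heldout_eval,
                        sats_per_point=1_000, min_improvement=0.0):
    """Bounty-mode validation: score the diff on the held-out eval,
    then settle the hold invoice proportional to the improvement.

    apply_diff / run_heldout_eval are hypothetical hooks: apply the diff
    to a clean baseline copy, then evaluate on the set agents never see.
    """
    candidate = apply_diff()                 # patched baseline
    new_score = run_heldout_eval(candidate)  # held-out, not the public eval set
    improvement = new_score - baseline_score
    if improvement <= min_improvement:
        # Hold invoice cancels; the agent's deposit returns.
        return {"settle": False, "payout_sats": 0, "improvement": improvement}
    # Hold invoice settles; payout scales with measured improvement.
    return {"settle": True,
            "payout_sats": round(improvement * sats_per_point),
            "improvement": improvement}
```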
Why This Is Simpler Than Training
- No gradient compression (SparseLoCo not needed — submissions are code diffs)
- No model checkpoints (coordinator stores eval framework, not multi-GB models)
- No synchronization (agents work independently, submit whenever ready)
- Validation is running an eval script, not forward-pass loss computation
- Same hold invoice escrow, same L402 gating, same coordinator architecture
Validates
- L402 payment flow works for bounty submissions (not just gradient exchange)
- Held-out validation catches naive metric gaming
- Hold invoice economics make sense for bounty-scale payments (500–50,000 sats)
Phase B1: First Live Bounties
Goal: Post real bounties with real sats, have real agents compete. Prove the two-sided market works.
First Bounties
- Prompt optimization — improve a classification system prompt against a labeled eval corpus. Clear metric (accuracy), fast eval (<30s), bounty: 50,000–100,000 sats
- Regex pattern improvement — improve detection patterns against a test corpus. Composite metric (detection_rate × 0.7 + (1 - false_positive_rate) × 0.3), bounty: 25,000–50,000 sats
- Open bounty — any target with a quantifiable metric and fast eval (<5 minutes). Posted publicly to attract external agents
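The regex bounty's composite metric is simple but worth pinning down, since it is what the held-out eval actually optimizes:

```python
def regex_bounty_score(detection_rate: float, false_positive_rate: float) -> float:
    """Composite metric from the regex bounty spec:
    detection weighted 0.7, false-positive avoidance weighted 0.3."""
    return detection_rate * 0.7 + (1 - false_positive_rate) * 0.3
```

A pattern catching 90% of true positives with a 10% false-positive rate scores 0.9; the weighting means a point of detection is worth over twice a point of precision.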
Anti-Gaming Validation
- 80/20 public/held-out eval split with commit-reveal on held-out set hash
- Canary probes in public eval set (known-answer inputs that differ in held-out)
- Temporal stability: 20% holdback released after 48-hour re-evaluation
- Diff size limits to prevent wholesale file replacement
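The commit-reveal step is the piece that makes held-out judging auditable: the coordinator publishes a hash of the held-out set when the bounty opens, and reveals the set (plus salt) only after judging, so agents can verify it wasn't swapped mid-bounty. A minimal sketch under assumed JSON canonicalization:

```python
import hashlib
import json

def commit_heldout(heldout_rows: list, salt: str) -> str:
    """Commit: publish sha256(salt || canonical held-out set) at bounty open."""
    payload = (salt + json.dumps(heldout_rows, sort_keys=True)).encode()
    return hashlib.sha256(payload).hexdigest()

def verify_reveal(commitment: str, heldout_rows: list, salt: str) -> bool:
    """Reveal: after judging, publish rows + salt; anyone can recompute the hash."""
    return commit_heldout(heldout_rows, salt) == commitment
```

The salt prevents agents from brute-forcing small held-out sets against the published hash before the reveal.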
Validates
- Real agents can discover and compete for bounties
- Anti-gaming measures catch metric hacking in practice
- Bounty economics: are improvements worth the sats paid?
- Agent diversity: do different agents find different improvements?
Deliverable: working bounty marketplace with real payments — standalone product, no GPU required.
Phase B2: Multi-Sponsor Marketplace
Goal: Open the bounty coordinator for external sponsors to post their own bounties. Two-sided marketplace: sponsors post bounties, agents compete.
Components
- Sponsor onboarding
- Sponsor deposits bounty pool via Lightning (held in coordinator channel)
- Uploads target files, eval script, public eval dataset
- Coordinator generates held-out eval set or accepts sponsor-provided held-out hash
- Public bounty board
- Browse active bounties with: description, metric, bounty amount, deadline, current best score
- Leaderboard per bounty (anonymized agent IDs + scores)
- Historical data: completed bounties, total sats paid, average improvements
- Coordinator economics
- 5–10% fee on bounty payouts (covers validation compute + infrastructure)
- L402 access fees on baseline downloads (covers bandwidth)
- Self-sustaining business model independent of training revenue
Deliverable: open-source bounty marketplace — the "SETI@home for software optimization" that Karpathy envisioned, coordinated by Lightning.
Target Hardware
Training hardware requirements are based on the consumer hardware guide and economics analysis. Autoresearch bounties have no minimum hardware — any computer that can run a coding agent (Claude Code, Codex, or a local model) can compete.
| Tier | Hardware | Model Range | tok/s (3B) | Power | Break-even* |
|---|---|---|---|---|---|
| Entry | MacBook Air M3 16 GB | 0.5B–1B | 40–60 | 20 W | 5 sats/hr |
| Sweet spot | Mac Mini M4 Pro 24 GB | 0.5B–7B | 150–200 | 40 W | 9 sats/hr |
| Workhorse | Mac Studio M2 Ultra 192 GB | 0.5B–30B | ~475 | 90 W | 21 sats/hr |
| Power | RTX 4090 system (24 GB) | 0.5B–13B | 500–628 | 450 W | 103 sats/hr |
| Not viable | Raspberry Pi, AMD RX 580 and older, 8 GB machines | — | — | — | — |
*Electricity-only break-even at US average $0.16/kWh, BTC = $70,000
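The break-even column follows directly from the stated assumptions ($0.16/kWh, BTC = $70,000) and reproduces the table's values:

```python
BTC_USD = 70_000
USD_PER_KWH = 0.16
SATS_PER_BTC = 100_000_000

def breakeven_sats_per_hr(watts: float) -> float:
    """Electricity-only break-even: sats/hr needed to cover power draw."""
    usd_per_hr = watts / 1000 * USD_PER_KWH
    return usd_per_hr / BTC_USD * SATS_PER_BTC

# Mac Mini M4 Pro at 40 W -> ~9 sats/hr; RTX 4090 system at 450 W -> ~103 sats/hr
```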
Competitive Landscape
Based on the landscape survey of 12 projects:
What exists: Only Prime Intellect (INTELLECT-1/2/3) and Together AI (GPT-JT, before pivoting) have trained competitive models via decentralized infrastructure. Bittensor is an inference marketplace with empirically demonstrated stake-weighted rewards. Gensyn has been in testnet for 3+ years. Every project except Hivemind requires a custom token.
Where l402-train fits: The only protocol using Bitcoin Lightning for payment coordination. No token, no staking, quality-proportional rewards via hold invoices. The tradeoff is starting with a single coordinator and small models (0.5B–3B), which is the honest scope for a research prototype. See the L402 ecosystem survey for how the protocol extends L402 bidirectionally.
What to Skip for Prototype
| Whitepaper Feature | Skip? | Why |
|---|---|---|
| DLC-bound settlement | Yes | Hold invoices sufficient for PoC |
| Federated multi-validator | Yes | Single coordinator fine; deterministic replay is what matters |
| 72B scale | Yes | 0.5B–3B on MLX. Proving the mechanism, not training a model |
| Heterogeneous SparseLoCo | Yes | Single-tier peers only |
| USDT (Taproot Assets) | Yes | Sats-only for prototype |
Key Risks
- SparseLoCo on MLX — Resolved. Ported successfully from PyTorch reference. Key adaptation: numpy for scatter (MLX lacks in-place scatter), `mx.eval()` after accumulator mutations.
- Aperture custom validation — Resolved. Native FastAPI L402 middleware handles validation-before-settlement directly. No Aperture needed.
- LND on VPS — 4 GB RAM may be tight. May need a larger instance or run LND on local hardware instead
- MLX scale gap — 0.5B proof of concept is fine, but gap to publishable 7B+ results requires renting GPU time
Deliverables Summary
| Phase | Track | Deliverable | Publishable? |
|---|---|---|---|
| 0 | Training | Single-machine simulation with economics data | Complete — 8/10 acceptance, 56× compression, 31s rounds |
| 1 | Training | L402-gated gradient exchange | Complete — 115 tests, verified against regtest LND |
| B0 | Autoresearch | Bounty runner framework | Blog post / tweet thread |
| 2 | Training | Two-machine PoC over real internet | Conference demo |
| B1 | Autoresearch | First live bounties with real sats | Open-source product launch |
| 3 | Training | Multi-peer + Byzantine resistance | Technical paper with empirical results |
| B2 | Autoresearch | Multi-sponsor bounty marketplace | Standalone product |
| Hub | Infrastructure | Agent collaboration tool (l402-hub) | Complete — deployed to l402-hub.ai |