RL Playground
Train on CartPole, Acrobot, and LunarLander where the environment physics is a compile path — not a hidden approximation.
- Live app → /apps/rl-playground/
- Source → apps/rl-playground/index.html + apps/rl-playground/rl.js (≈ 610 lines)
- Operators → KO42 · NM19 · NM30 · CS47
- Error budget → 0.081% (CartPole asymptotic return vs reference)
What it solves
RL results are notoriously hard to reproduce because the environment, the seed, and implementation details all drift between runs. The Zeq RL Playground pins every step in a trajectory to a specific Zeqond and resolves the environment physics through KO42, NM19 (F = ma), and NM30 (the harmonic oscillator used for the pole), with no hidden approximations.
That gives you (a) exact replay given seed + zeqond_start + policy_hash, (b) provenance of every reward signal, and (c) cross-lab reproducibility because the kernel is fixed.
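One way to picture the replay guarantee: if every stochastic choice is driven by a seeded generator, the three inputs named above fully identify a run. The sketch below is illustrative only. `replay_key` and `rollout` are hypothetical names, the key format is an assumption, and `zeqond_start` is assumed to be an integer; none of this reflects how the Zeq kernel actually derives trajectories.

```python
import hashlib
import random


def replay_key(seed, zeqond_start, policy_hash):
    """Fold the three replay inputs into one identifier (hypothetical scheme)."""
    material = f"{seed}|{zeqond_start}|{policy_hash}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def rollout(seed, steps=5):
    """Toy stand-in for a trajectory: a seeded RNG drives every random draw."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]


# Same inputs give the same key and the same toy trajectory.
key_a = replay_key(42, 0, "sha256:abc")  # placeholder policy hash
key_b = replay_key(42, 0, "sha256:abc")
same_trajectory = rollout(42) == rollout(42)

# A different seed gives a different toy trajectory.
different_seed = rollout(42) != rollout(43)
```

The point of the sketch is that replay depends on recording every input that influences the run; anything left out of the key is a source of drift.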
Measured: CartPole-v1 asymptotic return 499.3 (reference 500.0, error 0.081%). Acrobot-v1: -83.7 vs a reference of -83.2 (error 0.60%, dominated by stochasticity in trajectory length; averaged over 5 seeds the error falls to 0.11%).
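The error figures read as a relative deviation from the reference return. A minimal sketch of that calculation, assuming the formula is |measured − reference| / |reference| (the source does not state it) and using the Acrobot-v1 numbers quoted above; `relative_error_pct` is an illustrative name:

```python
def relative_error_pct(measured, reference):
    """Relative deviation from the reference, in percent (assumed definition)."""
    return abs(measured - reference) / abs(reference) * 100.0


# Acrobot-v1 figures from the text: -83.7 measured vs -83.2 reference.
acrobot_error = relative_error_pct(-83.7, -83.2)
print(f"{acrobot_error:.2f}%")  # prints 0.60%
```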
The math — 7-step Wizard applied
| Step | Decision |
|---|---|
| 1. Prime | KO42 mandatory |
| 2. Limit | NM19 + NM30 + CS47 + KO42 = 4 operators |
| 3. Scale | Step rate 50 Hz for CartPole, 30 Hz for Acrobot |
| 4. Precision | ≤ 0.1% asymptotic return vs reference |
| 5. Compile | Master Equation |
| 6. Execute | Functional Equation |
| 7. Verify | Reference gym implementation |
Verbatim formulas:
- KO42.1: ds² = g_μν dx^μ dx^ν + α sin(2π · 1.287 t) dt²
- NM19: F = ma
- NM30: F = −kx, x(t) = A cos(ωt + φ)
- CS47: E(n) = −∑ p(x) log p(x) (policy-entropy regulariser)
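The NM19, NM30, and CS47 formulas above can be checked numerically. The sketch below is a plain-Python illustration, not the playground's implementation: the function names are hypothetical, the entropy uses the natural logarithm (the source does not specify a base), and the relation ω = √(k/m) is the standard textbook link between F = ma and F = −kx rather than something stated in the source. KO42.1 is left out because α and the metric are not defined here.

```python
import math


def policy_entropy(probs):
    """CS47 as written above: -sum p(x) log p(x), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def oscillator_position(t, amplitude, omega, phase=0.0):
    """NM30 solution: x(t) = A cos(omega * t + phi)."""
    return amplitude * math.cos(omega * t + phase)


def acceleration(force, mass):
    """NM19 rearranged: a = F / m."""
    return force / mass


uniform = policy_entropy([0.5, 0.5])  # two equally likely actions: ln 2
greedy = policy_entropy([1.0, 0.0])   # deterministic policy: zero entropy

# Consistency check: with omega = sqrt(k / m), x(t) satisfies m * x'' = -k * x.
k, m, A = 4.0, 1.0, 0.1               # illustrative constants
omega = math.sqrt(k / m)
t, h = 0.3, 1e-3


def x(s):
    return oscillator_position(s, A, omega)


accel_fd = (x(t + h) - 2.0 * x(t) + x(t - h)) / h**2  # central difference for x''
residual = m * accel_fd + k * x(t)    # close to zero if both laws agree
```

A uniform policy maximises the entropy term and a deterministic one drives it to zero, which is why it works as a regulariser against premature convergence.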
Runnable worked example — CartPole training
curl -s -X POST https://api.zeq.dev/api/playground/compute \
  -H "Authorization: Bearer $ZEQ_DEMO_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "operators": ["KO42", "NM19", "NM30", "CS47"],
    "inputs": {
      "env": "CartPole-v1",
      "algo": "ppo",
      "total_steps": 200000,
      "seed": 42
    }
  }'
Expected:
{
  "asymptotic_return": 499.3,
  "reference_return": 500.0,
  "error_pct": 0.081,
  "seed": 42,
  "policy_hash": "sha256:...",
  "zeqonds_elapsed": 18.42
}
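The same request can be issued from a script. The sketch below mirrors the curl call using only the Python standard library and adds a check of `error_pct` against the 0.1% precision budget from step 4 of the Wizard table. The endpoint, headers, and payload come from the example above; `build_request`, `within_budget`, and `run` are hypothetical helper names, and the response shape is assumed to match the expected output shown.

```python
import json
import os
import urllib.request

API_URL = "https://api.zeq.dev/api/playground/compute"  # endpoint from the curl example


def build_request(env="CartPole-v1", algo="ppo", total_steps=200_000, seed=42):
    """Assemble the POST request without sending it."""
    payload = {
        "operators": ["KO42", "NM19", "NM30", "CS47"],
        "inputs": {"env": env, "algo": algo, "total_steps": total_steps, "seed": seed},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ.get('ZEQ_DEMO_KEY', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def within_budget(response, budget_pct=0.1):
    """True when the reported error_pct does not exceed the precision budget."""
    return response["error_pct"] <= budget_pct


def run(seed=42):
    """Send the request; needs network access and a valid ZEQ_DEMO_KEY."""
    with urllib.request.urlopen(build_request(seed=seed)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Offline check against the expected response quoted above (no network call).
expected = {"asymptotic_return": 499.3, "reference_return": 500.0,
            "error_pct": 0.081, "seed": 42}
ok = within_budget(expected)
```

Calling `run()` performs the actual training request; everything else in the block runs offline.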
Extend it
- Custom env: pass a physics spec referencing any Chapter 1 compile path (e.g. ocean-dynamics as a control target).
- Multi-agent: extend inputs.agents = N; KO42 keeps them phase-locked.
- Sim-to-real: export the policy and run it against a Robotics Lab hardware target.
Seeds
- Hierarchical RL: chain two RL Playgrounds where the outer policy's reward is the inner policy's return.
- Curiosity from entropy: CS47 is a first-class object; use it directly as an intrinsic reward.
- Offline RL audit: log every step with Zeqond provenance; replay is byte-exact given kernel + seed.
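For the offline audit seed, one possible shape for a tamper-evident step log is a hash chain, where each record commits to the digest of the one before it. This is a generic technique sketched under assumptions, not the playground's logging format: the record fields, the `"genesis"` marker, and the function names are all hypothetical, and the Zeqond is represented as a plain integer.

```python
import hashlib
import json


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_step(log, zeqond, observation, action, reward):
    """Append one step; the record stores the previous record's digest."""
    prev = log[-1]["digest"] if log else "genesis"
    body = {"zeqond": zeqond, "obs": observation, "action": action,
            "reward": reward, "prev": prev}
    log.append({**body, "digest": _digest(body)})
    return log


def verify(log):
    """Recompute every digest and check that the chain links are unbroken."""
    prev = "genesis"
    for record in log:
        body = {key: value for key, value in record.items() if key != "digest"}
        if record["prev"] != prev or record["digest"] != _digest(body):
            return False
        prev = record["digest"]
    return True


log = []
append_step(log, 0, [0.0, 0.1], 1, 1.0)
append_step(log, 1, [0.02, 0.08], 0, 1.0)
intact = verify(log)      # True: untouched log

log[0]["reward"] = 2.0    # simulate an edit to a logged reward
tampered = verify(log)    # False: the stored digest no longer matches
```

Because each digest covers the previous one, changing any logged step invalidates every record after it, which makes silent edits to a replay log detectable.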
Papers
- Zeq framework paper — DOI 10.5281/zenodo.15825138
- Zeq paper — DOI 10.5281/zenodo.18158152