---
title: "BOM radar rollout — Sydney and Brisbane on cremonde"
date-created: "2026-06-09"
type: "memoir"
status: "complete"
author:
  - "st33v"
  - "claude-opus-4-7"
model:
  - "claude-opus-4-7"
tldr: "Took the bomsynoptic side-spec for IDR713 from a draft to a live, looping APNG on radar.pestrel.com in one session — and added Brisbane (IDR663) at the end. Most of the work was wiring, not coding; the spec turned out to be the easy part."
chat-url: "https://claude.ai/chat/..."
session-kind: "infrastructure"
side-quests: 1
reader-targets:
  - "st33v"
  - "claude-code"
  - "scrivener"
related:
  entities:
    - "cremonde"
    - "pestrel.com"
  concepts:
    - "static layers + dynamic layers"
    - "lower plate / upper plate"
  songs: []
tags:
  - "bom"
  - "radar"
  - "deployment"
  - "ffmpeg"
  - "systemd"
provenance:
  - "doc/bom-radar-spec.md (v0.1, drafted same day)"
---

# BOM radar rollout — Sydney and Brisbane on cremonde

## TL;DR

A spec drafted earlier on 2026-06-09 (`doc/bom-radar-spec.md`) called for adding a 6-minute rain-radar loop to the existing `bomsynoptic` deployment, beginning with Sydney (IDR713) and parameterising on product code so other radars could follow. By end of session both Sydney and Brisbane were live at `radar.pestrel.com/sydney/` and `/brisbane/`, served as proper animated APNGs from a 6-minute systemd timer. The spec was correct in every architectural choice; the friction was entirely in the plumbing — git remote shuffling, file ownership, ImageMagick lying about APNG support, and a palette-mismatch in ffmpeg's apng encoder.

## Context

`bomsynoptic` was already deployed on cremonde: a shell script on a 6-**hourly** systemd timer fetching the BOM MSLP synoptic PDF, rasterising it to PNG, and serving the latest as a single `<img>` on `pestrel.com`. Spec proposed a second product class — rain radar — on a 6-**minute** cadence, layered (transparencies + echo frames), with a rolling buffer of the last six frames as a loopable animation.

The spec opened with four §12 questions to the implementing agent. Those were worked through against the existing code first, then implementation proceeded.

## Spec triage — the §12 answers

In order:

1. **Scheduler reuse:** no shared in-code scheduler exists — each product is its own systemd timer. The radar pattern mirrors synoptic's: `radar.{service,timer}` + `radar-retry.{service,timer}` with `OnFailure` chaining, just at `OnUnitActiveSec=6min` instead of `OnCalendar=*:10:00`.
2. **Asset convention:** existing pattern is `/srv/www/pestrel/synopticLatest.png` served by nginx from `/srv/www/pestrel`. Decided radar gets its own subdomain (`radar.pestrel.com`), own web root (`/srv/www/radar/`), own working dir (`/var/lib/radar/<id>/`), own `/opt/radar/` for scripts. Mirroring the synoptic shape, but isolated.
3. **Pillow vs ImageMagick:** existing stack is pure shell + curl + magick + gs. Staying in shell+IM keeps the footprint identical. Recommended this; st33v confirmed.
4. **Loop format:** existing front-end is one `<img>` tag, no JS. APNG drops in unchanged; sequence+manifest would have meant introducing JS. Picked APNG.

The §10 copyright concern (BOM FTP is personal-use, not commercial-republish) was flagged and dismissed: *"No one is looking at the site right now."* Attribution to BOM is implicit by content; no formal gating.

## Implementation — first pass

Wrote in roughly this order:

- `radarFetch.sh`: parameterised on `RADAR_ID` (defaulting `IDR713`); refresh transparencies on 24h TTL with `.last_refreshed` marker; build two cached plates (lower = background + topography + optional feature overlays; upper = range + locations); fetch top-N echo frames by lexical sort; composite each as `lower → echo → upper → legend@SE`; evict frames outside the rolling buffer by set membership.
- Four systemd units (`radar.service`, `radar.timer`, `radar-retry.service`, `radar-retry.timer`).
- `nginx/radar.pestrel.com.conf` with `Cache-Control: no-cache` on `.apng`.
- A black-page `radar.index.html` with the single `<img>`.
- Extended `setup.sh` to provision radar dirs/units/scripts/index. Extended `deploy.sh` to take a `synoptic|radar` argument.
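
The rolling-buffer logic in the first bullet can be sketched in portable shell. This is a minimal sketch, not the deployed `radarFetch.sh`: the function names `latest_frames` and `evict_stale` are hypothetical, and it assumes frame filenames embed a text-sortable timestamp (something like `IDR713.T.202606090312.png`) and contain no spaces.

```shell
# Hypothetical helpers; names and filename format are assumptions, not taken
# from the deployed radarFetch.sh.

# latest_frames N: read frame names on stdin, print the N lexically greatest,
# oldest first. Works only because the embedded timestamp sorts as text.
latest_frames() {
  LC_ALL=C sort -u | tail -n "$1"
}

# evict_stale DIR KEEP...: delete every *.png in DIR whose basename is not one
# of the KEEP names. Eviction is by set membership, not by file age.
evict_stale() (
  dir=$1; shift
  keep=" $* "                      # space-delimited set of names to keep
  for path in "$dir"/*.png; do
    [ -e "$path" ] || continue     # glob matched nothing
    name=${path##*/}
    case $keep in
      *" $name "*) ;;              # still in the buffer: keep it
      *) rm -f -- "$path" ;;       # fell out of the buffer: evict it
    esac
  done
)
```

A call might look like `keep=$(ls "$dir" | latest_frames 6); evict_stale "$dir" $keep`, with `$keep` left unquoted on purpose so each name becomes its own argument. Evicting by membership rather than by mtime means a gap in the upstream feed does not delete frames that are still the newest available.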

DNS: st33v added an `A` record for `radar` → 139.162.32.70 (cremonde) alongside the apex and `www`.

## Git remote shuffle

Existing remote was GitHub (`f3rr3t/bomSynoptic`). User wanted cremonde as origin. First attempt set the URL to `cremonde:/home/git/bomSynoptic.git`, but cremonde's ssh alias logs in as `st33v`, and the bare repo is owned by user `git`. Hit two consecutive failures:

1. **"dubious ownership"** — git's anti-tampering check. Added `safe.directory` exception.
2. **"unable to create temporary object directory"** — st33v can't write into git-owned dirs. Real fix: change the URL to `git@cremonde:/home/git/bomSynoptic.git`, matching how the other ~30 bare repos in `/home/git/*.git` are reached. *"You can — I just set the remote wrong."* Lesson logged: "use ssh cremonde" can mean two things, host alias and user; check the existing convention rather than guess.

Reverted the safe.directory entry as no longer needed.
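
The difference between the two URL forms is whether the login user is spelled out. As a small illustration, the hypothetical helper below (`remote_user` is not part of the repo) reports the explicit user in an scp-style remote, or nothing when the user is left to the ssh alias. It only handles scp-style `user@host:path` URLs, not `ssh://` ones.

```shell
# remote_user URL: print the explicit login user of an scp-style git remote,
# or print nothing when the URL leaves the user to ssh config.
# Hypothetical helper for illustration only.
remote_user() {
  case $1 in
    *@*:*) printf '%s\n' "${1%%@*}" ;;   # user@host:path -> user
    *)     : ;;                          # host:path -> user decided by ssh config
  esac
}

# The fix described above, for reference (not executed here):
#   git remote set-url origin git@cremonde:/home/git/bomSynoptic.git
#   git remote get-url origin
```

An empty result for `cremonde:/home/git/bomSynoptic.git` is the warning sign: the push will run as whatever user the alias configures.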

## Deployment — three real bugs

The deploy was bumpy and instructive. In order encountered:

### Bug 1 — empty deployment

After provisioning everything, ran setup.sh on cremonde and was told *"setup.sh did not install the radar units."* The clone on cremonde was at `c132b39` — predating all the radar work, because the radar files had never been committed or pushed. Stupid but easy fix: stage, commit (`f058e83`), push, pull on cremonde, re-run setup.

### Bug 2 — ImageMagick silently writing single-frame PNGs

First successful service run produced `/srv/www/radar/idr713-loop.apng` at 35,106 bytes — identical to `frame.00.png`. `file` confirmed: plain PNG, not animated. ImageMagick on Arch is built against an upstream libpng that lacks the APNG patch — so `magick -delay 50 -loop 0 frame*.png out.apng` happily produces a single-frame PNG with no warning. Switched to `ffmpeg`'s `concat` demuxer + `apng` muxer (also unlocking variable per-frame durations cleanly via the demuxer text format). Added `ffmpeg` to setup.sh's pacman line. st33v's reaction was just "but ffmpeg is useful" — fair, took the install.
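
Because the failure is silent, a post-build check is cheap insurance. The sketch below is a heuristic, not a PNG parser: an animated PNG carries an `acTL` chunk, so it simply looks for those four bytes. The name `is_apng` is hypothetical, and a stray `acTL` byte sequence inside compressed image data could in principle produce a false positive.

```shell
# is_apng FILE: succeed if FILE contains an acTL (animation control) chunk
# tag, which a plain single-frame PNG does not have.
# Heuristic only: it greps for the tag rather than walking the chunk list.
is_apng() {
  LC_ALL=C grep -q -a 'acTL' -- "$1"
}
```

Used as `is_apng "$TMP_APNG" || exit 1` before publishing (variable name assumed), this would have turned Bug 2 into a loud failure on the first run.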

### Bug 3 — ffmpeg palette mismatch

ffmpeg succeeded silently in test but exited 255 in production with:

> `Input contains more than one unique palette. APNG does not support multiple palettes.`

The composited frames were 8-bit palettized PNGs, and each frame's palette differed slightly. ffmpeg's apng encoder won't accept that. Single-flag fix: `-vf format=rgba` in the ffmpeg invocation, which converts every frame to RGBA before encoding. The file went from a stuck 35 KB single-frame PNG to a proper 91 KB 7-frame RGBA APNG (six frames + the trailing duplicate the concat demuxer requires).
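
For reference, the concat list and the encoder call can be sketched as below. This is an approximation under stated assumptions, not the deployed code: `write_concat_list` and `build_apng` are hypothetical names, every frame gets the same duration, and frame paths are assumed to contain no single quotes. The ffmpeg flags shown are the concat demuxer (`-f concat -safe 0`), the RGBA conversion, and the apng muxer's `-plays 0` for an infinite loop.

```shell
# write_concat_list SECONDS FRAME...: print a concat-demuxer list in which
# each frame is shown for SECONDS, then repeat the last frame without a
# duration line so the final duration is honoured.
write_concat_list() (
  dur=$1; shift
  [ "$#" -gt 0 ] || exit 1
  for f in "$@"; do
    printf "file '%s'\nduration %s\n" "$f" "$dur"
    last=$f
  done
  printf "file '%s'\n" "$last"
)

# build_apng LIST OUT: encode the listed frames as a looping RGBA APNG.
# Defined but not called here, since it needs ffmpeg and real frames.
build_apng() {
  ffmpeg -y -loglevel error -f concat -safe 0 -i "$1" \
    -vf format=rgba -plays 0 -f apng "$2"
}
```

Relative paths in the list are resolved against the list file's directory, so writing the list next to the frames keeps the entries short.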

Important meta-observation: between bug 2 and bug 3, the user noticed *"seems stuck on the first image"*. Without that, the failed `systemctl` retries would have just kept hammering the BOM FTP every two minutes and the published file would have remained stuck at 13:05. The error was in the journal but nobody was watching the journal.
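
One way to make that kind of stall visible without reading the journal is a freshness check on the published file. Nothing like this exists in the repo; `is_stale` and the 18-minute threshold (three missed polls) are assumptions for illustration, and `-mmin` relies on GNU or BSD `find`.

```shell
# is_stale FILE MAX_MIN: succeed when FILE is missing or was last modified
# more than MAX_MIN minutes ago. Hypothetical helper, not part of the repo.
is_stale() {
  [ -e "$1" ] || return 0
  [ -n "$(find "$1" -mmin +"$2")" ]
}

# Example use (path taken from the text above):
#   is_stale /srv/www/radar/idr713-loop.apng 18 && echo "radar loop is stale"
```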

## Brisbane

Once Sydney was confirmed working, st33v provided the Brisbane code (IDR663, Mt Stapylton) and asked the design question: select between, or show both? Considered three shapes (subdirs, subdomains, stacked on one page), recommended subdirs (`/sydney/`, `/brisbane/`) with cross-links — minimal new infra, no JS, matches the existing one-img-per-page aesthetic. st33v picked that.

Implementation: changed `radar.service` to have two `ExecStart` lines (one per `RADAR_ID` arg), prefixed both with `-` so one radar's outage doesn't skip the other. Added `radar.sydney.html` and `radar.brisbane.html` (corner cross-link nav) and rewrote `radar.index.html` as a centred two-button landing. setup.sh provisions the subdirs and installs all three pages.
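
The shape of the resulting `[Service]` section can be shown with a small generator. The real unit was edited by hand; `emit_radar_service` is a hypothetical helper, and `Type=oneshot` is an assumption here (systemd only accepts more than one `ExecStart=` line for oneshot services).

```shell
# emit_radar_service ID...: print a [Service] section with one tolerant
# ExecStart line per radar ID. Hypothetical generator for illustration.
emit_radar_service() (
  printf '%s\n' '[Service]' 'Type=oneshot'
  for id in "$@"; do
    # Leading "-" tells systemd to treat a failure of this line as success,
    # so the next radar still runs.
    printf 'ExecStart=-/opt/radar/radarFetch.sh %s\n' "$id"
  done
)
```

`emit_radar_service IDR713 IDR663` prints the two-radar form. One consequence worth checking: with every line prefixed, a failing fetch no longer puts the unit into the failed state, so an `OnFailure=` chain would not fire for it.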

Deployed clean. Both APNGs publishing on the same 6-minute timer.

## Resolutions

- **Subdomain not subpath.** radar.pestrel.com instead of pestrel.com/radar — clean nginx separation, independent evolution.
- **APNG over GIF, despite GIF requiring no new deps.** st33v's "ffmpeg is useful" removed the dependency objection, which left GIF's size penalty as the deciding factor.
- **Sequence+manifest dropped.** Would have meant the first JS in the project. APNG keeps the black-page-one-img idiom intact.
- **Per-radar invocation of one script over a template service.** `ExecStart=` × N with `-` prefix is simpler than `radar@.service` template units for the current scale of two.
- **TLS deferred.** Plain HTTP. To revisit when adding certbot to cover both pestrel.com and radar.pestrel.com.

## Operational learnings

- **Arch ImageMagick has no APNG writer.** It does not warn; it just writes a single-frame PNG. If APNG matters, use ffmpeg or apngasm.
- **ffmpeg's apng muxer rejects palette mismatches.** Composite chains that go through palette-PNG intermediates need `-vf format=rgba` to be safe.
- **`git remote = git@host:path` for cremonde bare repos.** Not `host:path`, which logs in as st33v and can't write the git-owned object dirs.
- **systemd `ExecStart=-...` prefix** ignores the exit code of that one line — useful when a service runs several independent jobs and you want partial-success semantics.
- **The `concat` demuxer wants the last file repeated** without a `duration` line, otherwise the final frame's duration is ignored.
- **`Cache-Control: no-cache` on `.apng`** matters when the file updates every 6 minutes and the URL stays stable.

## Cheat sheet — adding a third radar

1. Confirm the IDR code (look it up on BOM's radar codes page).
2. Add `ExecStart=-/opt/radar/radarFetch.sh IDRxxx` to `systemd/radar.service` and `radar-retry.service`.
3. Write `radar.<city>.html` (copy of an existing one, swap the apng filename and the cross-link target).
4. Add `install -d /srv/www/radar/<city>` and the html install line to `setup.sh`.
5. Update `radar.index.html` to add a third button (and rebalance the gap).
6. Commit; push; pull on cremonde; `sudo ./setup.sh`.
7. `sudo systemctl start radar.service` to publish immediately rather than waiting up to 6 min for the next timer fire.

Everything else — directories, naming, compositing — is identical across radars.

## Appendix A — side-quest: temp-file race hardening

*Surfaced when a manual `/opt/radar/radarFetch.sh` invocation collided with a service-triggered run at 14:05:31. Both wrote to the same `loop.apng.tmp`; one won, the other lost with `install: cannot stat …`.*

In production only the timer should invoke the script, so this race is unlikely under normal operation. But the failure mode is silent corruption rather than a clean fail, which makes it worth hardening:

- **Option a:** `mktemp` per invocation. `TMP_APNG=$(mktemp "${OUT_DIR}/loop.XXXXXX.apng")`, then `trap 'rm -f "$TMP_APNG"' EXIT`. Minimal change; each invocation has its own temp.
- **Option b:** `flock` over the whole script. `exec 9>/var/lock/radar-${RADAR_ID,,}.lock; flock -n 9 || exit 0`. Cleaner: it prevents concurrent runs entirely, which also satisfies the spec's stated "skip a poll cleanly if the previous one is still running" requirement.

Option b is closer to what the spec asked for. Worth doing.
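
A minimal sketch of Option b as a reusable wrapper, assuming `flock` from util-linux is installed. The function name `run_exclusive` is hypothetical; the deployed script would more likely inline the two lines quoted above near its top.

```shell
# run_exclusive LOCKFILE CMD [ARGS...]: run CMD while holding an exclusive,
# non-blocking lock on LOCKFILE. If another run already holds the lock, skip
# quietly with status 0 instead of racing it.
run_exclusive() (
  exec 9>"$1" || exit 1    # open the lock file on fd 9 (subshell only)
  shift
  flock -n 9 || exit 0     # lock busy: skip this poll cleanly
  "$@"                     # lock is released when the subshell exits
)
```

With this in place a manual run during a timer-triggered run would exit immediately rather than fighting over the same temp file, which is the collision described above.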