Runs & monitoring
Every Nomaflow run leaves a trail: one row in ly2_job_runs, one row per step in ly2_step_runs, every log line in ly2_job_logs. The Settings → Jobs page is where this trail is browsed, filtered and streamed live.
This page covers the run history table, the per-run detail drawer, the live log tail, the abort flow and the retention policy.
The runs table
| Column | Description |
|---|---|
| Run id | The unique run_id (e.g. r-1842). Clickable — opens the detail drawer. |
| Job | Job name. The chip filter at the top narrows to one job. |
| Status | Coloured pill — see statuses below. |
| Started | When the run began (local time of the operator). Sortable. |
| Duration | Started → ended (or "running" with a live counter for in-flight runs). |
| Trigger | cron, manual, api, cli — and on hover the user identifier when manual. |
The table defaults to the last 7 days, sorted by Started descending. The toolbar exposes:
- A Job dropdown: narrow to one job.
- A Status dropdown: multi-select (`running`, `succeeded`, `failed`, `aborted`, `skipped`, `partial-success`).
- A Date range picker: week / month / custom.
- A ↻ Refresh button: the table also auto-refreshes every 5 s while at least one run is `running`.
Statuses
| Status | Meaning | Final? |
|---|---|---|
| `running` | The run is in flight. | No |
| `succeeded` | Every step completed without error. | Yes |
| `failed` | A step exhausted its retries and stopped the job. | Yes |
| `partial-success` | Some steps failed with continue_on_error = true; the job kept going. | Yes |
| `aborted` | An operator clicked Abort or the job-level timeout_seconds fired. | Yes |
| `skipped` | The job was due to fire but a dependency / single-instance lock prevented it. | Yes |
A row's reason column (visible in the detail drawer) carries the qualifier — dependency-failed: <name>, single-instance, previous still running, timeout after 1800s, etc.
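The final / non-final split in the table above is what a script needs when it waits for a run to settle. The following is a minimal sketch, not part of the framework: `is_final`, `wait_until_final` and the `fetch_status` callable are hypothetical names, and the caller is assumed to supply its own way of reading the current status (for example a wrapper around the single-run REST endpoint).

```python
import time
from typing import Callable

# Final statuses, copied from the table above; "running" is the only non-final one.
FINAL_STATUSES = frozenset(
    {"succeeded", "failed", "partial-success", "aborted", "skipped"}
)


def is_final(status: str) -> bool:
    """Return True once a run can no longer change state."""
    return status in FINAL_STATUSES


def wait_until_final(
    fetch_status: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> str:
    """Poll fetch_status() until it reports a final status.

    fetch_status is a hypothetical callable supplied by the caller;
    it is not a framework API.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if is_final(status):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("run did not reach a final status in time")
```

The 5 s default mirrors the auto-refresh interval of the runs table; any other interval works equally well.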
The run detail drawer
Clicking a row opens a drawer on the right with three sections.
Header
- Run id, job name, status pill, started / finished timestamps, duration, trigger.
- An Abort button (visible only while `status = running`, and only to users with `job:<name>:abort` or the global `job:*`).
- A Re-run button (visible when `status` is final) that triggers a new run with the same `params` snapshot.
Per-step timeline
A vertical list, one entry per step in declared order:
| Field | Source |
|---|---|
| Step name + type pill | From jobs.toml. |
| Status pill | succeeded / failed / running / skipped. |
| Duration | Started → ended (live counter while running). |
| Input snapshot | The params / kwargs map after all ${...} substitution. Useful when the same step ran differently across retries. |
| Output snapshot | The step result — rows_affected, row_count, the first 100 rows for SQL, the full response body for HTTP (truncated to 100 KiB). |
| Error | When status = failed, the exception class, message and the relevant traceback. |
Each step entry expands inline to show the input / output / error.
Live log tail
A scrolling pane below the timeline streams every log line from the run — both the framework's structured events (step started, step finished, retry triggered) and the messages emitted by python step callables via logging.getLogger(__name__).
The pane is streamed over Socket.IO while the run is in flight; once the run finishes, it shows the persisted log lines from ly2_job_logs. Each line carries:
- A timestamp (ms precision in the server timezone, rendered in the operator's timezone).
- A level (DEBUG / INFO / WARNING / ERROR), colour-coded.
- The logger name (e.g. `billing.invoicing`).
- The message.
The toolbar at the top of the pane exposes:
- A level filter: show only `WARNING` and `ERROR`, for example.
- A search box: substring filter on the message.
- A ↻ Follow toggle: auto-scroll to the bottom as new lines arrive.
- A Download button: exports the full log as a `.log` file.
Aborting a run
Clicking Abort in the detail drawer:
- Marks the run with `status = aborting` in the database.
- Sends an `asyncio.CancelledError` into the in-flight step task.
- The step's callable (or the framework's step executor) reacts:
  - SQL queries are cancelled at the driver level (`asyncpg` / `oracledb` both support cancel).
  - HTTP calls are aborted (the underlying connection is closed).
  - `python` step callables that use `await` checkpoints see the `CancelledError`; synchronous callables run to completion of the current iteration.
- The step is recorded with `status = aborted`.
- The job is recorded with `status = aborted` and no further steps run.
A callable that catches CancelledError silently can prevent the abort from taking effect — don't do this in plugin code. The default behaviour (let the exception propagate) is the right one.
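A callable that needs to release resources on abort can still catch the exception, as long as it re-raises. The sketch below illustrates that pattern with plain `asyncio`; `export_invoices` and `demo` are hypothetical names, and `task.cancel()` stands in for whatever the framework does internally when Abort is clicked.

```python
import asyncio


async def export_invoices(batches: list[str], done: list[str]) -> None:
    """Hypothetical step callable that processes batches one at a time."""
    try:
        for batch in batches:
            # Each await is a checkpoint where a pending cancellation is delivered.
            await asyncio.sleep(0.01)
            done.append(batch)
    except asyncio.CancelledError:
        # Clean up if needed, then re-raise so the abort takes effect.
        done.append("cleanup")
        raise


async def demo() -> tuple[bool, list[str]]:
    done: list[str] = []
    task = asyncio.create_task(export_invoices(["a", "b", "c"], done))
    await asyncio.sleep(0)  # let the task reach its first await
    task.cancel()           # stand-in for the framework's abort
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled(), done
```

Running `asyncio.run(demo())` shows the task ending in the cancelled state after its cleanup branch has run. Dropping the `raise` would let the loop of a real step continue as if nothing had happened.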
Replaying / re-running
The Re-run button on a finished run creates a new run with:
- The same `params` snapshot.
- A new `run_id`.
- `triggered_by = "user:<operator>"` and `replay_of = <original run_id>`.
The replay_of link is visible in the detail drawer as "↻ Replay of r-1840". This is useful for nightly jobs that failed because of a transient issue and need a fresh attempt without waiting for the next cron fire.
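To make the relationship between an original run and its replay concrete, here is a small illustrative model. The `Run` dataclass and `make_replay` helper are hypothetical and do not reflect the framework's actual schema; only the field names `run_id`, `params`, `triggered_by` and `replay_of` come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """Illustrative subset of a run record (not the real table layout)."""
    run_id: str
    job: str
    params: dict
    triggered_by: str
    replay_of: Optional[str] = None


def make_replay(original: Run, operator: str, new_run_id: str) -> Run:
    """Build the record a replay would carry, per the rules listed above."""
    return Run(
        run_id=new_run_id,                 # a new run_id
        job=original.job,
        params=dict(original.params),      # same params snapshot, copied
        triggered_by=f"user:{operator}",
        replay_of=original.run_id,         # link back to the original run
    )
```

For example, replaying a cron-triggered run `r-1840` as operator `alice` yields a record with `triggered_by` set to `user:alice` and `replay_of` set to `r-1840`.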
REST API
| Endpoint | Purpose |
|---|---|
| `GET /admin/jobs/runs?job=<name>&status=<list>&from=&to=&limit=` | Runs table, paginated and filtered. |
| `GET /admin/jobs/runs/<run_id>` | Single run with the full step list and the latest 1000 log lines. |
| `POST /admin/jobs/<name>/run` | Trigger a manual run. Body accepts params overrides. |
| `POST /admin/jobs/runs/<run_id>/abort` | Abort an in-flight run. |
| `POST /admin/jobs/runs/<run_id>/replay` | Re-run with the same params. |
| `GET /admin/jobs/runs/<run_id>/logs?level=&follow=` | Stream the log tail. follow=true upgrades to Socket.IO. |
Every endpoint requires the job:<name> permission for the targeted job (or job:*). The list endpoint returns only runs of jobs the caller can see — pruned automatically.
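As an illustration of the list endpoint, the sketch below builds the query string and issues the request with the Python standard library. Several details are assumptions not confirmed by this page: that `status=<list>` is comma-separated, that authentication uses a bearer token, and that the response body is JSON. The host name and the helper names `build_runs_url` and `list_runs` are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterable, Optional


def build_runs_url(
    base_url: str,
    job: Optional[str] = None,
    statuses: Iterable[str] = (),
    limit: Optional[int] = None,
) -> str:
    """Compose the URL for GET /admin/jobs/runs with optional filters."""
    query = {}
    if job:
        query["job"] = job
    statuses = list(statuses)
    if statuses:
        # Assumption: the status list is sent as a comma-separated value.
        query["status"] = ",".join(statuses)
    if limit is not None:
        query["limit"] = str(limit)
    url = base_url.rstrip("/") + "/admin/jobs/runs"
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return url


def list_runs(base_url: str, token: str, **filters):
    """Fetch the runs list. Bearer auth and a JSON body are assumptions."""
    request = urllib.request.Request(
        build_runs_url(base_url, **filters),
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

A call such as `list_runs("https://nomaflow.example", token, job="nightly-billing", statuses=["failed", "aborted"], limit=50)` would then return only the runs the caller is permitted to see.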
Retention
Run rows and log lines are pruned by an internal job (_default/cleanup-job-history) that fires once a day at 03:00. The retention policy:
| Setting | Default | Meaning |
|---|---|---|
| `[jobs] history_days` | 90 | Run rows older than N days are deleted. Step rows and log lines follow. |
| `[jobs] history_keep_failed` | true | Failed / aborted runs are kept past the retention window because they often inform an incident review. |
| `[jobs] log_truncate_kb` | 100 | Log lines past N KiB total per run are truncated (the truncated message says so explicitly). |
For a long-term audit trail, route the log stream to your central log aggregator (Loki, Splunk, Datadog) via LIBERTY_LOG_JSON=1 and treat the framework's internal history as an operational cache.
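The interaction between `history_days` and `history_keep_failed` can be read as a simple predicate. The sketch below is one interpretation of the table, not the cleanup job's actual code: `should_prune` is a hypothetical name, and measuring a run's age from its start time is an assumption, since the page does not say whether the start or the end timestamp is used.

```python
from datetime import datetime, timedelta

# Statuses protected by history_keep_failed, per the retention table.
KEPT_STATUSES = frozenset({"failed", "aborted"})


def should_prune(
    started_at: datetime,
    status: str,
    now: datetime,
    history_days: int = 90,
    history_keep_failed: bool = True,
) -> bool:
    """Return True if a run row falls outside the retention policy."""
    if history_keep_failed and status in KEPT_STATUSES:
        return False  # kept past the window for incident reviews
    # Assumption: age is measured from the run's start time.
    return started_at < now - timedelta(days=history_days)
```

With the defaults, a succeeded run that started 91 days ago is pruned, while a failed run of the same age is kept.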
The Technical dashboard
The Settings → Technical tab (gated by settings:technical) shows a live overview:
| Panel | Content |
|---|---|
| In-flight runs | Every status = running job, with elapsed time and current step. Click a row to open the detail drawer. |
| Recent failures | Last 20 failed / aborted runs. |
| Scheduler heartbeat | Last fire-time, next fire-time, queue depth. Useful to confirm the scheduler is alive on a quiet day. |
| Pool stats | Per-pool open / idle / in-use connections — surfaces pool exhaustion that would otherwise stall jobs. |
This dashboard is the first place to look when "a job didn't run" — the scheduler heartbeat tells you whether the framework even tried.
Tips & best practices
- Watch the Recent failures panel. A single failed run a week is normal; a flap (one failure per night) is worth chasing.
- Use `log.warning()` for "ok but unusual". A row that takes 10× longer than usual deserves a WARNING in the log, not silence. The level filter in the log tail makes them easy to find.
- Tag the run with a meaningful identifier. A `python` step that logs `log.info("billing.invoicing.run period=2026-05 drafts=42 dry_run=False")` makes the run history greppable.
- Don't keep your only history in the framework. Forward the JSON log to your aggregator: outages on the install side shouldn't take the audit trail with them.
- Abort, don't kill. Use the Abort button rather than `kill -9` on the framework process: the abort flow records the reason and preserves the partial state for the postmortem.
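The two logging tips can be sketched with the standard `logging` module. The function names, the `SLOW_FACTOR` threshold and the way the "usual" duration is supplied are all hypothetical; only the use of `logging.getLogger(__name__)` and the tagged message format come from this page.

```python
import logging
import time
from typing import Callable

log = logging.getLogger(__name__)

SLOW_FACTOR = 10  # hypothetical threshold: 10x the usual duration


def process_row(row_id: int, work: Callable[[], None], usual_seconds: float) -> float:
    """Run work() and emit a WARNING when it is unusually slow."""
    started = time.monotonic()
    work()
    elapsed = time.monotonic() - started
    if elapsed > SLOW_FACTOR * usual_seconds:
        log.warning(
            "slow row id=%s elapsed=%.3fs usual=%.3fs",
            row_id, elapsed, usual_seconds,
        )
    return elapsed


def announce_run(period: str, drafts: int, dry_run: bool) -> None:
    """Tag the run with greppable key=value pairs at INFO level."""
    log.info(
        "billing.invoicing.run period=%s drafts=%d dry_run=%s",
        period, drafts, dry_run,
    )
```

Passing the values as arguments rather than pre-formatting the string defers the formatting until a handler actually emits the record.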
What's next
- `jobs.toml`: the job declaration that drives every run.
- Step types: what each step records in its result.
- REST API reference → Jobs: the full endpoint contract.