Runs & monitoring

Every Nomaflow run leaves a trail: one row in ly2_job_runs, one row per step in ly2_step_runs, every log line in ly2_job_logs. The Settings → Jobs page is where this trail is browsed, filtered and streamed live.

This page covers the run history table, the per-run detail drawer, the live log tail, the abort flow and the retention policy.

At a glance

Job runs · last 7 days

Toolbar: Job ▾ · Status ▾ · ↻ Refresh

Run id	Job	Status	Started	Duration	Trigger
r-1842	billing-nightly-rebuild	running	02:00:03	2m 14s ⏱	cron
r-1841	crm-hourly-sync	succeeded	02:00:00	0m 38s	cron
r-1840	ad-sync	failed	01:30:00	1m 02s	cron
r-1839	billing-monthly-close	skipped	00:00:00	—	cron

The runs table

Column	Source
Run id	The unique `run_id` (e.g. `r-1842`). Clickable — opens the detail drawer.
Job	Job name. The chip filter at the top narrows to one job.
Status	Coloured pill — see statuses below.
Started	When the run began (local time of the operator). Sortable.
Duration	Started → ended (or "running" with a live counter for in-flight runs).
Trigger	`cron`, `manual`, `api` or `cli`. For `manual` runs, hovering shows the user identifier.

The table defaults to the last 7 days, sorted by Started descending. The toolbar exposes:

A Job dropdown — narrow to one job.
A Status dropdown — multi-select (running, succeeded, failed, aborted, skipped, partial-success).
A Date range picker — week / month / custom.
A ↻ Refresh button. The table also refreshes automatically every 5 s while at least one run is running.

Statuses

Status	Meaning	Final?
`running`	The run is in flight.	No
`succeeded`	Every step completed without error.	Yes
`failed`	A step exhausted its retries and stopped the job.	Yes
`partial-success`	Some steps failed with `continue_on_error = true`; the job kept going.	Yes
`aborted`	An operator clicked Abort or the job-level `timeout_seconds` fired.	Yes
`skipped`	The job was due to fire but a dependency / single-instance lock prevented it.	Yes

A row's reason column (visible in the detail drawer) carries the qualifier — dependency-failed: <name>, single-instance, previous still running, timeout after 1800s, etc.
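
When scripting against run data, the status table can be captured in a small lookup. The sketch below is illustrative only: `RunStatus` and `is_final` are hypothetical names, not part of Nomaflow, and only the six statuses from the table are modelled.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Run statuses from the table above (hypothetical helper, not a Nomaflow API)."""

    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PARTIAL_SUCCESS = "partial-success"
    ABORTED = "aborted"
    SKIPPED = "skipped"


def is_final(status: str) -> bool:
    """Return True for every listed status except `running`, the only non-final one.

    Raises ValueError for a status string that is not in the table.
    """
    return RunStatus(status) is not RunStatus.RUNNING
```

A polling script could, for example, stop waiting as soon as `is_final(run_status)` returns `True`.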

The run detail drawer

Clicking a row opens a drawer on the right with three sections: the header, the per-step timeline and the live log tail.

Header

Run id, job name, status pill, started / finished timestamps, duration, trigger.
An Abort button (visible only while status = running, only to users with job:<name>:abort or the global job:*).
A Re-run button (visible when status is final) that triggers a new run with the same params snapshot.

Per-step timeline

A vertical list, one entry per step in declared order:

Field	Source
Step name + type pill	From `jobs.toml`.
Status pill	`succeeded` / `failed` / `running` / `skipped`.
Duration	Started → ended (live counter while running).
Input snapshot	The `params` / `kwargs` map after all `${...}` substitution. Useful when the same step ran differently across retries.
Output snapshot	The step result — `rows_affected`, `row_count`, the first 100 rows for SQL, the full response body for HTTP (truncated to 100 KiB).
Error	When `status = failed`, the exception class, message and the relevant traceback.

Each step entry expands inline to show the input / output / error.
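
The output-snapshot limits in the table (first 100 rows for SQL, 100 KiB for HTTP bodies) can be pictured with the sketch below. It is not the framework's code: the function names, the `rows` and `body` keys and the `truncated` flag are assumptions made for illustration; only `rows_affected`, `row_count` and the two limits come from the table.

```python
MAX_SQL_ROWS = 100                 # "the first 100 rows for SQL"
MAX_HTTP_BODY_BYTES = 100 * 1024   # "truncated to 100 KiB"


def snapshot_sql(rows: list, rows_affected: int) -> dict:
    """Build a hypothetical SQL output snapshot that keeps only the first 100 rows."""
    return {
        "rows_affected": rows_affected,
        "row_count": len(rows),        # total row count, not the truncated count
        "rows": rows[:MAX_SQL_ROWS],
    }


def snapshot_http(body: bytes) -> dict:
    """Build a hypothetical HTTP output snapshot with the body cut at 100 KiB."""
    return {
        "body": body[:MAX_HTTP_BODY_BYTES],
        "truncated": len(body) > MAX_HTTP_BODY_BYTES,  # assumed flag, for illustration
    }
```

The point of the sketch is that a snapshot is a bounded preview: a step that returns 250 rows still reports `row_count = 250`, but only the first 100 rows are kept.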

Live log tail

A scrolling pane below the timeline streams every log line from the run — both the framework's structured events (step started, step finished, retry triggered) and the messages emitted by python step callables via logging.getLogger(__name__).

The pane is streamed over Socket.IO while the run is in flight; once the run finishes, it shows the persisted log lines from ly2_job_logs. Each line carries:

A timestamp (millisecond precision; stored in the server timezone, rendered in the operator's timezone).
A level (DEBUG / INFO / WARNING / ERROR), colour-coded.
The logger name (e.g. billing.invoicing).
The message.

The toolbar at the top of the pane exposes:

A level filter — show only WARNING and ERROR, for example.
A search box — substring filter on the message.
A ↻ Follow toggle — auto-scroll to the bottom as new lines arrive.
A Download button — exports the full log as a .log file.
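
The same level and substring filters are easy to reproduce on a downloaded log. The sketch below assumes the lines have already been parsed into the four fields listed above; `LogLine` and `filter_lines` are hypothetical names, and the case-sensitive substring match is an assumption about how the search box behaves.

```python
from dataclasses import dataclass

ALL_LEVELS = frozenset({"DEBUG", "INFO", "WARNING", "ERROR"})


@dataclass(frozen=True)
class LogLine:
    """One log line with the four fields described above (hypothetical type)."""

    timestamp: str
    level: str
    logger: str
    message: str


def filter_lines(lines, levels=ALL_LEVELS, search=""):
    """Keep lines whose level is selected and whose message contains `search`.

    An empty `search` matches every message; matching is case-sensitive here.
    """
    return [
        line for line in lines
        if line.level in levels and search in line.message
    ]
```

For example, `filter_lines(lines, levels={"WARNING", "ERROR"}, search="timeout")` mirrors selecting WARNING and ERROR in the level filter and typing `timeout` in the search box.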

Aborting a run

Clicking Abort in the detail drawer:

Marks the run with status = aborting in the database.
Cancels the in-flight step task, which raises asyncio.CancelledError inside it at its next await.
The step's callable (or the framework's step executor) reacts:
- SQL queries are cancelled at the driver level (asyncpg / oracledb both support cancel).
- HTTP calls are aborted (the underlying connection is closed).
- python step callables that use await checkpoints see the CancelledError; synchronous callables run to completion of the current iteration.
The step is recorded with status = aborted.
The job is recorded with status = aborted and no further steps run.

A callable that catches CancelledError silently can prevent the abort from taking effect — don't do this in plugin code. The default behaviour (let the exception propagate) is the right one.
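
A cooperative step callable can be sketched with plain `asyncio`. The example does not use any Nomaflow API: `export_batches` and `run_and_abort` are hypothetical names, and `task.cancel()` stands in for whatever the framework does when Abort is clicked. It shows the two points made above: cancellation lands at an await checkpoint, and a handler that needs to clean up must re-raise.

```python
import asyncio


async def export_batches(batches, processed):
    """Hypothetical async step callable with one await checkpoint per batch."""
    try:
        for batch in batches:
            await asyncio.sleep(0)   # await checkpoint: a pending cancel is raised here
            processed.append(batch)
    except asyncio.CancelledError:
        # Release resources if needed, then re-raise so the abort takes effect.
        processed.append("cleanup")
        raise


async def run_and_abort():
    """Start the step, cancel it, and report whether the task ended up cancelled."""
    processed = []
    task = asyncio.create_task(export_batches(range(1000), processed))
    await asyncio.sleep(0)           # give the step a chance to start
    task.cancel()                    # stand-in for the Abort button
    try:
        await task
    except asyncio.CancelledError:
        pass                         # expected: the step propagated the cancellation
    return task.cancelled(), processed
```

Calling `asyncio.run(run_and_abort())` returns a cancelled task whose last recorded action is the cleanup marker. If the `raise` in the handler were removed, the loop would end normally and the task would not be reported as cancelled, which is the failure mode the warning above describes.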

Replaying / re-running

The Re-run button on a finished run creates a new run with:

The same params snapshot.
A new run_id.
triggered_by = "user:<operator>" and replay_of = <original run_id>.

The replay_of link is visible in the detail drawer as "↻ Replay of r-1840". This is useful for nightly jobs that failed because of a transient issue and need a fresh attempt without waiting for the next cron fire.
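
The relationship between an original run and its replay can be sketched as a small data structure. The `Run` type and the `replay` function are hypothetical; only the field names `run_id`, `params`, `triggered_by` and `replay_of` come from the text above, and how the framework allocates the new `run_id` is not specified here, so the caller supplies it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    """Minimal, hypothetical view of a run row."""

    run_id: str
    job: str
    params: dict
    triggered_by: str
    replay_of: Optional[str] = None


def replay(original: Run, new_run_id: str, operator: str) -> Run:
    """Create the run a Re-run click would produce, per the three rules above."""
    return Run(
        run_id=new_run_id,                    # a new run_id
        job=original.job,
        params=dict(original.params),         # same params snapshot (copied)
        triggered_by=f"user:{operator}",
        replay_of=original.run_id,            # link back to the original run
    )
```

Replaying `r-1840` as operator `alice` would give a run with `triggered_by = "user:alice"` and `replay_of = "r-1840"`, while the original row stays untouched.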

REST API

Endpoint	Purpose
`GET /admin/jobs/runs?job=<name>&status=<list>&from=&to=&limit=`	Runs table — paginated, filtered.
`GET /admin/jobs/runs/<run_id>`	Single run with the full step list and the latest 1000 log lines.
`POST /admin/jobs/<name>/run`	Trigger a manual run. Body accepts `params` overrides.
`POST /admin/jobs/runs/<run_id>/abort`	Abort an in-flight run.
`POST /admin/jobs/runs/<run_id>/replay`	Re-run with the same params.
`GET /admin/jobs/runs/<run_id>/logs?level=&follow=`	Stream the log tail. `follow=true` upgrades to Socket.IO.

Every endpoint requires the job:<name> permission for the targeted job (or job:*). The list endpoint returns only runs of jobs the caller can see; runs of other jobs are filtered out automatically.
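
A client script might assemble the list-endpoint URL as sketched below. Only the path and the parameter names (`job`, `status`, `from`, `to`, `limit`) come from the table; the comma-separated encoding of `status`, the date format and the host name are assumptions, so check the REST API reference before relying on them. The helper `runs_list_url` is hypothetical.

```python
from urllib.parse import urlencode


def runs_list_url(base_url, job=None, statuses=(), date_from=None, date_to=None, limit=None):
    """Build a GET /admin/jobs/runs URL, skipping filters that are not set."""
    query = {}
    if job:
        query["job"] = job
    if statuses:
        query["status"] = ",".join(statuses)   # assumed list encoding
    if date_from:
        query["from"] = date_from               # assumed: caller passes a ready-made string
    if date_to:
        query["to"] = date_to
    if limit is not None:
        query["limit"] = limit
    url = base_url.rstrip("/") + "/admin/jobs/runs"
    return url + ("?" + urlencode(query) if query else "")
```

For instance, `runs_list_url("https://nomaflow.example", job="ad-sync", statuses=["failed", "aborted"], limit=50)` yields a URL that narrows the table to failed and aborted `ad-sync` runs.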

Retention

Run rows and log lines are pruned by an internal job (_default/cleanup-job-history) that fires once a day at 03:00. The retention policy:

Setting	Default	Meaning
`[jobs] history_days`	`90`	Run rows older than N days are deleted. The matching step rows and log lines are deleted with them.
`[jobs] history_keep_failed`	`true`	Failed / aborted runs are kept past the retention window — they often inform an incident review.
`[jobs] log_truncate_kb`	`100`	Log lines past N KiB total per run are truncated (the truncated message says so explicitly).

For a long-term audit trail, route the log stream to your central log aggregator (Loki, Splunk, Datadog) via LIBERTY_LOG_JSON=1 and treat the framework's internal history as an operational cache.
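
The interaction between `history_days` and `history_keep_failed` can be expressed as a single predicate. This is a sketch of the policy as described in the table, not the cleanup job's actual code: `is_prunable` is a hypothetical name, and measuring a run's age from its start time is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_prunable(status, started_at, now, history_days=90, history_keep_failed=True):
    """Return True when a run row falls outside the retention policy.

    Assumes age is measured from `started_at`; defaults mirror the table above.
    """
    if started_at >= now - timedelta(days=history_days):
        return False                 # still inside the retention window
    if history_keep_failed and status in {"failed", "aborted"}:
        return False                 # kept for incident reviews
    return True
```

With the defaults, a `succeeded` run that started 91 days ago is prunable, while a `failed` run of the same age is kept until `history_keep_failed` is set to `false`.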

The Technical dashboard

The Settings → Technical tab (gated by settings:technical) shows a live overview:

Panel	Content
In-flight runs	Every `status = running` job, with elapsed time and current step. Click a row to open the detail drawer.
Recent failures	Last 20 `failed` / `aborted` runs.
Scheduler heartbeat	Last fire-time, next fire-time, queue depth. Useful to confirm the scheduler is alive on a quiet day.
Pool stats	Per-pool open / idle / in-use connections — surfaces pool exhaustion that would otherwise stall jobs.

This dashboard is the first place to look when "a job didn't run" — the scheduler heartbeat tells you whether the framework even tried.

Tips & best practices

Watch the Recent failures panel. A single failed run a week is normal; a flap (one failure per night) is worth chasing.
Use log.warning() for "ok but unusual". A row that takes 10× longer than usual deserves a WARNING in the log, not silence. The level filter in the log tail makes them easy to find.
Tag the run with a meaningful identifier. A python step that logs log.info("billing.invoicing.run period=2026-05 drafts=42 dry_run=False") makes the run history greppable.
Don't keep your only history in the framework. Forward the JSON log to your aggregator — outages on the install side shouldn't take the audit trail with them.
Abort, don't kill. Use the Abort button rather than kill -9 on the framework process — the abort flow records the reason and preserves the partial state for the postmortem.
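
The two logging tips can be combined in one step callable, sketched below with the standard `logging` module. `run_invoicing`, its parameters and the injectable `clock` are hypothetical; the logger name and the tag format follow the examples in the tips, and the 10× threshold is the one mentioned above.

```python
import logging
import time

log = logging.getLogger("billing.invoicing")


def run_invoicing(period, drafts, handle, usual_seconds, dry_run=False, clock=time.monotonic):
    """Process drafts, tag the run in the log and warn about unusually slow items.

    Returns the number of drafts that took more than 10x `usual_seconds`.
    """
    # One greppable line that identifies the run.
    log.info("billing.invoicing.run period=%s drafts=%d dry_run=%s", period, len(drafts), dry_run)
    slow = 0
    for draft in drafts:
        started = clock()
        handle(draft)
        elapsed = clock() - started
        if elapsed > 10 * usual_seconds:
            # "OK but unusual": visible with the WARNING level filter.
            log.warning("slow draft id=%s elapsed=%.3fs usual=%.3fs", draft, elapsed, usual_seconds)
            slow += 1
    return slow
```

The `clock` parameter exists only to make the timing testable; a real callable would leave it at the default.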

What's next

jobs.toml — the job declaration that drives every run.
Step types — what each step records in its result.
REST API reference → Jobs — the full endpoint contract.
