
Runs & monitoring

Every Nomaflow run leaves a trail: one row in ly2_job_runs, one row per step in ly2_step_runs, every log line in ly2_job_logs. The Settings → Jobs page is where this trail is browsed, filtered and streamed live.

This page covers the run history table, the per-run detail drawer, the live log tail, the abort flow and the retention policy.


At a glance

Job runs · last 7 days
Toolbar: Job ▾, Status ▾, ↻ Refresh

| Run id | Job | Status | Started | Duration | Trigger |
| --- | --- | --- | --- | --- | --- |
| r-1842 | billing-nightly-rebuild | running | 02:00:03 | 2m 14s ⏱ | cron |
| r-1841 | crm-hourly-sync | succeeded | 02:00:00 | 0m 38s | cron |
| r-1840 | ad-sync | failed | 01:30:00 | 1m 02s | cron |
| r-1839 | billing-monthly-close | skipped | 00:00:00 | | cron |

The runs table

| Column | Source |
| --- | --- |
| Run id | The unique run_id (e.g. r-1842). Clickable: opens the detail drawer. |
| Job | Job name. The chip filter at the top narrows to one job. |
| Status | Coloured pill; see statuses below. |
| Started | When the run began, in the operator's local time. Sortable. |
| Duration | Started → ended (or "running" with a live counter for in-flight runs). |
| Trigger | cron, manual, api or cli. On hover, shows the user identifier when the trigger is manual. |

The table defaults to the last 7 days, sorted by Started descending. The toolbar exposes:

  • A Job dropdown — narrow to one job.
  • A Status dropdown — multi-select (running, succeeded, failed, aborted, skipped, partial-success).
  • A Date range picker — week / month / custom.
  • A ↻ Refresh button. The table also refreshes automatically every 5 s while at least one run is running.

Statuses

| Status | Meaning | Final? |
| --- | --- | --- |
| running | The run is in flight. | No |
| succeeded | Every step completed without error. | Yes |
| failed | A step exhausted its retries and stopped the job. | Yes |
| partial-success | Some steps failed with continue_on_error = true; the job kept going. | Yes |
| aborted | An operator clicked Abort or the job-level timeout_seconds fired. | Yes |
| skipped | The job was due to fire but a dependency / single-instance lock prevented it. | Yes |

A row's reason column (visible in the detail drawer) carries the qualifier — dependency-failed: <name>, single-instance, previous still running, timeout after 1800s, etc.
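
For scripts that consume the run history, the status table can be mirrored in a small enum. This is a minimal sketch: the names `RunStatus` and `is_final` are illustrative and not part of the framework; only the status strings come from the table above.

```python
from enum import Enum


class RunStatus(str, Enum):
    """Run statuses as listed in the table above (names are illustrative)."""

    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    PARTIAL_SUCCESS = "partial-success"
    ABORTED = "aborted"
    SKIPPED = "skipped"


# Per the table, every status except "running" is final.
FINAL_STATUSES = frozenset(RunStatus) - {RunStatus.RUNNING}


def is_final(status: str) -> bool:
    """Return True once a run can no longer change state."""
    return RunStatus(status) in FINAL_STATUSES
```

A poller can use `is_final(row["status"])` to decide when to stop refreshing a run.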


The run detail drawer

Clicking a row opens a drawer on the right with three sections: a summary header, the per-step timeline and the live log tail. The header shows:

  • Run id, job name, status pill, started / finished timestamps, duration, trigger.
  • An Abort button (visible only while status = running, only to users with job:<name>:abort or the global job:*).
  • A Re-run button (visible when status is final) that triggers a new run with the same params snapshot.

Per-step timeline

A vertical list, one entry per step in declared order:

| Field | Source |
| --- | --- |
| Step name + type pill | From jobs.toml. |
| Status pill | succeeded / failed / running / skipped / aborted. |
| Duration | Started → ended (live counter while running). |
| Input snapshot | The params / kwargs map after all ${...} substitution. Useful when the same step ran differently across retries. |
| Output snapshot | The step result: rows_affected, row_count, the first 100 rows for SQL, the full response body for HTTP (truncated to 100 KiB). |
| Error | When status = failed, the exception class, message and the relevant traceback. |

Each step entry expands inline to show the input / output / error.

Live log tail

A scrolling pane below the timeline streams every log line from the run — both the framework's structured events (step started, step finished, retry triggered) and the messages emitted by python step callables via logging.getLogger(__name__).

The pane is streamed over Socket.IO while the run is in flight; once the run finishes, it shows the persisted log lines from ly2_job_logs. Each line carries:

  • A timestamp (millisecond precision, stored in the server timezone and rendered in the operator's timezone).
  • A level (DEBUG / INFO / WARNING / ERROR), colour-coded.
  • The logger name (e.g. billing.invoicing).
  • The message.
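
As an illustration of how a python step feeds this pane, the sketch below logs through `logging.getLogger(__name__)`, so each record carries the level, logger name and message listed above. The function `close_period` and its parameters are hypothetical; this page does not show the real signature of a step callable.

```python
import logging

# Module-level logger, as recommended above; its name appears in the log tail.
log = logging.getLogger(__name__)


def close_period(period: str, drafts: int, dry_run: bool = False) -> dict:
    """Hypothetical step callable; the real step signature is an assumption."""
    log.info("billing.invoicing.run period=%s drafts=%d dry_run=%s",
             period, drafts, dry_run)
    if drafts == 0:
        # "OK but unusual" situations are a good fit for WARNING.
        log.warning("no drafts found for period=%s", period)
    return {"rows_affected": drafts}
```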

The toolbar at the top of the pane exposes:

  • A level filter — show only WARNING and ERROR, for example.
  • A search box — substring filter on the message.
  • A ↻ Follow toggle — auto-scroll to the bottom as new lines arrive.
  • A Download button — exports the full log as a .log file.
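
The level filter and search box can be reproduced offline on a downloaded or API-fetched log. The sketch below assumes each line is a dict with `level`, `logger` and `message` keys and that the search is a case-sensitive substring match; both are assumptions, since this page does not specify the payload shape.

```python
from typing import Iterable, Iterator, Optional, Set


def filter_lines(
    lines: Iterable[dict],
    levels: Optional[Set[str]] = None,
    search: Optional[str] = None,
) -> Iterator[dict]:
    """Yield log lines that pass both the level filter and the substring search."""
    for line in lines:
        if levels is not None and line["level"] not in levels:
            continue
        if search and search not in line["message"]:
            continue
        yield line


# Illustrative data only, not real framework output.
sample = [
    {"level": "INFO", "logger": "billing.invoicing", "message": "step started"},
    {"level": "WARNING", "logger": "billing.invoicing", "message": "retry triggered"},
    {"level": "ERROR", "logger": "billing.invoicing", "message": "step failed"},
]
visible = list(filter_lines(sample, levels={"WARNING", "ERROR"}, search="retry"))
```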

Aborting a run

Clicking Abort in the detail drawer:

  1. Marks the run with status = aborting in the database.
  2. Cancels the in-flight step task, which raises asyncio.CancelledError inside it.
  3. The step's callable (or the framework's step executor) reacts:
    • SQL queries are cancelled at the driver level (asyncpg / oracledb both support cancel).
    • HTTP calls are aborted (the underlying connection is closed).
    • python step callables that use await checkpoints see the CancelledError; synchronous callables run to completion of the current iteration.
  4. The step is recorded with status = aborted.
  5. The job is recorded with status = aborted and no further steps run.

A callable that catches CancelledError silently can prevent the abort from taking effect — don't do this in plugin code. The default behaviour (let the exception propagate) is the right one.
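
A cooperative async callable might look like the sketch below. It uses plain asyncio only: `rebuild_batches` is a hypothetical step, and `task.cancel()` stands in for whatever the framework does on abort. The point is the pattern: clean up if needed, then re-raise so the cancellation propagates.

```python
import asyncio


async def rebuild_batches(batches: list, done: list) -> None:
    """Hypothetical async step that processes batches one by one."""
    try:
        for batch in batches:
            await asyncio.sleep(0.01)  # await checkpoint: cancellation lands here
            done.append(batch)
    except asyncio.CancelledError:
        # Release resources if needed, then re-raise so the abort takes effect.
        done.append("cleanup")
        raise


async def main() -> list:
    done: list = []
    task = asyncio.create_task(rebuild_batches(list(range(100)), done))
    await asyncio.sleep(0.035)  # let a few batches run
    task.cancel()  # stands in for the framework's abort signal
    try:
        await task
    except asyncio.CancelledError:
        pass  # expected: the step let the cancellation propagate
    return done
```

Running `asyncio.run(main())` returns the batches processed before the cancel, followed by the `"cleanup"` marker. Replacing `raise` with `pass` in the step is the anti-pattern described above.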


Replaying / re-running

The Re-run button on a finished run creates a new run with:

  • The same params snapshot.
  • A new run_id.
  • triggered_by = "user:<operator>" and replay_of = <original run_id>.

The replay_of link is visible in the detail drawer as "↻ Replay of r-1840". This is useful for nightly jobs that failed because of a transient issue and need a fresh attempt without waiting for the next cron fire.
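
The relationship between a run and its replay can be sketched as a record copy. The `Run` dataclass and `replay` helper are illustrative only; the field names `run_id`, `triggered_by` and `replay_of` come from this page, the rest is assumed.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Run:
    """Illustrative subset of a run row; not the framework's real schema."""

    run_id: str
    job: str
    params: dict
    triggered_by: str
    replay_of: Optional[str] = None


def replay(original: Run, new_run_id: str, operator: str) -> Run:
    """Build the replay record: same params snapshot, new id, link to the original."""
    return replace(
        original,
        run_id=new_run_id,
        params=dict(original.params),  # copy so the snapshots stay independent
        triggered_by=f"user:{operator}",
        replay_of=original.run_id,
    )
```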


REST API

| Endpoint | Purpose |
| --- | --- |
| GET /admin/jobs/runs?job=<name>&status=<list>&from=&to=&limit= | Runs table, paginated and filtered. |
| GET /admin/jobs/runs/<run_id> | Single run with the full step list and the latest 1000 log lines. |
| POST /admin/jobs/<name>/run | Trigger a manual run. Body accepts params overrides. |
| POST /admin/jobs/runs/<run_id>/abort | Abort an in-flight run. |
| POST /admin/jobs/runs/<run_id>/replay | Re-run with the same params. |
| GET /admin/jobs/runs/<run_id>/logs?level=&follow= | Stream the log tail. follow=true upgrades to Socket.IO. |

Every endpoint requires the job:<name> permission for the targeted job (or job:*). The list endpoint returns only runs of jobs the caller can see — pruned automatically.
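
A client script could assemble these URLs as in the sketch below. The host in `BASE_URL` is a placeholder, and encoding `status=<list>` as a comma-separated value is an assumption; authentication is not covered on this page and is left out.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode

BASE_URL = "https://nomaflow.example.com"  # hypothetical host


def runs_url(
    job: Optional[str] = None,
    statuses: Sequence[str] = (),
    limit: Optional[int] = None,
) -> str:
    """Build the URL for the runs list endpoint."""
    query = {}
    if job:
        query["job"] = job
    if statuses:
        # Assumption: the status list is sent comma-separated.
        query["status"] = ",".join(statuses)
    if limit is not None:
        query["limit"] = limit
    qs = urlencode(query, safe=",")
    return f"{BASE_URL}/admin/jobs/runs" + (f"?{qs}" if qs else "")


def abort_url(run_id: str) -> str:
    """Build the URL for the abort endpoint (to be called with POST)."""
    return f"{BASE_URL}/admin/jobs/runs/{quote(run_id, safe='')}/abort"
```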


Retention

Run rows and log lines are pruned by an internal job (_default/cleanup-job-history) that fires once a day at 03:00. The retention policy:

| Setting | Default | Meaning |
| --- | --- | --- |
| [jobs] history_days | 90 | Run rows older than N days are deleted. Step rows and log lines follow. |
| [jobs] history_keep_failed | true | Failed / aborted runs are kept past the retention window, since they often inform an incident review. |
| [jobs] log_truncate_kb | 100 | Log lines past N KiB total per run are truncated (the truncated message says so explicitly). |

For a long-term audit trail, route the log stream to your central log aggregator (Loki, Splunk, Datadog) via LIBERTY_LOG_JSON=1 and treat the framework's internal history as an operational cache.
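
The pruning rule implied by the first two settings can be written as a predicate. This is a sketch of the documented policy, not the cleanup job's actual code; the function name and the choice to never prune in-flight runs are assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_prunable(
    started: datetime,
    status: str,
    now: datetime,
    history_days: int = 90,
    history_keep_failed: bool = True,
) -> bool:
    """Return True if a run row falls outside the retention policy."""
    if status == "running":
        return False  # assumption: in-flight runs are never pruned
    if history_keep_failed and status in {"failed", "aborted"}:
        return False  # kept past the window for incident reviews
    return started < now - timedelta(days=history_days)


# Example: a succeeded run from 120 days ago is past the default 90-day window.
now = datetime(2026, 6, 1, 3, 0, tzinfo=timezone.utc)
old_run_started = now - timedelta(days=120)
prune_it = is_prunable(old_run_started, "succeeded", now)
```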


The Technical dashboard

The Settings → Technical tab (gated by settings:technical) shows a live overview:

| Panel | Content |
| --- | --- |
| In-flight runs | Every status = running job, with elapsed time and current step. Click a row to open the detail drawer. |
| Recent failures | Last 20 failed / aborted runs. |
| Scheduler heartbeat | Last fire-time, next fire-time, queue depth. Useful to confirm the scheduler is alive on a quiet day. |
| Pool stats | Per-pool open / idle / in-use connections. Surfaces pool exhaustion that would otherwise stall jobs. |

This dashboard is the first place to look when "a job didn't run" — the scheduler heartbeat tells you whether the framework even tried.


Tips & best practices

  • Watch the Recent failures panel. A single failed run a week is normal; a flap (one failure per night) is worth chasing.
  • Use log.warning() for "ok but unusual". A row that takes 10× longer than usual deserves a WARNING in the log, not silence. The level filter in the log tail makes them easy to find.
  • Tag the run with a meaningful identifier. A python step that logs log.info("billing.invoicing.run period=2026-05 drafts=42 dry_run=False") makes the run history greppable.
  • Don't keep your only history in the framework. Forward the JSON log to your aggregator — outages on the install side shouldn't take the audit trail with them.
  • Abort, don't kill. Use the Abort button rather than kill -9 on the framework process — the abort flow records the reason and preserves the partial state for the postmortem.
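
The "ok but unusual" tip can be sketched as a small helper that compares a measured duration against a known baseline. The helper name, the baseline source and the message format are all illustrative; only the 10× threshold and the use of WARNING come from the tip above.

```python
import logging

log = logging.getLogger(__name__)


def warn_if_slow(elapsed_s: float, baseline_s: float, factor: float = 10.0) -> bool:
    """Log a WARNING when a unit of work took more than `factor` times its baseline."""
    if elapsed_s > factor * baseline_s:
        log.warning(
            "slow step elapsed=%.1fs baseline=%.1fs factor=%.0fx",
            elapsed_s, baseline_s, factor,
        )
        return True
    return False
```

Inside a step, the elapsed time would typically be measured with `time.monotonic()` around the work being timed.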

What's next